This series introduces the basic concepts of statistics, one of a data scientist’s most valuable tools. It is intended for complete beginners with no prior experience in advanced mathematics. Let’s start with a term you will come across very frequently: variable.
What is a variable?
A variable is a characteristic whose value can differ from one observation to the next.
- If I ask a group of people what their eye color is, the variable is eye color, and it will vary: some people will say blue, some brown, and so on.
- If I ask people how tall they are, they might say 60 inches, or 5 feet 9 inches, and so on. Here, height is the variable.
We need to know what kind of variable we have in order to know what kind of statistic to use. One way to do that is to split variables into independent variables and dependent variables. Then we can say: with that kind of variable, I can do this kind of statistical test.
So what is the difference between an independent variable and a dependent variable?
One of them is the cause and the other is the effect: the independent variable is the cause, and the dependent variable is the effect.
- For example, suppose my two variables are how many calories you eat a day and your weight. Your weight doesn’t cause how many calories you eat every day.
- It’s the other way round: how many calories you eat causes your weight.
- So in that case calories is the independent variable, because it is causing something, and weight is the dependent variable: it is the outcome, the effect of how you eat.
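To make the two roles concrete, here is a minimal Python sketch. The numbers are invented purely for illustration, and the helper names `mean` and `slope` are my own, not part of any library. The independent variable goes in as `x`, the dependent variable as `y`, and the slope describes how much `y` changes, on average, for each one-unit change in `x`.

```python
# Invented numbers for illustration only; these are not real measurements.
daily_calories = [1800, 2000, 2200, 2400, 2600]  # independent variable (the cause)
weight_kg = [60.0, 64.0, 68.0, 72.0, 76.0]       # dependent variable (the effect)


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def slope(x, y):
    """Average change in y per one-unit change in x (least-squares slope)."""
    mean_x, mean_y = mean(x), mean(y)
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    denominator = sum((xi - mean_x) ** 2 for xi in x)
    return numerator / denominator


kg_per_calorie = slope(daily_calories, weight_kg)
print(f"100 extra calories a day goes with about {kg_per_calorie * 100:.1f} kg more weight")
```

Note that the arithmetic works just as well with the two lists swapped. Deciding which variable is the cause comes from what we know about the situation, not from the calculation.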
Types of Data:
Numeric data have meaning as a measurement, such as a person’s height, weight, IQ, or blood pressure; or they’re a count, such as the number of stock shares a person owns, how many teeth a dog has, or how many pages you can read of your favorite book before you fall asleep. Statisticians also call numerical data quantitative data.
Numeric data are further broken down into:
- Discrete: Discrete data represent items that can be counted; they take on possible values that can be listed out. The list of possible values may be fixed (also called finite); or it may go from 0, 1, 2, on to infinity (making it countably infinite).
- Continuous: Continuous data represent measurements; their possible values cannot be counted and can only be described using intervals on the real number line.
For example, your shoe size is discrete and the length of your foot is continuous.
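The shoe example can be sketched in a few lines of Python. The values are made up, and the half-size rule in `is_valid_shoe_size` is an assumption about one common sizing scheme. The point is only that discrete values land on a list of allowed steps, while continuous values can fall anywhere in between.

```python
# Hypothetical values for illustration only.
shoe_sizes = [7, 8.5, 9, 9, 10.5]                      # discrete: whole and half sizes
foot_lengths_cm = [24.13, 26.02, 26.48, 26.51, 27.94]  # continuous: any value in a range


def is_valid_shoe_size(size):
    """True if the value lands on a half-size step (assumed sizing scheme)."""
    return size * 2 == int(size * 2)


# Every shoe size falls on the grid of allowed values...
print(all(is_valid_shoe_size(s) for s in shoe_sizes))
# ...while foot lengths fall between the steps.
print(any(is_valid_shoe_size(x) for x in foot_lengths_cm))
```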
Categorical data represent characteristics such as a person’s gender, marital status, hometown, or the types of movies they like.
The level of measurement is a key characteristic of any variable. There are four levels of measurement: nominal, ordinal, interval, and ratio.
Nominal variables are organized into non-numeric categories that cannot be ranked or compared quantitatively. This type of data is often referred to as qualitative.
○ Appropriate mathematical operation: counting the number of cases per category.
- Nominal means that the variable only tells us which category an observation belongs to.
- There is no ordering among the categories. Take eye color: yours may be brown, blue, green, or some other color, but no color is more or less than another.
- Another example: jersey numbers for athletes, which identify players rather than measure anything.
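Counting cases per category, the one operation that fits nominal data, might look like this in Python. The eye colors are made-up responses, and `collections.Counter` from the standard library does the counting.

```python
from collections import Counter

# Made-up survey responses for illustration only.
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

# Count how many cases fall into each category.
counts = Counter(eye_colors)
print(counts)

# The most frequent category (the mode) is meaningful for nominal data;
# an "average eye color" is not.
most_common_color, frequency = counts.most_common(1)[0]
print(most_common_color, frequency)
```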
Ordinal variables are organized into rankable categories.
○ Appropriate mathematical operations: counting and ranking.
- For example: How was the service at a restaurant?
- Poor, fair, good, very good, excellent. That’s an ordinal scale.
- Another example: Rank order of winners.
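Here is a small Python sketch of ranking ordinal data. The responses are invented, and `RATING_ORDER` is simply the restaurant scale written from worst to best. Because the categories have an order, we can sort them and pick the middle response, but we still cannot say how much better "good" is than "fair".

```python
# The scale, written from worst to best.
RATING_ORDER = ["poor", "fair", "good", "very good", "excellent"]
rank = {label: position for position, label in enumerate(RATING_ORDER)}

# Invented responses for illustration only.
responses = ["good", "excellent", "poor", "very good", "good", "fair", "good"]

# Ranking: sort the responses by their position on the scale.
sorted_responses = sorted(responses, key=lambda label: rank[label])
print(sorted_responses)

# With an odd number of responses, the middle one is the median rating.
median_response = sorted_responses[len(sorted_responses) // 2]
print(median_response)
```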
Interval variables have equal, exact intervals between values, so differences can be compared directly: the difference between any two adjacent points on the scale is exactly the same as the difference between any other two adjacent points.
○ Appropriate mathematical operations: counting, ordering, and addition, subtraction, multiplication and division of the interval between values (but not the values themselves).
- Example: time of day, such as 10:00 am, 10:20 am, noon, 4:00 pm, 8:00 pm, etc. Here we can say that 10:20 is exactly 20 minutes later than 10:00, but we can’t say that 8:00 is “twice as late” as 4:00, and it doesn’t make sense to add noon + 4:00.
- Another example: the Fahrenheit and Celsius scales of temperature. You can talk about 30 degrees being 60 degrees less than 90 degrees, so differences do make sense. However, 0 degrees (on either scale), cold as it may be, does not represent the total absence of temperature.
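Both examples can be checked with a short Python sketch. The calendar date passed to `datetime` is arbitrary, and `c_to_f` is a helper of my own that applies the standard Celsius-to-Fahrenheit conversion. Differences come out consistent, but the "twice as warm" ratio changes with the scale, which is why ratios of interval values are not meaningful.

```python
from datetime import datetime

# Differences between times of day are meaningful (the date itself is arbitrary).
ten_oclock = datetime(2024, 1, 1, 10, 0)
ten_twenty = datetime(2024, 1, 1, 10, 20)
minutes_apart = (ten_twenty - ten_oclock).total_seconds() / 60
print(minutes_apart)


def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Equal steps in Celsius are also equal steps in Fahrenheit...
step_one = c_to_f(20) - c_to_f(10)
step_two = c_to_f(30) - c_to_f(20)
print(step_one, step_two)

# ...but the ratio depends on the scale, so 20 degrees is not "twice as warm" as 10.
ratio_in_celsius = 20 / 10
ratio_in_fahrenheit = c_to_f(20) / c_to_f(10)
print(ratio_in_celsius, ratio_in_fahrenheit)
```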
Ratio variables have all the characteristics of nominal, ordinal and interval variables, but also have a meaningful zero point.
○ Appropriate mathematical operations: counting, ordering, and addition, subtraction, multiplication and division of the interval between values as well as the values themselves.
- Due to the presence of a zero, it now makes sense to compare the ratios of measurements. Phrases such as “four times” and “twice” are meaningful at the ratio level.
- Distances, in any system of measurement, give us data at the ratio level. A measurement such as 0 feet does make sense, as it represents no length. Furthermore, 2 feet is twice as long as 1 foot. So, ratios can be formed between the data.
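As a quick check, here is a Python sketch showing that a ratio of distances stays the same whichever unit we use. The variable names are my own, and the conversion factor is the standard 0.3048 meters per foot.

```python
import math

FEET_TO_METERS = 0.3048  # standard conversion factor

long_ft, short_ft = 2.0, 1.0

# The ratio in feet...
ratio_in_feet = long_ft / short_ft

# ...is the same ratio after converting both lengths to meters.
ratio_in_meters = (long_ft * FEET_TO_METERS) / (short_ft * FEET_TO_METERS)

print(ratio_in_feet, ratio_in_meters)
print(math.isclose(ratio_in_feet, ratio_in_meters))

# Zero means "no length" in every unit, which is what makes the ratio meaningful.
print(0 * FEET_TO_METERS == 0)
```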
Find me on LinkedIn: https://www.linkedin.com/in/biraj-parikh-ab5622103/