Descriptive Statistics - Data Science Cheatsheet

Arithmetic Mean

What is Arithmetic Mean?

The arithmetic mean (average) is the sum of all values divided by the count. It's the most common measure of central tendency.

Sample Mean Formula

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

where:
• $\bar{x}$ - sample mean (x-bar)
• n - number of observations
• x_i - individual data point

Properties:
• Sensitive to outliers (extreme values pull the mean)
• Best for symmetric distributions
• Minimizes sum of squared deviations
• Pulled toward the long tail in skewed distributions
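A minimal sketch of the properties above, using Python's standard `statistics` module. The data values and the helper `sum_sq_dev` are made up for illustration only.

```python
from statistics import mean, median

# Hypothetical sample values, chosen only for illustration.
data = [2, 4, 4, 5, 5]
with_outlier = data + [100]

mean_clean = mean(data)                # (2 + 4 + 4 + 5 + 5) / 5
mean_outlier = mean(with_outlier)      # a single extreme value shifts the mean a lot
median_outlier = median(with_outlier)  # the median barely moves


def sum_sq_dev(values, center):
    """Sum of squared deviations of the values from a chosen center."""
    return sum((x - center) ** 2 for x in values)


# The mean gives a smaller sum of squared deviations than another center does.
sse_at_mean = sum_sq_dev(data, mean_clean)
sse_elsewhere = sum_sq_dev(data, 4.5)

print(mean_clean, mean_outlier, median_outlier)
print(sse_at_mean, sse_elsewhere)
```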

Mathematical Expectation (Expected Value)

What is Expected Value?

Expected value is the theoretical mean of a probability distribution - the long-run average if you repeated an experiment infinitely many times.

Discrete Random Variable

$$E[X] = \mu = \sum_{i=1}^{n} x_i \cdot P(X = x_i)$$

Example: Rolling a fair die
$$E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = 3.5$$
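The die example can be checked with a short sketch. The function name `expected_value` and the dictionary representation of the distribution are assumptions made for this illustration; `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction


def expected_value(pmf):
    """Return E[X] for a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())


# Fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

ev = expected_value(die)
print(ev, float(ev))
```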

Continuous Random Variable

$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx$$

where:
• f(x) - probability density function (PDF)
• Integration over all possible values, weighted by probability density
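As a rough numerical sketch, the integral can be approximated with a midpoint Riemann sum. The density f(x) = 2x on [0, 1] is an assumed example (it integrates to 1, and its exact expected value is 2/3); the function names are hypothetical.

```python
def expected_value_continuous(pdf, lower, upper, steps=100_000):
    """Approximate the integral of x * pdf(x) over [lower, upper] (midpoint rule)."""
    width = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        x = lower + (k + 0.5) * width  # midpoint of the k-th subinterval
        total += x * pdf(x) * width
    return total


def example_pdf(x):
    """Assumed example density: f(x) = 2x on [0, 1], zero elsewhere."""
    return 2 * x


ev_continuous = expected_value_continuous(example_pdf, 0.0, 1.0)
print(ev_continuous)
```

The integration limits are restricted to [0, 1] because the assumed density is zero outside that interval.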

Properties of Expectation

Linearity:
$$E[aX + bY] = aE[X] + bE[Y]$$
Constant:
$$E[c] = c$$
Function of random variable:
$$E[g(X)] = \sum g(x_i) \cdot P(X = x_i)$$
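A small sketch of these properties on the fair die, assuming a hypothetical helper `expect` that takes the distribution as a dictionary. It checks E[g(X)] with g(x) = x² and the special case E[aX + c] = aE[X] + c, which combines the linearity and constant rules.

```python
from fractions import Fraction


def expect(pmf, g=lambda x: x):
    """Return E[g(X)] for a discrete distribution given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())


die = {face: Fraction(1, 6) for face in range(1, 7)}

e_x = expect(die)                           # E[X]
e_x2 = expect(die, lambda x: x ** 2)        # E[X^2]
e_linear = expect(die, lambda x: 2 * x + 3)  # E[2X + 3]

print(e_x, e_x2, e_linear)
```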

Variance and Standard Deviation

What is Variance?

Variance measures how spread out the data is from the mean. It's the average of squared deviations from the mean. Standard deviation is the square root of variance (same units as data).

Population Variance

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

where:
• σ² - population variance (sigma squared)
• N - population size
• μ - population mean
• Used when you have the entire population

Sample Variance

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

where:
• s² - sample variance
• n - sample size
• $\bar{x}$ - sample mean
• Divides by (n-1) to give an unbiased estimator (Bessel's correction)
• Used when you have a sample from a population
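The two formulas differ only in the divisor. A minimal sketch with made-up data follows; the function names are illustrative, and the standard library's `statistics.pvariance` and `statistics.variance` compute the same quantities.

```python
from statistics import pvariance, variance


def population_variance(values):
    """Average squared deviation from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Squared deviations from the sample mean, divided by n - 1."""
    if len(values) < 2:
        raise ValueError("sample variance needs at least two values")
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


# Hypothetical data: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = population_variance(data)  # 32 / 8
samp_var = sample_variance(data)     # 32 / 7

print(pop_var, pvariance(data))
print(samp_var, variance(data))
```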

Variance using Expectation

$$\text{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$$

Computational formula: Often easier to calculate
"Expected value of square minus square of expected value"
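Both forms can be compared on the fair die from the expected value section. This is a sketch with a hypothetical helper `expect`; exact fractions make the equality check reliable.

```python
from fractions import Fraction


def expect(pmf, g=lambda x: x):
    """Return E[g(X)] for a discrete distribution given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())


die = {face: Fraction(1, 6) for face in range(1, 7)}
mu = expect(die)

# Definition: E[(X - mu)^2]
var_definition = expect(die, lambda x: (x - mu) ** 2)

# Computational formula: E[X^2] - (E[X])^2
var_shortcut = expect(die, lambda x: x ** 2) - mu ** 2

print(var_definition, var_shortcut)
```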

Standard Deviation

$$\sigma = \sqrt{\sigma^2} \quad \text{or} \quad s = \sqrt{s^2}$$

Advantage: Same units as original data (easier to interpret)
Example: If data is in dollars, std is in dollars (not dollars²)

Properties of Variance

Constant:
$$\text{Var}(c) = 0$$
Scaling:
$$\text{Var}(aX) = a^2 \text{Var}(X)$$
Shifting (adding constant):
$$\text{Var}(X + b) = \text{Var}(X)$$
Independent variables:
$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$$
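A quick numerical check of these rules, as a sketch with made-up values. The independence rule is illustrated by enumerating all 36 equally likely outcomes of two independent fair dice.

```python
import math
from itertools import product
from statistics import pvariance

x = [1, 2, 3, 4]  # hypothetical data
a, b = 3, 10      # arbitrary scale and shift

var_x = pvariance(x)
var_scaled = pvariance([a * v for v in x])   # Var(aX)
var_shifted = pvariance([v + b for v in x])  # Var(X + b)

scaling_holds = math.isclose(var_scaled, a ** 2 * var_x)
shifting_holds = math.isclose(var_shifted, var_x)

# Two independent dice: every (d1, d2) pair is equally likely.
faces = list(range(1, 7))
sums = [d1 + d2 for d1, d2 in product(faces, repeat=2)]
var_one_die = pvariance(faces)
var_two_dice = pvariance(sums)
additivity_holds = math.isclose(var_two_dice, 2 * var_one_die)

print(scaling_holds, shifting_holds, additivity_holds)
```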

Median and Mode

Median

The middle value when data is sorted. Splits data into two equal halves.

If n is odd:
$$\text{Median} = x_{(n+1)/2}$$
If n is even:
$$\text{Median} = \frac{x_{n/2} + x_{(n/2)+1}}{2}$$

Properties:
• Robust to outliers (largely unaffected by extreme values)
• Better than mean for skewed distributions
• 50th percentile (Q2)
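The odd and even cases can be sketched as below. The function name is illustrative (the standard library offers `statistics.median`), and Python's 0-based indexing shifts the positions from the 1-based formulas by one.

```python
def median_of(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # 1-based position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2  # 1-based positions n/2 and n/2 + 1


odd_case = median_of([7, 1, 3])
even_case = median_of([4, 1, 3, 2])
print(odd_case, even_case)
```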

Mode

The most frequently occurring value in the dataset.

Properties:
• Can have multiple modes (bimodal, multimodal)
• Can have no mode (all values unique)
• Most useful for categorical data
• Not affected by outliers

Example: [1, 2, 2, 3, 4, 4, 4, 5] → Mode = 4
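A short sketch using `statistics.multimode` (Python 3.8+), which returns every value tied for the highest count. When all values are unique it returns all of them, which corresponds to the "no mode" case above. The extra lists are made up for illustration.

```python
from statistics import multimode

single = multimode([1, 2, 2, 3, 4, 4, 4, 5])   # one mode
bimodal = multimode([1, 1, 2, 2, 3])           # two values tie for most frequent
all_unique = multimode([1, 2, 3])              # every value appears once
categorical = multimode(["red", "blue", "red"])  # works on non-numeric data too

print(single, bimodal, all_unique, categorical)
```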

Quartiles and Percentiles

Quartiles

Quartiles divide the data into four equal parts.

• Q1 (25th percentile): 25% of data below this value
• Q2 (50th percentile): Median
• Q3 (75th percentile): 75% of data below this value

Interquartile Range (IQR):
$$\text{IQR} = Q3 - Q1$$
Measures spread of middle 50% of data (robust to outliers)

Outlier Detection using IQR

Lower fence: $Q1 - 1.5 \times \text{IQR}$
Upper fence: $Q3 + 1.5 \times \text{IQR}$

Values outside these fences are considered outliers
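A minimal sketch of the fence rule with made-up data. Quartile conventions differ between tools; this example assumes the "inclusive" method of `statistics.quantiles`, so other software may report slightly different Q1 and Q3 for the same data.

```python
from statistics import quantiles

# Hypothetical data with one suspicious value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

# Three cut points that split the data into four parts (assumed convention).
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```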

Mean vs Median vs Mode

Use Mean When:

✅ Data is symmetric
✅ No outliers
✅ Interval/ratio data
✅ Need to use all data points

❌ Avoid when outliers are present (the mean is sensitive to them)
❌ Avoid for skewed distributions

Use Median When:

✅ Data is skewed
✅ Outliers present
✅ Ordinal data
✅ Want robust measure

✅ Income, house prices
✅ Better for rankings

Use Mode When:

✅ Categorical data
✅ Finding most common
✅ Nominal data
✅ Discrete distributions

✅ Most popular product
✅ Most frequent rating

Key Insights

Relationship in symmetric distributions:
Mean = Median = Mode (for symmetric, unimodal distributions such as the Normal distribution)

Right-skewed (positive skew):
Typically Mode < Median < Mean (e.g., income distribution)

Left-skewed (negative skew):
Typically Mean < Median < Mode (e.g., age at retirement)
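To illustrate the right-skewed ordering, here is a sketch on a small made-up dataset with a long right tail. It shows the usual pattern, not a guarantee for every skewed dataset.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: most values are small, a few are large.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

skew_mode = mode(right_skewed)
skew_median = median(right_skewed)
skew_mean = mean(right_skewed)

print(skew_mode, skew_median, skew_mean)
```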

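Why n-1 in sample variance?
Bessel's correction: Deviations are measured from the sample mean, which sits closer to the sample values than the population mean does, so dividing by n underestimates the population variance on average. Dividing by (n-1) instead gives an unbiased estimator.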
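The effect can be checked exactly on a tiny example. This sketch assumes a made-up population of four values and enumerates every equally likely sample of size 2 drawn with replacement, then averages both estimators over all samples.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]  # hypothetical population
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N  # true population variance


def sum_sq(sample):
    """Sum of squared deviations from the sample's own mean."""
    x_bar = Fraction(sum(sample), len(sample))
    return sum((x - x_bar) ** 2 for x in sample)


n = 2
samples = list(product(population, repeat=n))  # all 16 ordered samples

avg_divide_by_n = sum(sum_sq(s) / n for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(sum_sq(s) / (n - 1) for s in samples) / len(samples)

print(sigma2, avg_divide_by_n, avg_divide_by_n_minus_1)
```

Averaged over all samples, dividing by n recovers only (n-1)/n of the population variance, while dividing by (n-1) recovers it exactly.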

Variance units problem:
Variance is in squared units (hard to interpret). Standard deviation fixes this by taking square root.

Coefficient of Variation (CV):
$$CV = \frac{\sigma}{\mu} \times 100\%$$
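Relative variability - useful for comparing spread across different scales. Only meaningful for ratio-scale data with a non-zero mean; for a sample, use s and $\bar{x}$ in place of σ and μ.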
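A brief sketch of the CV on two made-up datasets that differ only in scale. The function name is illustrative, and it uses the population standard deviation (`statistics.pstdev`) to match the formula above.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / mu * 100


# Hypothetical data: the second list is the first multiplied by 100.
small_scale = [10, 12, 14]
large_scale = [1000, 1200, 1400]

cv_small = coefficient_of_variation(small_scale)
cv_large = coefficient_of_variation(large_scale)

print(round(cv_small, 2), round(cv_large, 2))
```

The standard deviations differ by a factor of 100, but the CV is the same for both lists, which is why it suits comparisons across scales.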