Understanding Distributions

ECDFs and Q-Q Plots

STAT 80: Data Visualization

Week 4, Day 1

Multiple Distributions & Comparisons

Today’s focus:

  • What is a distribution?
  • ECDFs (Empirical Cumulative Distribution Functions)
  • Q-Q plots for comparing distributions
  • When to use different tools
  • Reminder about your proposal

What Do We Mean by “Distribution”?

Distribution = The pattern of how data values are spread out

Think of it as asking:

  • What values appear in my data?
  • How often does each value appear?
  • Are values clustered together or spread out?

Everyday example:

Heights of students in class:

  • Most people: 5’4” - 6’0”
  • Some shorter, some taller
  • Very few people below 5’0” or above 6’6”

This pattern is the distribution of heights

Why Do Distributions Matter?

Understanding distributions helps you answer questions like:

  • What’s typical? (Where are most values?)
  • What’s unusual? (Are there outliers?)
  • How much variety? (Are values similar or spread out?)
  • Are there patterns? (Clusters? Gaps?)

In visualization: Showing the distribution clearly helps viewers understand your data’s story

Review: Histogram

You’ve already seen one way to visualize distributions!

  • Height of bars = how many values in each range
  • Shows the shape of the distribution

But Histograms Have Limitations…

Problem 1: Bin choices matter

  • Too few bins - lose detail
  • Too many bins - looks noisy

Different bin choices → different impressions of the same data!
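
To see the bin effect with numbers, here is a minimal sketch in plain Python. The sample is made up for illustration, and the helper name bin_counts is hypothetical; it only counts values in equal-width bins, which is roughly what a histogram does before drawing bars.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs in the last bin, not in a bin of its own.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical sample, made up for illustration only.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9, 10, 10, 11, 11, 11, 12, 12, 13]

few_bins = bin_counts(sample, 2)
many_bins = bin_counts(sample, 12)
print(few_bins)   # [9, 9]
print(many_bins)  # [1, 2, 3, 2, 1, 0, 0, 0, 1, 2, 3, 3]
```

With 2 bins the data look evenly spread; with 12 bins the same data show two clusters with a gap between them.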

Problem 2: Hard to Compare Across Groups

When you have multiple groups, histograms get messy:

  • Which group is higher at each point?
  • Where do they overlap?
  • Hard to see precise differences

Enter: The ECDF

ECDF = Empirical Cumulative Distribution Function

(Don’t worry about the fancy name!)

What it shows:

For any value x, the ECDF tells you:

“What percentage of my data is less than or equal to x?”
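
For anyone who likes to see it in symbols, one common way to write this down is shown below. The notation is an assumption of this note, not something from the slides: the data values are x_1 through x_n, and the ECDF is written as F-hat.

```latex
\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le x\}
```

The indicator term is 1 when a data value is at or below x and 0 otherwise, so the sum simply counts those values. Dividing by n gives a proportion; multiply by 100 for a percentage.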

Reading an ECDF: pick a value on the x-axis, go up to the line, and read the cumulative percentage on the y-axis.

ECDF: Step by Step

Let’s build one together!

Data: Test scores: 90, 70, 75, 65, 75, 80, 85, 95, 100

Step 1: Sort the data: 65, 70, 75, 75, 80, 85, 90, 95, 100

Step 2: For each value, calculate what % of data is ≤ that value

  • 65: 1/9 ≈ 11%
  • 70: 2/9 ≈ 22%
  • 75: 4/9 ≈ 44% (two students got 75!)
  • 80: 5/9 ≈ 56%
  • …and so on
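
The two steps can be sketched in a few lines of Python to fill in the "and so on" part. This is one possible way to do it, not the only one; the function name ecdf_table is hypothetical.

```python
from collections import Counter

scores = [90, 70, 75, 65, 75, 80, 85, 95, 100]

def ecdf_table(values):
    """Return (value, cumulative proportion) pairs, one per distinct value."""
    n = len(values)
    counts = Counter(values)
    table = []
    running = 0
    for value in sorted(counts):      # Step 1: go through the values in order
        running += counts[value]      # Step 2: how many values are <= this one
        table.append((value, running / n))
    return table

table = ecdf_table(scores)
for value, proportion in table:
    print(f"{value}: {proportion:.0%}")
# 65: 11%
# 70: 22%
# 75: 44%
# 80: 56%
# 85: 67%
# 90: 78%
# 95: 89%
# 100: 100%
```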

ECDF: Plotting It

The line steps up each time we encounter a data point

Data (sorted test scores): 65, 70, 75, 75, 80, 85, 90, 95, 100

Reading an ECDF: Examples

Question: What % scored 80 or below?

Answer: Find 80 on x-axis, go up to the line, read y-axis ≈ 56%

Question: What score is at the 75th percentile? (75% scored this or lower)

Answer: Find 0.75 on y-axis, go across to the line, read x-axis = 90 (the line jumps from 67% to 78% at a score of 90, so 0.75 lands on that step)
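
Both kinds of lookup can also be done without a plot. The sketch below assumes one specific convention for the percentile question: return the smallest observed score whose cumulative proportion reaches the target. Other conventions interpolate between scores and can give a slightly different answer, as can reading a plot by eye. The function names are hypothetical.

```python
from bisect import bisect_right

scores = sorted([90, 70, 75, 65, 75, 80, 85, 95, 100])

def ecdf_at(sorted_values, x):
    """Proportion of values less than or equal to x (x-axis to y-axis)."""
    return bisect_right(sorted_values, x) / len(sorted_values)

def score_at(sorted_values, p):
    """Smallest observed value whose cumulative proportion is at least p
    (y-axis to x-axis)."""
    for value in sorted_values:
        if ecdf_at(sorted_values, value) >= p:
            return value
    raise ValueError("p must be between 0 and 1")

share_at_80 = ecdf_at(scores, 80)
score_75th = score_at(scores, 0.75)
print(f"{share_at_80:.0%}")  # 56%
print(score_75th)            # 90
```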

ECDF vs Histogram: Advantages

ECDF Advantages:

  • ✅ No arbitrary choices (no bins to choose!)
  • ✅ Shows exact percentiles
  • ✅ Easy to overlay multiple groups
  • ✅ Shows all the data (every point matters)

Histogram Advantages:

  • ✅ Easier for most people to read initially
  • ✅ Shows density/“peakiness” more clearly
  • ✅ More familiar to general audiences

When to Use ECDFs

Best for:

  • Comparing two or more distributions
  • Finding exact percentiles
  • When you want to avoid bin choices
  • Technical/scientific audiences

Avoid when:

  • Your audience isn’t familiar with them
  • You want to emphasize the “shape” (peaks, valleys)
  • You’re showing very simple distributions

Activity: Reading ECDFs

  1. What percentage of Group A scored below 70?
  2. What percentage of Group B scored below 70?
  3. Which group performed better overall?
  4. At what score do the groups have the same cumulative percentage?

Q-Q Plots: Comparing Distributions

Q-Q = Quantile-Quantile

(Another fancy name for a simple idea!)

Purpose: Compare two distributions to see if they have the same shape

Key idea: If two distributions are the same shape, their percentiles should match

What’s a Quantile/Percentile?

Percentile = a value below which a certain % of data falls

Examples you know:

  • 25th percentile = 25% of data is below this value
  • 50th percentile = median (half below, half above)
  • 75th percentile = 75% of data is below this value

Quantile = just another word for percentile (but written as 0.25, 0.50, 0.75 instead of 25%, 50%, 75%)
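
As a quick check of the vocabulary, here is a small sketch using Python's standard-library statistics module on the test scores from earlier. The choice of method="inclusive" is an assumption of this sketch; different percentile rules can give slightly different numbers on small datasets.

```python
import statistics

scores = [90, 70, 75, 65, 75, 80, 85, 95, 100]

# n=4 asks for the three cut points that split the data into quarters:
# the 0.25, 0.50, and 0.75 quantiles (25th, 50th, and 75th percentiles).
q25, q50, q75 = statistics.quantiles(scores, n=4, method="inclusive")

print(q25, q50, q75)                      # 75.0 80.0 90.0
print(q50 == statistics.median(scores))   # True: the 0.50 quantile is the median
```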

How Q-Q Plots Work

For each percentile:

  1. Find that percentile in Distribution A
  2. Find that percentile in Distribution B
  3. Plot them against each other

If distributions match: points fall on a straight diagonal line

If distributions differ: points deviate from the line
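
The three steps can be sketched as follows. This is a minimal illustration, assuming the standard-library statistics.quantiles function with its "inclusive" method as the percentile rule; the function name qq_pairs and the sample data are hypothetical.

```python
import statistics

def qq_pairs(sample_a, sample_b, n_points=9):
    """Pair up matching quantiles of two samples.

    Returns a list of (quantile of A, quantile of B) tuples, one for each
    of n_points evenly spaced quantile levels. These pairs are the points
    you would draw on a Q-Q plot.
    """
    qa = statistics.quantiles(sample_a, n=n_points + 1, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n_points + 1, method="inclusive")
    return list(zip(qa, qb))

# Hypothetical sample, made up for illustration only.
a = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 18]
same_as_a = list(a)

pairs = qq_pairs(a, same_as_a)
on_diagonal = all(x == y for x, y in pairs)
print(pairs[:3])     # [(4.0, 4.0), (4.0, 4.0), (5.0, 5.0)]
print(on_diagonal)   # True: identical data fall exactly on the diagonal
```

Because the points come from percentiles rather than raw observations, the two samples do not need to be the same size.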

Q-Q Plot: Perfect Match

Points on the diagonal = distributions are identical!

Q-Q Plot: Different Locations

Parallel to diagonal but shifted = same shape, different average

Q-Q Plot: Different Spreads

Steeper or flatter than the diagonal = one distribution is more spread out than the other

Q-Q Plot: Different Shapes

Curved pattern = distributions have fundamentally different shapes
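
The straight-line patterns above can be told apart by the slope and intercept of a line through the Q-Q points. Here is a minimal sketch, assuming two samples of equal size (so the i-th smallest values can be paired directly) and made-up data; the function names are hypothetical.

```python
def qq_points(sample_a, sample_b):
    """Pair the i-th smallest value of A with the i-th smallest value of B."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal sample sizes")
    return list(zip(sorted(sample_a), sorted(sample_b)))

def fit_line(pairs):
    """Least-squares slope and intercept of a line through (x, y) points."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical sample, made up for illustration only.
base = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 18]
shifted = [v + 10 for v in base]     # every value moved up by 10
stretched = [2 * v for v in base]    # every value doubled: more spread out

for label, other in [("identical", base),
                     ("shifted", shifted),
                     ("stretched", stretched)]:
    slope, intercept = fit_line(qq_points(base, other))
    print(f"{label}: slope = {slope:.2f}, intercept = {intercept:.2f}")
```

A slope near 1 with intercept near 0 means the points sit on the diagonal. A slope near 1 with a nonzero intercept is the "parallel but shifted" case. A slope away from 1 is the "steeper or flatter" case. A curved pattern will not be summarized well by any straight line.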

When to Use Q-Q Plots

Best for:

  • Checking if two datasets have similar distributions
  • Statistical analysis (checking assumptions)
  • Identifying specific differences in shape
  • Technical audiences

Avoid when:

  • You just want to know “which group is higher” (use ECDF or boxplot)
  • Your audience is general public
  • You have more than 2 distributions to compare

Summary: ECDFs

What: Cumulative percentage plot

Reads: “What % of data is ≤ this value?”

Best for:

  • Comparing 2+ groups
  • Finding exact percentiles
  • Avoiding arbitrary bin choices

Watch out:

  • Can be unfamiliar to general audiences
  • Doesn’t show “peakiness” as clearly as histogram

Summary: Q-Q Plots

What: Percentile-to-percentile comparison

Reads: “Do these distributions have the same shape?”

Best for:

  • Comparing exactly 2 distributions
  • Identifying how distributions differ
  • Technical/statistical work

Watch out:

  • Requires explanation for non-technical audiences
  • Only compares 2 distributions at once

Pair and solve on paper

Constructing a Q-Q Plot by Hand

Two groups of students took different versions of a test:

Group X (Morning Section):

52, 58, 61, 65, 68, 70, 72, 74, 76, 78, 
80, 82, 84, 86, 88, 90, 92, 94, 96, 98

Group Y (Afternoon Section):

48, 54, 60, 62, 66, 68, 72, 74, 76, 78,
80, 82, 84, 86, 90, 92, 94, 96, 98, 100

  • Step 1: Order the Data

Task: Both datasets are already sorted. Verify you have 20 observations in each group.

  • Step 2: Pair the Quantiles

Task: Match corresponding quantiles from each group.

Observation                Group X   Group Y
1st (5th percentile)       52        48
2nd (10th percentile)      58        54
3rd (15th percentile)      61        60
…                          …         …
10th (50th percentile)     78        78
…                          …         …
20th (100th percentile)    98        100

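
If you want to check your hand-built pairs afterwards, the pairing in Step 2 can be sketched in Python. This assumes the same labeling used in the table, where the i-th smallest of 20 observations is called the (i/20 × 100)th percentile; the variable names are hypothetical.

```python
group_x = [52, 58, 61, 65, 68, 70, 72, 74, 76, 78,
           80, 82, 84, 86, 88, 90, 92, 94, 96, 98]
group_y = [48, 54, 60, 62, 66, 68, 72, 74, 76, 78,
           80, 82, 84, 86, 90, 92, 94, 96, 98, 100]

# Step 1: both groups must have the same number of observations.
n = len(group_x)
assert n == len(group_y) == 20

# Step 2: pair the i-th smallest score in X with the i-th smallest in Y.
rows = [
    (rank, 100 * rank // n, x, y)
    for rank, (x, y) in enumerate(zip(sorted(group_x), sorted(group_y)), start=1)
]

for rank, pct, x, y in rows:
    print(f"{rank:>2}  {pct:>3}%  {x:>3}  {y:>3}")
```

Each row gives one point for Step 3: Group X on one axis, Group Y on the other.
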
  • Step 3: Plot the Quantiles

  • Step 4: Answer the following Analysis Questions

  • What does the dashed line in the middle represent? What would it mean if all points fell exactly on this line?
  • Describe the pattern you see. Do the points follow the reference line closely, or do they deviate systematically?
  • Look at the lower quantiles (left side) and upper quantiles (right side). Which group tends to have lower scores at the low end? Which has higher scores at the high end?

When you are done, please submit your work. Make sure that both of your names are included, that you have a Q-Q plot, and that you answered all 3 questions.

Next Class Preview

We’ll look at even more ways to compare distributions:

  • Boxplots - compact summaries
  • Violin plots - shape + density
  • Ridgeline plots - elegant overlays
  • Small multiples - comparing many groups

Before that, make sure you are working on your project proposal; it’s due this Friday. I have office hours today after class in case you have questions.