Visualizing Relationships I

Scatter Plots and Correlation

STAT 80B - Data Visualization

Welcome to Week 5!

Today’s Goals

By the end of this lecture, you will be able to:

  • Understand what scatter plots reveal about relationships
  • Interpret correlation coefficients
  • Identify different types of relationships in data
  • Create effective scatter plots by hand

Part 1: What Are We Looking For?

The Big Question

When we have two quantitative variables, we want to know:

Do they relate to each other?

  • As one increases, what happens to the other?
  • Is there a pattern or just randomness?
  • How strong is the relationship?

Examples:

  • Height vs. weight
  • Study hours vs. exam score
  • Temperature vs. ice cream sales
  • Age vs. reaction time

Reading: Wilke Chapter 12

Required Reading

Chapter 12: Visualizing Associations

Sections to focus on:

  • 12.1 Scatter plots (essential!)
  • 12.2 Correlograms
  • 12.4 Paired data

Available at: clauswilke.com/dataviz

Terminology: What Goes Where?

The Language of Scatter Plots

We plot the variable on the y-axis AGAINST the variable on the x-axis

  • Y-axis: Response variable (the outcome we want to explain)
  • X-axis: Explanatory variable (what might explain it)

Example: Plot head length (y) against body mass (x)

Part 2: Understanding Correlation

What is Correlation?

Correlation coefficient (r): A number between -1 and +1 that measures the strength and direction of the linear relationship between two variables

r = +1

Perfect positive relationship

As x ↑, y ↑

r = 0

No linear relationship

No straight-line pattern (a curved pattern can still be present)

r = -1

Perfect negative relationship

As x ↑, y ↓
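
To make the three reference values concrete, here is a minimal sketch of how r can be computed. It assumes r means the usual Pearson coefficient; the function name `pearson_r` and the small data lists are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations: positive when x and y tend to fall on the
    # same side of their means, negative when they fall on opposite sides.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises in lockstep with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls in lockstep with x
print(r_up, r_down)
```

The first call lands at +1 and the second at -1, because every point sits exactly on a straight line.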

Visualizing Correlation Strength

Think of correlation in ranges:

  • |r| = 0.9 to 1.0: Very strong relationship
  • |r| = 0.7 to 0.9: Strong relationship
  • |r| = 0.4 to 0.7: Moderate relationship
  • |r| = 0.2 to 0.4: Weak relationship
  • |r| = 0.0 to 0.2: Very weak or no linear relationship

Note: |r| means the absolute value (ignore the sign)
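
The ranges above can be written as a small lookup. This is only a sketch: the function name `describe_strength` is hypothetical, and because the listed ranges share their endpoints, it assumes each boundary value (0.2, 0.4, 0.7, 0.9) belongs to the stronger category.

```python
def describe_strength(r):
    """Map a correlation coefficient to the verbal labels used in this lecture."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)  # the sign gives direction, not strength
    # Assumption: a boundary value counts toward the stronger label.
    if size >= 0.9:
        return "very strong"
    if size >= 0.7:
        return "strong"
    if size >= 0.4:
        return "moderate"
    if size >= 0.2:
        return "weak"
    return "very weak or none"


for r in (-0.95, 0.75, 0.5, -0.3, 0.05):
    print(f"r = {r:+.2f}: {describe_strength(r)}")
```

These cutoffs are a rule of thumb for this course, not a universal standard; different fields draw the lines differently.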

Important: Correlation ≠ Causation!

Critical Thinking Required!

Just because two variables are correlated does NOT mean one causes the other!

Examples of spurious correlations:

  • Ice cream sales and drowning deaths (both rise in summer heat, a confounding variable)
  • Number of Nicolas Cage movies and swimming pool drownings
  • Per capita cheese consumption and people who died tangled in bedsheets

🎯 Activity 1: Correlation by Eye (10 min)

Your Task

Work with a partner:

  1. I will show you 6 scatter plots (on the next slides)
  2. Estimate the correlation for each one
  3. Rank them from strongest to weakest relationship
  4. Write down your estimates (we’ll check answers together)

Plot A

    20 |                    *
       |              *  *     *
    15 |        *  *           *
       |    *        *     *
    10 |  *    *        *
       | *         *
     5 |    *  *
       |___________________________
         5   10  15  20  25  30

Your estimate: r = ____

Plot B

    20 |                       *
       |                    *
    15 |                 *
       |              *
    10 |           *
       |        *
     5 |     *
       |  *
       |___________________________
         5   10  15  20  25  30

Your estimate: r = ____

Plot C

    20 | *  *     *  *     *    *
       |    *  *    *   *     *
    15 | *     *  *   *    *
       |  *  *     *     *   *
    10 |    *   *    *  *     *
       | *    *   *       *
     5 |   *    *    *  *    *
       |___________________________
         5   10  15  20  25  30

Your estimate: r = ____

Plot D

    20 |  *
       |     *
    15 |        *
       |           *
    10 |              *
       |                 *
     5 |                    *
       |                       *
       |___________________________
         5   10  15  20  25  30

Your estimate: r = ____

Plot E

    20 |              *     *
       |         *  *    *
    15 |      *        *      *
       |    *      *       *
    10 |        *      *
       |  *       *        *
     5 |      *       *
       |___________________________
         5   10  15  20  25  30

Your estimate: r = ____

Plot F

    20 |                       *
       |                    *
    15 |                 *
       |              *
    10 |           *
       |        *
     5 |     *
       |  *
       |___________________________
         5   10  15  20  25  30

Your estimate: r = ____

Part 3: Making Scatter Plots

Anatomy of a Scatter Plot

Every scatter plot needs:

  1. Two quantitative variables (continuous or discrete)
  2. Clear axis labels with units
  3. Appropriate scale on both axes
  4. Each observation as one point
  5. Title explaining what’s being compared

Real Example: Blue Jay Data

From Wilke Chapter 12 - measurements on 123 blue jays:

  • Head length (mm): from tip of bill to back of head
  • Body mass (grams)
  • Skull size (mm): head length minus bill length

Research question: Do heavier birds have longer heads?

What the Data Shows

Head length vs body mass:

  • Moderate positive correlation
  • Heavier birds tend to have longer heads
  • But there’s variation (it’s biology, not physics!)
  • Some small birds have long heads
  • Some heavy birds have shorter heads

This is typical of biological data - relationships exist but aren’t perfect!

🎯 Activity 2: Create Your Own Scatter Plot (15 min)

The Data: Study Habits

Here are exam scores and study hours for 10 students:

Student   Study Hours   Exam Score
A         2             65
B         5             78
C         3             68
D         8             92
E         1             58
F         6             85
G         4             75
H         7             88
I         9             95
J         3             70

Your Task

On graph paper or blank paper:

  1. Draw axes
    • X-axis: Study Hours (0-10)
    • Y-axis: Exam Score (50-100)
  2. Label everything clearly
    • Axis titles with units
    • Title: “Exam Scores vs Study Hours”
  3. Plot each student as a point
  4. Answer these questions:
    • What is the general trend?
    • Estimate the correlation coefficient
    • Would you describe this as strong, moderate, or weak?
    • Is there anyone who doesn’t follow the pattern?
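
For comparison with your hand-drawn version, the same steps can be sketched in software. This assumes the third-party matplotlib library is installed; the function name `make_study_plot`, the output filename, and the "points" unit on the y-axis are assumptions, not part of the activity.

```python
import matplotlib

matplotlib.use("Agg")  # draw to a file instead of opening a window
import matplotlib.pyplot as plt

# Data from the Study Habits table (students A through J).
study_hours = [2, 5, 3, 8, 1, 6, 4, 7, 9, 3]
exam_scores = [65, 78, 68, 92, 58, 85, 75, 88, 95, 70]


def make_study_plot():
    """Build the Activity 2 scatter plot and return the figure and axes."""
    fig, ax = plt.subplots(figsize=(5, 4))
    ax.set_xlim(0, 10)                     # step 1: axis ranges from the task
    ax.set_ylim(50, 100)
    ax.set_xlabel("Study hours (hours)")   # step 2: axis titles with units
    ax.set_ylabel("Exam score (points)")   # "points" is an assumed unit
    ax.set_title("Exam Scores vs Study Hours")
    ax.scatter(study_hours, exam_scores)   # step 3: one point per student
    return fig, ax


if __name__ == "__main__":
    fig, ax = make_study_plot()
    fig.savefig("exam_scores_vs_study_hours.png", dpi=150)
```

Step 4 (reading the trend) is still your job: the code draws the points but does not interpret them.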

Activity Discussion

Let’s share what you found:

  • What correlation did you estimate?
  • Who found outliers or unusual points?
  • What does this relationship suggest about studying?
  • What other factors might affect exam scores?

Part 4: Types of Relationships

Not All Relationships Are Linear!

Correlation (r) only measures linear relationships.

Other patterns exist:

  • Curved relationships (quadratic, exponential)
  • No relationship (random scatter)
  • Clusters or groups (different subpopulations)
  • Outliers (unusual observations)

Example: Curved Relationship

Exercise intensity vs. enjoyment:

  • Low intensity: boring (low enjoyment)
  • Medium intensity: fun! (high enjoyment)
  • High intensity: exhausting (low enjoyment)

This is an inverted U-shape - correlation might be near zero even though there’s a clear pattern!

Example: Groups Within Data

Blue jay data with sex included:

When Wilke colored points by bird sex:

  • Male and female birds form somewhat separate groups
  • Within each sex, head length and body mass correlate
  • But females are lighter with shorter heads overall
  • Context matters!

🎯 Activity 3: Identify the Relationship (10 min)

Scenario Card Sort

Work in groups of 3-4. For each scenario, decide:

  1. Would you expect a positive, negative, or no relationship?
  2. Would it be linear or curved?
  3. Would correlation be strong, moderate, or weak?

Scenarios

A. Age of a car vs. its resale value

B. Outside temperature vs. hot chocolate sales

C. Number of slugs visible on campus vs. humidity in Santa Cruz

D. Hours of sleep vs. test performance

E. Distance from equator vs. average temperature

F. Coffee consumption vs. alertness (think about too much coffee!)

G. Years of education vs. lifetime earnings

H. Number of bikes rented per day in Santa Cruz vs. amount of daily rainfall in Santa Cruz

Discuss Your Answers

What did you decide for each scenario?

  • Which ones might have curved relationships?
  • Which ones might have confounding variables?
  • When would correlation be misleading?

Part 5: Common Pitfalls

Mistakes to Avoid

Watch Out For:

  1. Assuming causation from correlation
  2. Ignoring outliers (they can distort correlation)
  3. Extrapolating beyond your data range
  4. Missing nonlinear patterns
  5. Not considering subgroups in your data
  6. Comparing variables measured in different units without thinking

The Outlier Problem

A single extreme point can:

  • Make correlation appear stronger (or weaker) than it really is
  • Mislead you about the overall pattern
  • Hide the true relationship for most of the data

Always look at your scatter plot! Don’t just calculate r.

What’s Next?

Next Lecture Preview

Next class we’ll cover:

  • Adding multiple variables to scatter plots (color, size, shape)
  • Handling overplotting (transparency, jittering)
  • Introduction to correlograms
  • Paired data and before/after comparisons

Homework: Read Wilke Chapter 12 sections 12.2-12.4

Remember These Points

  1. Scatter plots show relationships between two quantitative variables
  2. Correlation (r) measures the strength and direction of a linear relationship, from -1 to +1
  3. Correlation ≠ causation!
  4. Always plot your data - don’t just calculate statistics
  5. Look for patterns, outliers, and subgroups
  6. Context matters in interpreting relationships

Questions?

Office hours: Check Canvas for times

Concept Map 2 is coming: Start thinking about how to organize chart types and when to use different visualizations.