Week 8
A question from evolutionary biology:
Can we predict how heavy a penguin is just by measuring its flipper?
Researchers at Palmer Station, Antarctica measured 344 penguins across three species: Adélie, Chinstrap, and Gentoo.
Today we'll use this dataset to understand how two numerical variables relate, and how to use one to predict the other.
A random sample of 6 penguins from the dataset:
| Flipper Length (mm) | Body Mass (g) |
|---|---|
| 191 | 4150 |
| 190 | 3400 |
| 230 | 5700 |
| 190 | 3700 |
| 209 | 4600 |
| 190 | 4250 |
Just looking at numbers is hard. We need a visualization.
A scatterplot displays the relationship between two numerical variables.
When describing a scatterplot, always comment on:
1. Direction: Does the relationship go up or down?
As flipper length increases, body mass tends to increase (positive)
2. Form: Is the pattern linear or curved?
The relationship appears roughly linear
3. Strength: How tightly do the points cluster around the pattern?
Points cluster fairly closely (moderately strong)
4. Outliers: Any unusual points?
A few points fall away from the main trend
[Poll Everywhere: respond now!]
Look at this scatterplot description:
"As daily average temperature increases, hot chocolate sales decrease. The relationship is fairly linear and strong, with one outlier on a very cold day."
Discuss with your neighbor (2 min):
→ Report your answer to Q3 on Poll Everywhere
We can measure the strength and direction of a linear relationship with the correlation coefficient r.
\[r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)\]
Properties of r:
For the penguins: Flipper length vs. body mass gives r = 0.87
Interpretation: Strong, positive linear relationship between flipper length and body mass.
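To make the formula concrete, here is a minimal Python sketch that applies it to the six sampled penguins from the table above. Python and the function name `correlation` are choices made for this illustration, not part of the course materials. Because it uses only 6 penguins, its result will not match the full-data value of r = 0.87.

```python
from statistics import mean, stdev

# The six sampled penguins from the table above.
flipper_mm = [191, 190, 230, 190, 209, 190]
body_mass_g = [4150, 3400, 5700, 3700, 4600, 4250]


def correlation(x, y):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample SDs (n - 1 in the denominator)
    z_products = [
        ((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)
    ]
    return sum(z_products) / (n - 1)


r = correlation(flipper_mm, body_mass_g)
print(f"r for the 6-penguin sample: {r:.3f}")
```

Swapping the two arguments gives the same value, since r does not depend on which variable is called x.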
r = 0.87 means flipper length and body mass are strongly associated, but do longer flippers cause heavier penguins?
Possible explanations for an association:
Famous spurious correlations:
[Poll Everywhere: respond now!]
Researchers find that students who eat breakfast regularly have higher GPAs (r = 0.45, p < 0.001).
Discuss with your neighbor (2 min):
→ Report your confounding variable on Poll Everywhere
Correlation tells us how strongly two variables are related.
Regression gives us a line to describe or predict the response from the explanatory variable.
The least squares regression line minimizes the sum of squared residuals:
\[\hat{y} = b_0 + b_1 x\]
where \(b_0\) is the intercept and \(b_1\) is the slope.
For our penguins:
\[\widehat{\text{body mass}} = -5781 + 49.7 \times \text{flipper length}\]
The least squares formulas for slope and intercept:
Slope:
\[b_1 = r \cdot \frac{s_y}{s_x}\]
where \(r\) is the correlation, \(s_y\) is the SD of the response, and \(s_x\) is the SD of the explanatory variable.
Intercept:
\[b_0 = \bar{y} - b_1 \bar{x}\]
The line always passes through the point \((\bar{x},\ \bar{y})\), the means of both variables.
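The two formulas can be sketched in a few lines of Python. This is an illustration on the six sampled penguins only; the helper name `least_squares_line` is hypothetical, and the resulting slope and intercept will differ from the full-data values.

```python
from statistics import mean, stdev

# The six sampled penguins from the table shown earlier.
flipper_mm = [191, 190, 230, 190, 209, 190]
body_mass_g = [4150, 3400, 5700, 3700, 4600, 4250]


def least_squares_line(x, y):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    # Correlation coefficient, written as a single sum.
    r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (
        (n - 1) * s_x * s_y
    )
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares_line(flipper_mm, body_mass_g)
print(f"slope b1 = {b1:.1f} g/mm, intercept b0 = {b0:.0f} g")

# The fitted line passes through the point of means (x_bar, y_bar).
print(abs((b0 + b1 * mean(flipper_mm)) - mean(body_mass_g)) < 1e-6)
```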
For our penguins:
\[b_1 = 0.871 \cdot \frac{802}{14.06} \approx 49.7 \text{ g/mm} \qquad b_0 = 4202 - 49.686 \times 200.92 \approx -5781 \text{ g}\]
Slope (b₁ = 49.7): For each additional 1 mm of flipper length, predicted body mass increases by 49.7 grams, on average.
Intercept (b₀ = -5781): A penguin with flipper length = 0 mm would be predicted to weigh -5781 grams. This is not meaningful, just a mathematical anchor.
\[\widehat{\text{body mass}} = -5781 + 49.7 \times \text{flipper length}\]
Example: A penguin has a flipper length of 200 mm. Predict its body mass.
\[\hat{y} = -5781 + 49.7 \times 200 = 4159 \text{ g}\]
⚠️ Extrapolation warning:
The penguin data ranges from flipper lengths of 172–231 mm.
Predicting for flipper = 100 mm or flipper = 300 mm is extrapolation: we have no data there, and the linear pattern may not hold!
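One way to keep the warning in mind is to have the prediction step flag extrapolation itself. The sketch below is a Python illustration using the coefficients and flipper range quoted above; the function name `predict_body_mass` and the flagging behaviour are assumptions made for this example.

```python
# Coefficients and observed flipper range as reported in the lecture.
B0 = -5781.0  # intercept (g)
B1 = 49.7     # slope (g per mm)
FLIPPER_MIN_MM, FLIPPER_MAX_MM = 172, 231


def predict_body_mass(flipper_mm):
    """Return (predicted mass in g, True if this is an extrapolation)."""
    predicted = B0 + B1 * flipper_mm
    is_extrapolation = not (FLIPPER_MIN_MM <= flipper_mm <= FLIPPER_MAX_MM)
    return predicted, is_extrapolation


for flipper in (200, 100, 300):
    mass, extrapolated = predict_body_mass(flipper)
    note = " (extrapolation: do not trust)" if extrapolated else ""
    print(f"{flipper} mm -> {mass:.0f} g{note}")
```

For 100 mm the line returns a negative body mass, which shows concretely why predictions outside the observed range should not be trusted.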
Call:
lm(formula = body_mass_g ~ flipper_length_mm, data = penguins_clean)
Residuals:
Min 1Q Median 3Q Max
-1058.80 -259.27 -26.88 247.33 1288.69
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5780.831 305.815 -18.90 <2e-16 ***
flipper_length_mm 49.686 1.518 32.72 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 394.3 on 340 degrees of freedom
Multiple R-squared: 0.759, Adjusted R-squared: 0.7583
F-statistic: 1071 on 1 and 340 DF, p-value: < 2.2e-16
What does R² = 0.759 tell us?
R² (R-squared) = the proportion of variability in the response variable explained by the model.
R² = 0.759 means:
75.9% of the variability in penguin body mass is explained by flipper length.
The remaining 24.1% is due to other factors (species, sex, diet, age, etc.)
Relationship to r:
\[R^2 = r^2 = (0.87)^2 = 0.7569 \approx 0.759\]
For simple linear regression, R² is literally the correlation squared!
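The identity can be checked numerically. The Python sketch below computes R² as "1 minus unexplained over total variability" and compares it with r², using only the six sampled penguins (so the value itself will not be 0.759). Variable names are chosen for this illustration.

```python
from statistics import mean

# The six sampled penguins from the table shown earlier.
flipper_mm = [191, 190, 230, 190, 209, 190]
body_mass_g = [4150, 3400, 5700, 3700, 4600, 4250]

x_bar, y_bar = mean(flipper_mm), mean(body_mass_g)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(flipper_mm, body_mass_g))
s_xx = sum((x - x_bar) ** 2 for x in flipper_mm)
s_yy = sum((y - y_bar) ** 2 for y in body_mass_g)  # total variability in y

# Least squares fit (algebraically equal to b1 = r * s_y / s_x).
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Unexplained variability: sum of squared residuals.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(flipper_mm, body_mass_g))

r_squared = 1 - sse / s_yy       # proportion of variability explained
r = s_xy / (s_xx * s_yy) ** 0.5  # correlation coefficient

print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```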
While you rest, think about:
We've predicted body mass from flipper length. But how well does our line actually fit? How do we check?
See you in 10!
The residual for each observation is the difference between the actual and predicted values:
\[e_i = y_i - \hat{y}_i\]
Example:
A positive residual: the actual value is above the line (we underpredicted)
A negative residual: the actual value is below the line (we overpredicted)
The least squares line is chosen to make the sum of squared residuals as small as possible.
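As a worked illustration, take the fitted line \(\widehat{\text{body mass}} = -5781 + 49.7 \times \text{flipper length}\) and the first two penguins in the sample table, (191 mm, 4150 g) and (190 mm, 3400 g):

\[\hat{y}_1 = -5781 + 49.7 \times 191 = 3711.7 \text{ g}, \qquad e_1 = 4150 - 3711.7 = 438.3 \text{ g}\]
\[\hat{y}_2 = -5781 + 49.7 \times 190 = 3662 \text{ g}, \qquad e_2 = 3400 - 3662 = -262 \text{ g}\]

The first penguin is heavier than the line predicts (positive residual, underprediction); the second is lighter (negative residual, overprediction).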
To check whether a linear model is appropriate, we plot residuals vs. fitted values.
What we want to see: No pattern, just random scatter around zero.
Good: Random scatter around zero
→ Linear model is appropriate; constant variance
Problem: Curved pattern
→ The relationship is not truly linear; consider transformations
Problem: Fan shape (variance increases)
→ Constant variance assumption is violated
Problem: Outliers (one or two extreme points)
→ Investigate those observations carefully
Interpretation: Clusters by species reveal that mixing three species creates structure in the residuals. Species is a lurking variable!
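The curved-pattern case can be reproduced with a small Python sketch. The data here are hypothetical (a perfect quadratic, not penguin measurements), chosen only so that a straight-line fit is clearly wrong; the residuals, listed in order of fitted value, come out positive at both ends and negative in the middle.

```python
from statistics import mean

# Hypothetical data with a curved (quadratic) relationship: y = x^2.
x = list(range(1, 10))
y = [xi ** 2 for xi in x]

# Fit a straight line by least squares anyway.
x_bar, y_bar = mean(x), mean(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Print residuals in order of fitted value; the signs trace out a U shape.
for fi, ei in sorted(zip(fitted, residuals)):
    print(f"fitted = {fi:7.2f}   residual = {ei:7.2f}   {'+' if ei > 0 else '-'}")
```

The residuals still sum to zero, so a systematic pattern like this only becomes visible when they are plotted or listed against the fitted values.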
[Poll Everywhere: respond now!]
A researcher fits a linear regression of plant height (cm) on amount of fertilizer (g). The residual plot shows a clear U-shaped curve.
Discuss with your neighbor (2 min):
→ Report your answer to Q3 on Poll Everywhere
For regression to be valid (especially for inference), we need:
Linearity: the relationship between x and y is linear (check scatterplot)
Independence: observations are independent of each other (study design)
Normal residuals: residuals are approximately normally distributed (check histogram of residuals)
Equal variance: residuals have roughly constant spread (check residual plot)
Remember the acronym: LINE
Scatterplots visualize the relationship between two numerical variables: describe direction, form, strength, and outliers
Correlation (r) quantifies linear association: ranges from −1 to +1
Correlation ≠ Causation: association could be due to lurking variables
Regression line (\(\hat{y} = b_0 + b_1 x\)) predicts the response from the explanatory variable
Slope: change in predicted ŷ per 1-unit increase in x; Intercept: predicted ŷ when x = 0
R²: proportion of variability in y explained by the model
Residual plots: check for linearity and equal variance assumptions
Next class we cover:
The penguins will be back! We'll compare bill length across all three species.
STAT 7 · Winter 2026