HW7: Correlation, Regression & ANOVA
Patterns, Predictions, and Penguin Science
🐧 The Case
Welcome back, Statistical Detective!
Dr. Elena Fischer, a wildlife ecologist at the Polar Research Institute, has spent five seasons at Palmer Station, Antarctica, collecting data on three penguin species: Adelie, Chinstrap, and Gentoo. She approaches you with a problem.
“I have measurements on hundreds of penguins — flippers, bills, body mass, all of it. But I need help making sense of the relationships. Can flipper length really predict how heavy a penguin is? Are the three species genuinely different? And how do I know if my regression model is any good?”
Your mission this week: use correlation, simple linear regression, and ANOVA to answer Dr. Fischer’s biological questions — and learn to distinguish what the data can and cannot tell us.
Question 1: Reading Scatterplots
Dr. Fischer shows you four scatterplots from her dataset. For each description below, identify the direction, form, and strength of the relationship, and match it to the most plausible correlation coefficient from the list: \(r \in \{-0.22, +0.10, +0.85, +0.87\}\).
Scatterplot A: As flipper length increases, body mass increases strongly. Points cluster closely around a straight line with slight scatter.
Scatterplot B: There is virtually no pattern. Points are scattered randomly across the plot.
Scatterplot C: Bill length vs. bill depth. Birds with deeper bills tend to have slightly shorter bills; the downward trend is weak and there is substantial scatter.
Scatterplot D: Body mass and flipper length show a very strong positive trend, almost as strong as Scatterplot A but with slightly more scatter.
a. For each scatterplot, state the direction, form, and strength, and assign the most plausible \(r\) value. Explain your reasoning for each assignment in 1–2 sentences.
b. For Scatterplot A, a student says: “The strong positive correlation proves that longer flippers make penguins heavier.” What is wrong with this reasoning? Provide two alternative explanations for the association.
c. Scatterplot C has a negative correlation. Does a negative correlation indicate a weaker relationship than a positive one? Explain.
Question 2: Calculating and Interpreting Correlation
Dr. Fischer records flipper length (mm) and body mass (g) for a small sample of 8 Gentoo penguins:
| Penguin | Flipper Length (mm) | Body Mass (g) |
|---|---|---|
| 1 | 211 | 4500 |
| 2 | 214 | 4700 |
| 3 | 215 | 4425 |
| 4 | 217 | 4875 |
| 5 | 220 | 5200 |
| 6 | 221 | 5050 |
| 7 | 228 | 5550 |
| 8 | 232 | 5850 |
a. Calculate the mean and standard deviation for both flipper length (\(\bar{x} = 219.8\) mm, \(s_x = 7.2\) mm) and body mass (\(\bar{y} = 5019\) g, \(s_y = 500.1\) g). (These values are provided; verify one of them by hand.)
b. Using the formula below, calculate the standardized products (the terms inside the sum) for penguins 1 and 8 only. Then, based on the full table, state whether you’d expect the overall \(r\) to be positive or negative, and roughly how strong.
\[r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)\]
c. The actual correlation for all 8 penguins is \(r = 0.97\). Interpret this value in one sentence.
d. \(r = 0.97\) is extremely high. Name one reason we should be cautious about drawing strong conclusions from a sample of only 8 penguins, even with such a high correlation.
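One way to check your hand calculations in Question 2 is a short script that applies the correlation formula above term by term. This is a minimal sketch in plain Python (not the R used elsewhere in this assignment); the function names `mean`, `sample_sd`, and `correlation` are illustrative choices, not part of any course library.

```python
from math import sqrt

# Data from the Question 2 table (8 Gentoo penguins)
flipper_mm = [211, 214, 215, 217, 220, 221, 228, 232]
mass_g = [4500, 4700, 4425, 4875, 5200, 5050, 5550, 5850]


def mean(values):
    """Arithmetic mean."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Return (r, standardized products) using the z-score formula."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    ]
    return sum(products) / (len(xs) - 1), products


r, products = correlation(flipper_mm, mass_g)

print(f"x_bar = {mean(flipper_mm):.2f} mm, s_x = {sample_sd(flipper_mm):.2f} mm")
print(f"y_bar = {mean(mass_g):.2f} g,  s_y = {sample_sd(mass_g):.2f} g")
print(f"standardized product, penguin 1: {products[0]:.3f}")
print(f"standardized product, penguin 8: {products[7]:.3f}")
print(f"r = {r:.3f}")
```

Penguins 1 and 8 sit at opposite ends of both variables, so both of their standardized products are positive; that is what pushes \(r\) toward +1.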
Question 3: Fitting and Interpreting a Regression Line
Using the full Palmer Penguins dataset (344 penguins, of which 342 have both measurements recorded), the following regression output was obtained:
Call:
lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)       -5780.83     305.81  -18.90   <2e-16 ***
flipper_length_mm    49.69       1.52   32.72   <2e-16 ***
Residual standard error: 394.1 on 340 degrees of freedom
Multiple R-squared: 0.7592, Adjusted R-squared: 0.7585
F-statistic: 1071 on 1 and 340 DF, p-value: < 2.2e-16
a. Write the regression equation in the form \(\hat{y} = b_0 + b_1 x\).
b. Interpret the slope in one complete sentence in biological context. Be precise: include the units and the direction.
c. The y-intercept is −5780.83 g. What would this mean literally (a penguin with flipper length = 0)? Why is it not meaningful to interpret this value in context?
d. Use the regression equation to predict the body mass of a penguin with a flipper length of 195 mm. Show your work.
e. Dr. Fischer wants to predict the body mass of a penguin with a flipper length of 320 mm (much longer than any observed penguin). What is the predicted value? Why should we be skeptical of this prediction?
f. Interpret \(R^2 = 0.759\) in context. What does the remaining 24.1% represent?
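The predictions in parts (d) and (e) are plain substitutions into the fitted line. The sketch below uses the coefficients from the regression output above; the constant and function names are illustrative. Note that the code will happily return a number for any flipper length, so it is up to you to judge whether the input lies within the range of the observed data.

```python
# Coefficients copied from the regression output in Question 3
INTERCEPT_G = -5780.83      # b0, in grams
SLOPE_G_PER_MM = 49.69      # b1, in grams per millimetre of flipper length


def predict_mass_g(flipper_length_mm):
    """Predicted body mass (g) from the fitted line y-hat = b0 + b1 * x."""
    return INTERCEPT_G + SLOPE_G_PER_MM * flipper_length_mm


# Part (d): a flipper length of 195 mm
mass_195 = predict_mass_g(195)

# Part (e): 320 mm is described in the question as far beyond any observed
# penguin, so this value is an extrapolation, not a trustworthy prediction.
mass_320 = predict_mass_g(320)

print(f"Predicted mass at 195 mm: {mass_195:.2f} g")
print(f"Predicted mass at 320 mm: {mass_320:.2f} g (extrapolation)")
```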
Question 4: Residuals and Model Checking
a. A Chinstrap penguin has a flipper length of 200 mm and an actual body mass of 3700 g. Using your regression equation from Question 3, calculate the residual for this penguin. Is this penguin heavier or lighter than predicted?
b. Explain in your own words what a residual plot (residuals vs. fitted values) is used for, and describe what you would hope to see if the linear model is appropriate.
c. For each residual plot description below, state whether the linear model appears appropriate and explain what the pattern (if any) suggests:
- Plot I: Points scatter randomly around zero with no pattern.
- Plot II: The residuals form a clear curve — negative for small fitted values, positive in the middle, then negative again for large fitted values.
- Plot III: The spread of the residuals increases as the fitted values increase (fan shape opening to the right).
d. The conditions for regression inference are sometimes remembered as LINE. State what each letter stands for and briefly describe how you would check each condition for the penguin regression.
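For part (a), a residual is the observed value minus the value the line predicts. The sketch below computes it for the Chinstrap penguin described there, using the Question 3 coefficients; the variable names are illustrative.

```python
# Coefficients from the Question 3 regression output
INTERCEPT_G = -5780.83
SLOPE_G_PER_MM = 49.69

# The Chinstrap penguin from Question 4(a)
flipper_length_mm = 200
observed_mass_g = 3700

# Fitted (predicted) value from the regression line
predicted_mass_g = INTERCEPT_G + SLOPE_G_PER_MM * flipper_length_mm

# Residual = observed - predicted; a negative value means the penguin
# is lighter than the line predicts, a positive value means heavier.
residual_g = observed_mass_g - predicted_mass_g

direction = "lighter" if residual_g < 0 else "heavier"
print(f"Predicted mass: {predicted_mass_g:.2f} g")
print(f"Residual: {residual_g:.2f} g ({direction} than predicted)")
```

A residual plot is simply this calculation repeated for every penguin, with the residuals plotted against the fitted values.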
Question 5: Outliers and Influential Points
a. Explain the difference between an outlier (a point with a large residual) and an influential point (a point, often one with high leverage, whose removal noticeably changes the regression line). Can a point be one without the other? Give an example.
b. Suppose a data entry error introduced one penguin into the dataset with flipper length = 350 mm and body mass = 6500 g — values far outside the range of the rest of the data.
- Would this point have high leverage? Why?
- Would this point likely be influential? What might happen to the slope?
- What should the researcher do before re-running the analysis?
c. After discovering the error in (b), the researcher fits the regression without the erroneous point and gets slope = 49.7 g/mm (R² = 0.759). With the erroneous point included, the slope drops to 38.4 g/mm (R² = 0.68). What does this tell us about the influence of that single observation?
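To see how one high-leverage point can drag a slope, the sketch below refits a line with and without the erroneous penguin from part (b). It is an illustration only: it uses the 8-penguin sample from Question 2 as a stand-in for the full dataset, so its slopes will not match the 49.7 and 38.4 g/mm quoted in part (c). The helper names `least_squares` and `leverage` are illustrative.

```python
# Small Gentoo sample from Question 2 (stand-in for the full dataset)
flipper_mm = [211, 214, 215, 217, 220, 221, 228, 232]
mass_g = [4500, 4700, 4425, 4875, 5200, 5050, 5550, 5850]

# The erroneous record described in Question 5(b)
bad_x, bad_y = 350, 6500


def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def leverage(xs, i):
    """Leverage of observation i in simple regression: 1/n + (x_i - x_bar)^2 / Sxx."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return 1 / n + (xs[i] - x_bar) ** 2 / sxx


_, slope_clean = least_squares(flipper_mm, mass_g)

xs_bad = flipper_mm + [bad_x]
ys_bad = mass_g + [bad_y]
_, slope_bad = least_squares(xs_bad, ys_bad)
h_bad = leverage(xs_bad, len(xs_bad) - 1)

print(f"Slope without the erroneous point: {slope_clean:.2f} g/mm")
print(f"Slope with the erroneous point:    {slope_bad:.2f} g/mm")
print(f"Leverage of the erroneous point:   {h_bad:.3f}")
```

Leverage depends only on how far a point's x-value is from the mean of x. Whether that leverage turns into influence depends on whether the point's y-value agrees with the trend of the other points.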
Question 6: Back to the DASH Study — Correlation and Regression
The DASH study (Appel et al., 1997) examined the relationship between sodium intake (mg/day) and systolic blood pressure (mmHg) across participants.
a. Suppose the correlation between sodium intake and systolic blood pressure is \(r = +0.43\). Interpret this correlation in context.
b. A simple linear regression gives: \(\widehat{\text{SBP}} = 98.2 + 0.009 \times \text{sodium}\) (in mg/day).
Interpret the slope. If one participant’s sodium intake is 1000 mg/day higher than another’s, what is the predicted difference in systolic blood pressure?
c. \(R^2 = 0.185\) for this model. Interpret this value. Does this mean sodium is not related to blood pressure? Explain.
d. Why would it be inappropriate to conclude from this regression that reducing sodium intake will cause a reduction in blood pressure, even if the slope is statistically significant? What kind of study would be needed to establish causation?
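The arithmetic behind parts (b) and (c) can be checked in a few lines. This sketch uses the slope and correlation stated in the question; it relies on the fact that in simple linear regression \(R^2\) equals the square of the correlation \(r\).

```python
# Values stated in Question 6
slope_mmhg_per_mg = 0.009   # predicted SBP change per 1 mg/day of sodium
r = 0.43                    # correlation between sodium intake and SBP

# Part (b): predicted SBP difference for a 1000 mg/day difference in sodium
sodium_difference_mg = 1000
predicted_sbp_difference = slope_mmhg_per_mg * sodium_difference_mg

# Part (c): with one predictor, R^2 is the square of r
r_squared = r ** 2

print(f"Predicted SBP difference: {predicted_sbp_difference:.1f} mmHg")
print(f"r^2 = {r_squared:.3f}")
```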
Question 7: When t-Tests Multiply — The Problem of Multiple Comparisons
A researcher wants to compare mean resting heart rate across five exercise groups: sedentary, light, moderate, vigorous, and elite athletes.
a. How many pairwise t-tests would be required to compare every pair of groups?
b. If each t-test uses \(\alpha = 0.05\), calculate the probability of making at least one Type I error across all tests. Use \(P(\text{at least one error}) = 1 - (1 - 0.05)^k\), where \(k\) is the number of tests. (This formula treats the \(k\) tests as independent, so read the result as an approximation.)
c. What does your answer to (b) tell you about the reliability of this approach?
d. What test should the researcher use instead? State the null and alternative hypotheses for this test in the context of comparing heart rates across five groups.
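The counting and probability steps in parts (a) and (b) can be sketched as follows. The code applies the formula given in part (b), which treats the tests as independent. The Bonferroni line is an extra illustration of one common correction and is not something the question asks for.

```python
from math import comb

n_groups = 5    # sedentary, light, moderate, vigorous, elite
alpha = 0.05    # significance level for each individual t-test

# Part (a): number of distinct pairs among the groups, "5 choose 2"
n_tests = comb(n_groups, 2)

# Part (b): chance of at least one Type I error across all tests,
# assuming every null hypothesis is true and the tests are independent
familywise_error = 1 - (1 - alpha) ** n_tests

# Illustration only: the per-test level a Bonferroni correction would use
bonferroni_alpha = alpha / n_tests

print(f"Pairwise tests needed: {n_tests}")
print(f"P(at least one Type I error) = {familywise_error:.3f}")
print(f"Bonferroni per-test alpha = {bonferroni_alpha:.4f}")
```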
Question 8: ANOVA — Penguin Bill Length
Dr. Fischer wants to know whether mean bill length (mm) differs across the three penguin species. Here is the ANOVA output:
             Df Sum Sq Mean Sq F value Pr(>F)
species       2   7194    3597   397.3 <2e-16 ***
Residuals   339   3069       9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
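And summary statistics by species (two penguins with missing bill measurements are excluded, which is why the ANOVA uses 342 observations):
| Species | n | Mean Bill Length | SD |
|---|---|---|---|
| Adelie | 151 | 38.8 mm | 2.7 mm |
| Chinstrap | 68 | 48.8 mm | 3.3 mm |
| Gentoo | 123 | 47.5 mm | 3.1 mm |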
a. State the null and alternative hypotheses for this ANOVA in symbols and words.
b. What are the degrees of freedom for “species” and for “residuals”? Show how each is calculated from the study design.
c. Verify that \(F = \text{Mean Sq}_{\text{species}} / \text{Mean Sq}_{\text{residuals}} = 397.3\). Show your calculation. (The output rounds the residual mean square to 9; use the unrounded value \(3069 / 339\).)
d. Interpret the F-statistic conceptually: what does a large F tell us about variation between vs. within species?
e. State your conclusion at \(\alpha = 0.001\). Write your conclusion in one sentence using biological language.
f. Check the equal variance condition. Is it reasonably satisfied for this dataset? Use the summary statistics to justify your answer.
g. A classmate concludes: “Since the ANOVA was significant, Adelie penguins have shorter bills than Gentoo penguins.” Is this a valid conclusion from the ANOVA? Explain why or why not.
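The F-ratio in part (c) and the equal-variance check in part (f) can both be reproduced from numbers already on the page. The sketch below is a check under two assumptions: it takes the sums of squares and degrees of freedom from the ANOVA output as given, and it uses a common rule of thumb (largest SD no more than about twice the smallest) that your course may state differently.

```python
# From the ANOVA output in Question 8
ss_species, df_species = 7194, 2
ss_resid, df_resid = 3069, 339

# Degrees of freedom imply the number of groups and observations used
n_groups = df_species + 1                 # k - 1 = 2  ->  k = 3
n_total = df_species + df_resid + 1       # N - 1 = 341  ->  N = 342

# Mean squares are sums of squares divided by their degrees of freedom
ms_species = ss_species / df_species
ms_resid = ss_resid / df_resid            # unrounded; the output shows 9

f_stat = ms_species / ms_resid

# Equal-variance rule of thumb, using the SDs in the summary table
group_sds = {"Adelie": 2.7, "Chinstrap": 3.3, "Gentoo": 3.1}
sd_ratio = max(group_sds.values()) / min(group_sds.values())

print(f"k = {n_groups}, N = {n_total}")
print(f"Mean Sq (species) = {ms_species:.1f}, Mean Sq (residuals) = {ms_resid:.3f}")
print(f"F = {f_stat:.1f}")
print(f"Largest SD / smallest SD = {sd_ratio:.2f}")
```

Dividing by the rounded residual mean square of 9 gives a slightly different F, which is why the unrounded value matters when verifying the output.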
Question 9: Connecting Regression and ANOVA
R’s regression summary also reports an F-statistic, the same kind of statistic ANOVA uses. Look at the last line of the regression output from Question 3:
F-statistic: 1071 on 1 and 340 DF, p-value: < 2.2e-16
a. The F-statistic at the bottom of regression output tests \(H_0: \beta_1 = 0\) (no linear relationship). For simple linear regression (one predictor), this F-test gives the same p-value as the t-test on the slope. Verify: \(F = t^2 = (32.72)^2 \approx\) ?
b. In plain language, what is the F-test in regression asking? How does this relate to the F-test in ANOVA?
c. Both ANOVA and regression ask: “Is the variation explained by my model larger than the unexplained variation?” In your own words, explain how this is the same question in two different contexts.
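The identity \(F = t^2\) in part (a) holds for any simple linear regression, not just this one. As an illustration, the sketch below fits a line to the 8-penguin sample from Question 2 (a stand-in, so its F and t differ from the full-data output), builds the ANOVA-style F from explained and unexplained variation, and compares it with the squared t-statistic for the slope. The variable names are illustrative.

```python
from math import sqrt, isclose

# Small Gentoo sample from Question 2
flipper_mm = [211, 214, 215, 217, 220, 221, 228, 232]
mass_g = [4500, 4700, 4425, 4875, 5200, 5050, 5550, 5850]

n = len(flipper_mm)
x_bar = sum(flipper_mm) / n
y_bar = sum(mass_g) / n
sxx = sum((x - x_bar) ** 2 for x in flipper_mm)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(flipper_mm, mass_g))

# Least-squares fit
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in flipper_mm]

# ANOVA-style decomposition: explained vs. unexplained variation
ss_model = sum((f - y_bar) ** 2 for f in fitted)              # 1 df
ss_resid = sum((y - f) ** 2 for y, f in zip(mass_g, fitted))  # n - 2 df
ms_resid = ss_resid / (n - 2)
f_stat = (ss_model / 1) / ms_resid

# t-statistic for the slope: estimate divided by its standard error
t_stat = slope / sqrt(ms_resid / sxx)

print(f"F = {f_stat:.2f}, t = {t_stat:.2f}, t^2 = {t_stat ** 2:.2f}")

# The same check using the rounded values printed in Question 3
t_reported = 32.72
print(f"Reported t squared: {t_reported ** 2:.1f} (reported F: 1071)")
```

The small gap between the squared reported t and the reported F comes from rounding in the printed output.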
Question 10: Synthesis and Critical Thinking
a. You fit a regression and get \(R^2 = 0.92\) and a significant slope (p < 0.001). A colleague says: “Great! Our model explains 92% of the variability, so it’s practically perfect.” What three important checks should you still perform before accepting this conclusion?
b. You fit a regression and find \(R^2 = 0.15\) (slope still significant). Is this model useless? Describe a scenario in biology or health science where a model with low \(R^2\) but a statistically significant slope would still be scientifically valuable.
c. A researcher runs an ANOVA comparing blood pressure across three diet groups (DASH, Mediterranean, Standard American Diet) and gets p = 0.003. She immediately concludes: “The DASH diet is best.” List three things she cannot conclude from ANOVA alone, and describe what additional analysis she would need.
💭 Question 11: Detective’s Reflection
Write 6–8 sentences reflecting on the week’s material:
- How does a scatterplot help us before we calculate a correlation or fit a regression? Why not skip straight to the numbers?
- Explain in your own words why correlation does not imply causation. Use an example from this assignment or from your own field of study.
- What is the connection between \(r^2\) (correlation squared) and \(R^2\) from regression? What does this connection tell us?
- Why is checking a residual plot an essential step — not optional — after fitting a regression?
- What is the “multiple comparisons problem,” and how does a single ANOVA F-test address it? Why can’t researchers just use Bonferroni-corrected t-tests all the time?
🎉 Excellent work, Statistical Detective! From scatterplots to ANOVA, you’ve built a powerful toolkit for understanding relationships in biological data. Dr. Fischer now knows not just that her penguin species differ — but exactly how to quantify and test those differences with rigor.
Remember: A good statistician always checks their assumptions — before celebrating a significant p-value!
