HW7: Correlation, Regression & ANOVA

Patterns, Predictions, and Penguin Science

🐧 The Case

Welcome back, Statistical Detective!

Dr. Elena Fischer, a wildlife ecologist at the Polar Research Institute, has spent five seasons at Palmer Station, Antarctica, collecting data on three penguin species: Adelie, Chinstrap, and Gentoo. She approaches you with a problem.

“I have measurements on hundreds of penguins — flippers, bills, body mass, all of it. But I need help making sense of the relationships. Can flipper length really predict how heavy a penguin is? Are the three species genuinely different? And how do I know if my regression model is any good?”

Your mission this week: use correlation, simple linear regression, and ANOVA to answer Dr. Fischer’s biological questions — and learn to distinguish what the data can and cannot tell us.


Question 1: Reading Scatterplots

Dr. Fischer shows you four scatterplots from her dataset. For each description below, identify the direction, form, and strength of the relationship, and match it to the most plausible correlation coefficient from the list: \(r \in \{-0.22, +0.10, +0.85, +0.87\}\).

Scatterplot A: As flipper length increases, body mass increases strongly. Points cluster closely around a straight line with slight scatter.

Scatterplot B: There is virtually no pattern. Points are scattered randomly across the plot.

Scatterplot C: Birds with deeper bills tend to have slightly shorter bills (bill depth vs. bill length). The downward trend is weak and there is substantial scatter.

Scatterplot D: Body mass and flipper length show a very strong positive trend, almost as strong as Scatterplot A but with slightly more scatter.

a. For each scatterplot, state the direction, form, and strength, and assign the most plausible \(r\) value. Explain your reasoning for each assignment in 1–2 sentences.

b. For Scatterplot A, a student says: “The strong positive correlation proves that longer flippers make penguins heavier.” What is wrong with this reasoning? Provide two alternative explanations for the association.

c. Scatterplot C has a negative correlation. Does a negative correlation indicate a weaker relationship than a positive one? Explain.


Question 2: Calculating and Interpreting Correlation

Dr. Fischer records flipper length (mm) and body mass (g) for a small sample of 8 Gentoo penguins:

Penguin   Flipper Length (mm)   Body Mass (g)
1         211                   4500
2         214                   4700
3         215                   4425
4         217                   4875
5         220                   5200
6         221                   5050
7         228                   5550
8         232                   5850

a. The means and sample standard deviations are provided: flipper length (\(\bar{x} = 219.8\) mm, \(s_x = 7.2\) mm) and body mass (\(\bar{y} = 5019\) g, \(s_y = 500.1\) g). Verify one of these four values by showing the calculation.
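A minimal sketch of how the check in part (a) could be done by computer, assuming the sample standard deviation (dividing by \(n - 1\)) is intended. The function names are illustrative, and the data are copied from the table above.

```python
import math

# Data copied from the table above (8 Gentoo penguins).
flipper_mm = [211, 214, 215, 217, 220, 221, 228, 232]
body_mass_g = [4500, 4700, 4425, 4875, 5200, 5050, 5550, 5850]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation: divides the squared deviations by n - 1."""
    m = mean(values)
    squared_devs = sum((v - m) ** 2 for v in values)
    return math.sqrt(squared_devs / (len(values) - 1))


print(f"flipper: mean = {mean(flipper_mm):.1f} mm, sd = {sample_sd(flipper_mm):.1f} mm")
print(f"mass:    mean = {mean(body_mass_g):.0f} g, sd = {sample_sd(body_mass_g):.1f} g")
```

Doing one of these by hand first, then comparing against the printed values, is a quick way to catch arithmetic slips.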

b. Using the formula below, calculate the standardized product \(z_x z_y\) for penguins 1 and 8 only (the first and last terms of the sum). Then, based on the full table, state whether you’d expect the overall \(r\) to be positive or negative, and roughly how strong.

\[r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)\]
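The formula can be read as "standardize each variable, multiply the pairs, then average over \(n - 1\)." Here is a short sketch of that reading. The function name `correlation` and the toy data are hypothetical and are not part of Dr. Fischer's dataset.

```python
import math


def correlation(xs, ys):
    """Pearson r as the sum of standardized products divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # One standardized product per observation.
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


# Hypothetical toy data, used only to illustrate the formula.
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
r = correlation(toy_x, toy_y)
print(f"r = {r:.3f}")
```

Each entry of `products` is positive when a point lies on the same side of both means and negative otherwise, which is why the sign of \(r\) follows the direction of the trend.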

c. The actual correlation for all 8 penguins is \(r = 0.97\). Interpret this value in one sentence.

d. \(r = 0.97\) is extremely high. Name one reason we should be cautious about drawing strong conclusions from a sample of only 8 penguins, even with such a high correlation.


Question 3: Fitting and Interpreting a Regression Line

Using the full Palmer Penguins dataset (n = 344; 2 penguins with missing measurements are dropped, so 342 are used in the fit), the following regression output was obtained:

Call:
lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -5780.83     305.81  -18.90   <2e-16 ***
flipper_length_mm    49.69       1.52   32.72   <2e-16 ***

Residual standard error: 394.1 on 340 degrees of freedom
Multiple R-squared:  0.7592,   Adjusted R-squared:  0.7585 
F-statistic: 1071 on 1 and 340 DF,  p-value: < 2.2e-16

a. Write the regression equation in the form \(\hat{y} = b_0 + b_1 x\).

b. Interpret the slope in one complete sentence in biological context. Be precise: include the units and the direction.

c. The y-intercept is −5780.83 g. What would this mean literally (a penguin with flipper length = 0)? Why is it not meaningful to interpret this value in context?

d. Use the regression equation to predict the body mass of a penguin with a flipper length of 195 mm. Show your work.

e. Dr. Fischer wants to predict the body mass of a penguin with a flipper length of 320 mm (much longer than any observed penguin). What is the predicted value? Why should we be skeptical of this prediction?

f. Interpret \(R^2 = 0.759\) in context. What does the remaining 24.1% represent?


Question 4: Residuals and Model Checking

a. A Chinstrap penguin has a flipper length of 200 mm and an actual body mass of 3700 g. Using your regression equation from Question 3, calculate the residual for this penguin. Is this penguin heavier or lighter than predicted?
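As a sketch of the arithmetic behind a residual (observed minus predicted), the snippet below uses the coefficients from the Question 3 output. The function names and the example penguin (210 mm, 4500 g) are hypothetical and were chosen so they do not match any penguin in this assignment.

```python
# Coefficients copied from the regression output in Question 3.
INTERCEPT_G = -5780.83      # b0, in grams
SLOPE_G_PER_MM = 49.69      # b1, in grams per mm of flipper length


def predict_body_mass(flipper_mm):
    """Predicted body mass (g) from the fitted line y-hat = b0 + b1 * x."""
    return INTERCEPT_G + SLOPE_G_PER_MM * flipper_mm


def residual(flipper_mm, observed_mass_g):
    """Observed minus predicted; positive means heavier than predicted."""
    return observed_mass_g - predict_body_mass(flipper_mm)


# Hypothetical penguin, not taken from the dataset.
example_flipper_mm = 210
example_mass_g = 4500

predicted = predict_body_mass(example_flipper_mm)
res = residual(example_flipper_mm, example_mass_g)
print(f"predicted = {predicted:.2f} g, residual = {res:.2f} g")
```

A negative residual means the penguin sits below the regression line, and a positive one means it sits above.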

b. Explain in your own words what a residual plot (residuals vs. fitted values) is used for, and describe what you would hope to see if the linear model is appropriate.

c. For each residual plot description below, state whether the linear model appears appropriate and explain what the pattern (if any) suggests:

  • Plot I: Points scatter randomly around zero with no pattern.
  • Plot II: The residuals form a clear curve — negative for small fitted values, positive in the middle, then negative again for large fitted values.
  • Plot III: The spread of the residuals increases as the fitted values increase (fan shape opening to the right).

d. The conditions for regression inference are sometimes remembered as LINE. State what each letter stands for and briefly describe how you would check each condition for the penguin regression.


Question 5: Outliers and Influential Points

a. Explain the difference between an outlier (a point with a large residual) and an influential point (a point, usually with high leverage, whose removal noticeably changes the regression line). Can a point be one without the other? Give an example.

b. Suppose a data entry error introduced one penguin into the dataset with flipper length = 350 mm and body mass = 6500 g — values far outside the range of the rest of the data.

    1. Would this point have high leverage? Why?
    2. Would this point likely be influential? What might happen to the slope?
    3. What should the researcher do before re-running the analysis?
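For intuition about leverage, one standard formula for simple linear regression is \(h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}\), which depends only on the x-values. The sketch below applies it to a small set of hypothetical flipper lengths. The numbers are invented for illustration and are not taken from the dataset.

```python
def leverages(xs):
    """Leverage h_i of each point in a simple (one-predictor) regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]


# Hypothetical flipper lengths (mm); the last value mimics a data entry error.
flipper_mm = [190, 195, 200, 205, 210, 350]
h = leverages(flipper_mm)

for x, h_i in zip(flipper_mm, h):
    print(f"{x} mm: leverage = {h_i:.3f}")
```

In simple regression the leverages always sum to 2, so the average is \(2/n\). A point far above that average has an unusual x-value, whatever its y-value happens to be.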

c. After discovering the error in (b), the researcher fits the regression without the erroneous point and gets slope = 49.7 g/mm (R² = 0.759). With the erroneous point included, the slope drops to 38.4 g/mm (R² = 0.68). What does this tell us about the influence of that single observation?


Question 6: Back to the DASH Study — Correlation and Regression

The DASH study (Appel et al., 1997) examined the relationship between sodium intake (mg/day) and systolic blood pressure (mmHg) across participants.

a. Suppose the correlation between sodium intake and systolic blood pressure is \(r = +0.43\). Interpret this correlation in context.

b. A simple linear regression gives: \(\widehat{\text{SBP}} = 98.2 + 0.009 \times \text{sodium}\) (in mg/day).

Interpret the slope. If a participant increases their daily sodium intake by 1000 mg/day, what is the predicted change in systolic blood pressure?

c. \(R^2 = 0.185\) for this model. Interpret this value. Does this mean sodium is not related to blood pressure? Explain.

d. Why would it be inappropriate to conclude from this regression that reducing sodium intake will cause a reduction in blood pressure, even if the slope is statistically significant? What kind of study would be needed to establish causation?


Question 7: When t-Tests Multiply — The Problem of Multiple Comparisons

A researcher wants to compare mean resting heart rate across five exercise groups: sedentary, light, moderate, vigorous, and elite athletes.

a. How many pairwise t-tests would be required to compare every pair of groups?

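b. If each t-test uses \(\alpha = 0.05\), calculate the probability of making at least one Type I error across all tests. Use \(P(\text{at least one error}) = 1 - (1 - 0.05)^k\), where \(k\) is the number of tests. (This formula treats the tests as independent, so for pairwise comparisons on shared groups it is an approximation.)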
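To see how quickly the error rate grows, here is a small sketch of the formula, assuming independent tests. The function names are illustrative, and the examples use three and four groups rather than the five in this question.

```python
from math import comb


def n_pairwise_tests(n_groups):
    """Number of distinct pairs among n_groups groups."""
    return comb(n_groups, 2)


def familywise_error_rate(n_tests, alpha=0.05):
    """P(at least one Type I error) across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Illustrative group counts (not the five groups in the question).
for groups in (3, 4):
    k = n_pairwise_tests(groups)
    rate = familywise_error_rate(k)
    print(f"{groups} groups -> {k} tests -> P(at least one error) = {rate:.3f}")
```

Even with only a handful of tests, the chance of at least one false positive is well above the 5% set for each individual test.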

c. What does your answer to (b) tell you about the reliability of this approach?

d. What test should the researcher use instead? State the null and alternative hypotheses for this test in the context of comparing heart rates across five groups.


Question 8: ANOVA — Penguin Bill Length

Dr. Fischer wants to know whether mean bill length (mm) differs across the three penguin species. Here is the ANOVA output:

            Df  Sum Sq  Mean Sq  F value   Pr(>F)    
species      2    7194     3597    397.3   <2e-16 ***
Residuals  339    3069        9                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And summary statistics by species (penguins with a recorded bill length):

Species     n     Mean Bill Length   SD
Adelie      151   38.8 mm            2.7 mm
Chinstrap   68    48.8 mm            3.3 mm
Gentoo      123   47.5 mm            3.1 mm

a. State the null and alternative hypotheses for this ANOVA in symbols and words.

b. What are the degrees of freedom for “species” and for “residuals”? Show how each is calculated from the study design.

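c. Verify that \(F = \text{Mean Sq}_{\text{species}} / \text{Mean Sq}_{\text{residuals}} = 397.3\). Show your calculation. (The output rounds the residual mean square to 9; use the unrounded value \(3069 / 339 \approx 9.05\).)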
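One way to organize this check is to rebuild each mean square from its Sum Sq and Df entries and then take the ratio. A minimal sketch follows; the function name `f_statistic` is illustrative, and the inputs are copied from the ANOVA table above.

```python
def f_statistic(ss_between, df_between, ss_within, df_within):
    """F = (between-group mean square) / (within-group mean square)."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within


# Sum Sq and Df values copied from the ANOVA output above.
f_value = f_statistic(ss_between=7194, df_between=2, ss_within=3069, df_within=339)
print(f"F = {f_value:.1f}")
```

Working from the sums of squares avoids the rounding in the printed Mean Sq column.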

d. Interpret the F-statistic conceptually: what does a large F tell us about variation between vs. within species?

e. State your conclusion at \(\alpha = 0.001\). Write your conclusion in one sentence using biological language.

f. Check the equal variance condition. Is it reasonably satisfied for this dataset? Use the summary statistics to justify your answer.

g. A classmate concludes: “Since the ANOVA was significant, Adelie penguins have shorter bills than Gentoo penguins.” Is this a valid conclusion from the ANOVA? Explain why or why not.


Question 9: Connecting Regression and ANOVA

Regression output in R also reports an F-statistic, the same kind of statistic that ANOVA uses. Look at the last line of the regression output from Question 3:

F-statistic: 1071 on 1 and 340 DF,  p-value: < 2.2e-16

a. The F-statistic at the bottom of regression output tests \(H_0: \beta_1 = 0\) (no linear relationship). For simple linear regression (one predictor), this F-test gives the same p-value as the t-test on the slope. Verify: \(F = t^2 = (32.72)^2 \approx\) ?

b. In plain language, what is the F-test in regression asking? How does this relate to the F-test in ANOVA?

c. Both ANOVA and regression ask: “Is the variation explained by my model larger than the unexplained variation?” In your own words, explain how this is the same question in two different contexts.


Question 10: Synthesis and Critical Thinking

a. You fit a regression and get \(R^2 = 0.92\) and a significant slope (p < 0.001). A colleague says: “Great! Our model explains 92% of the variability, so it’s practically perfect.” What three important checks should you still perform before accepting this conclusion?

b. You fit a regression and find \(R^2 = 0.15\) (slope still significant). Is this model useless? Describe a scenario in biology or health science where a model with low \(R^2\) but a statistically significant slope would still be scientifically valuable.

c. A researcher runs an ANOVA comparing blood pressure across three diet groups (DASH, Mediterranean, Standard American Diet) and gets p = 0.003. She immediately concludes: “The DASH diet is best.” List three things she cannot conclude from ANOVA alone, and describe what additional analysis she would need.


💭 Question 11: Detective’s Reflection

Write 6–8 sentences reflecting on the week’s material:

  • How does a scatterplot help us before we calculate a correlation or fit a regression? Why not skip straight to the numbers?
  • Explain in your own words why correlation does not imply causation. Use an example from this assignment or from your own field of study.
  • What is the connection between \(r^2\) (correlation squared) and \(R^2\) from regression? What does this connection tell us?
  • Why is checking a residual plot an essential step — not optional — after fitting a regression?
  • What is the “multiple comparisons problem” and how does ANOVA address it? Why can’t researchers just use Bonferroni-corrected t-tests all the time?

🎉 Excellent work, Statistical Detective! From scatterplots to ANOVA, you’ve built a powerful toolkit for understanding relationships in biological data. Dr. Fischer now knows not just that her penguin species differ — but exactly how to quantify and test those differences with rigor.

Remember: A good statistician always checks their assumptions — before celebrating a significant p-value!


End of Assignment