HW6: Paired Tests, Independent Samples, and Power

Designing Studies That Change the World

🥗 The Case

Welcome back, Statistical Detective!

Dr. Maya Patel, Director of Nutrition Research at the National Institutes of Health, has a challenge for you. “Every day, people make dietary choices based on what they think is healthy,” she explains. “But how do we know a diet actually works? How do we design studies that give us trustworthy answers — and make sure those studies are powerful enough to detect the effects we care about?”

She hands you a thick folder. “In 1997, a landmark study called the DASH trial changed national dietary guidelines for millions of Americans. Today, your job is to understand why that study was so successful — and to apply those same methods to new questions.”

Your mission: Use paired tests, independent samples comparisons, and power analysis to evaluate dietary interventions and help researchers design studies that are both statistically rigorous and clinically meaningful!

Question 1: Paired vs. Independent Samples

Before diving into calculations, it’s critical to identify the right type of test for the right design.

a. Identify the design

For each scenario below, determine whether a paired or independent samples design is appropriate. Explain your reasoning in 1–2 sentences.

Scenario 1: A nutritionist wants to compare the DASH diet to a Mediterranean diet. She randomly assigns 60 participants to either DASH (n = 30) or Mediterranean (n = 30), then measures blood pressure changes after 8 weeks.

Scenario 2: A physical therapist measures grip strength in the dominant and non-dominant hand of 25 patients recovering from stroke to assess symmetry of recovery.

Scenario 3: An environmental scientist measures dissolved oxygen levels in 15 streams before and after implementing riparian buffer zones to assess water quality improvement.

Scenario 4: A researcher compares cortisol levels between 40 nurses working day shifts and 40 different nurses working night shifts.

Question 2: Dietary Intervention — Paired t-Test

Dr. Patel’s team tests whether a plant-based diet reduces inflammation. Twelve participants have their C-reactive protein (CRP, an inflammation marker) measured at baseline and again after 6 weeks on the diet.

Data (CRP in mg/L):

Participant	Baseline	Week 6
1	3.8	2.9
2	4.5	3.2
3	3.2	2.8
4	5.1	3.9
5	4.2	3.5
6	3.9	3.1
7	4.7	3.6
8	3.5	2.7
9	4.8	3.8
10	4.1	3.3
11	3.7	2.9
12	4.6	3.4

a. Calculate the difference for each participant (Baseline − Week 6). Then calculate \(\bar{d}\) and \(s_d\).

b. State the null and alternative hypotheses for testing whether the diet reduces CRP levels.

c. Check the conditions for a paired t-test. Are they reasonably satisfied?

d. Calculate the test statistic and find the p-value (you may use software or tables). Use \(\alpha = 0.05\) for a one-sided test.

e. Write a conclusion addressing both statistical and practical significance. Note: CRP reductions of 0.5+ mg/L are considered clinically meaningful.

f. Calculate a 95% confidence interval for the mean reduction in CRP.
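To check hand calculations for parts (a), (d), and (f), the arithmetic can be sketched in a few lines of Python. This is only one way to organize the computation. It uses the standard library, and the critical value 2.201 is an assumption taken from a t-table for df = 11 rather than something computed by the code. The p-value still needs software or a table.

```python
import math
import statistics

# CRP data (mg/L) copied from the table in Question 2
baseline = [3.8, 4.5, 3.2, 5.1, 4.2, 3.9, 4.7, 3.5, 4.8, 4.1, 3.7, 4.6]
week6 = [2.9, 3.2, 2.8, 3.9, 3.5, 3.1, 3.6, 2.7, 3.8, 3.3, 2.9, 3.4]

# Differences are Baseline - Week 6, so a positive value means CRP went down
diffs = [b - w for b, w in zip(baseline, week6)]
n = len(diffs)

d_bar = statistics.mean(diffs)    # mean difference
s_d = statistics.stdev(diffs)     # sample SD of the differences (divides by n - 1)
se = s_d / math.sqrt(n)           # standard error of the mean difference
t_stat = d_bar / se               # test statistic under H0: mu_d = 0

# Assumption: two-sided 95% critical value from a t-table, df = n - 1 = 11
T_CRIT_95_DF11 = 2.201
ci_low = d_bar - T_CRIT_95_DF11 * se
ci_high = d_bar + T_CRIT_95_DF11 * se

print(f"d_bar = {d_bar:.4f}, s_d = {s_d:.4f}, SE = {se:.4f}")
print(f"t = {t_stat:.2f} with df = {n - 1}")
print(f"95% CI for mean reduction: ({ci_low:.3f}, {ci_high:.3f})")
```

Comparing the interval to the 0.5 mg/L threshold from part (e) is one way to connect the statistical result to practical significance.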

Question 3: Comparing Two Programs — Independent Samples

A researcher compares two weight-loss programs. Twenty participants are randomly assigned to Program A (behavioral modification) and a different twenty to Program B (meal replacement). Weight loss (kg) after 12 weeks is recorded.

Summary Statistics:

Program A: \(\bar{x}_1 = 5.8\) kg, \(s_1 = 2.3\) kg, \(n_1 = 20\)
Program B: \(\bar{x}_2 = 7.2\) kg, \(s_2 = 2.8\) kg, \(n_2 = 20\)

a. State the null and alternative hypotheses for testing whether the programs differ in effectiveness (two-sided test).

b. Calculate the standard error of the difference in means.

c. Calculate the test statistic. Use \(df = \min(n_1 - 1, n_2 - 1)\) for simplicity.

d. Find the p-value for this two-sided test.

e. At \(\alpha = 0.05\), what is your decision and conclusion?

f. Calculate and interpret a 95% confidence interval for the difference in mean weight loss between programs.
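The summary statistics above are enough to check parts (b), (c), and (f) numerically. The sketch below uses the unpooled standard error and the conservative df from part (c). The critical value 2.093 is an assumption read from a t-table for df = 19, and the variable names are illustrative only.

```python
import math

# Summary statistics from Question 3
x1, s1, n1 = 5.8, 2.3, 20   # Program A (behavioral modification)
x2, s2, n2 = 7.2, 2.8, 20   # Program B (meal replacement)

# Unpooled standard error of the difference in sample means
se_diff = math.sqrt(s1**2 / n1 + s2**2 / n2)

# Test statistic for H0: mu_1 - mu_2 = 0
diff = x1 - x2
t_stat = diff / se_diff

# Conservative degrees of freedom, as instructed in part (c)
df = min(n1 - 1, n2 - 1)

# Assumption: two-sided 95% critical value from a t-table, df = 19
T_CRIT_95_DF19 = 2.093
ci_low = diff - T_CRIT_95_DF19 * se_diff
ci_high = diff + T_CRIT_95_DF19 * se_diff

print(f"SE = {se_diff:.4f}, t = {t_stat:.3f}, df = {df}")
print(f"95% CI for mu_A - mu_B: ({ci_low:.3f}, {ci_high:.3f})")
```

Whether the interval contains zero should agree with the decision reached in part (e).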

Question 4: Back to the DASH Study — Statistical vs. Practical Significance

Let’s look closely at what made the DASH study so impactful.

The original DASH study (Appel et al., 1997) enrolled 459 participants. One key comparison showed:

DASH diet: Mean systolic BP reduction = 5.5 mmHg
Control diet: Mean systolic BP reduction = 0.9 mmHg
Difference: 4.6 mmHg (p < 0.001)

Meanwhile, a separate large-scale study with 2,000 participants testing a mindfulness app found:

Mean BP reduction = 0.8 mmHg (p = 0.01)

a. Both results are statistically significant (p < 0.05). Explain what statistical significance means for each study — what is each p-value actually telling us?

b. Which result is more practically/clinically significant? Explain your reasoning. Note: BP reductions of 5+ mmHg are generally considered clinically meaningful.

c. The mindfulness app study had a much larger sample size. Explain the mechanism by which large samples can make very small effects statistically significant, even if they’re not clinically important.

d. If you were a physician advising a patient with elevated blood pressure, which intervention would you recommend and why?
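For part (c), a small numerical illustration may help. The sketch below holds a 0.8 mmHg effect fixed and varies only the sample size. The standard deviation of 10 mmHg is a hypothetical value chosen for illustration (the problem does not report one), so the p-values are not meant to reproduce the app study's p = 0.01. A normal approximation is used in place of the t-distribution.

```python
import math
from statistics import NormalDist

def two_sided_p(effect, sd, n):
    """Two-sided p-value for a mean effect, using a normal approximation."""
    se = sd / math.sqrt(n)        # standard error shrinks as n grows
    z = effect / se               # same effect, larger z when SE is smaller
    return 2 * (1 - NormalDist().cdf(abs(z)))

EFFECT = 0.8         # mean BP reduction (mmHg) from the mindfulness app study
ASSUMED_SD = 10.0    # hypothetical SD, NOT given in the problem

for n in (50, 500, 2000):
    p = two_sided_p(EFFECT, ASSUMED_SD, n)
    print(f"n = {n:5d}: SE = {ASSUMED_SD / math.sqrt(n):.3f}, p = {p:.4f}")
```

The effect never changes in this loop. Only the standard error does, which is the mechanism the question asks about.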

Question 5: Power Analysis — Omega-3 Supplementation

Dr. Patel’s team wants to study whether omega-3 supplements reduce triglyceride levels. Based on previous research, the standard deviation of triglyceride changes is approximately \(\sigma = 25\) mg/dL.

Part A: Calculating Power

The team wants to detect a mean reduction of \(\Delta = 20\) mg/dL using a two-sided test at \(\alpha = 0.05\). They plan a paired design with each participant measured before and after 12 weeks of supplementation.

a. If the study uses \(n = 30\) participants, calculate the standard error of \(\bar{d}\).

b. For a two-sided test at \(\alpha = 0.05\), the rejection region boundaries are approximately \(\pm 1.96 \times SE\) from zero. Calculate these boundaries.

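c. If the true mean reduction is 20 mg/dL, convert the upper boundary from (b) to a z-score under this alternative distribution. (Treat a reduction as a positive mean difference, before − after, so the upper boundary is the one the alternative distribution is centered beyond.)

\[z = \frac{\text{boundary} - 20}{SE}\]

d. The power is approximately \(P(Z > z)\) for the z-score from part (c). Estimate the power. Is it adequate (≥ 80%)?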
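The steps in Part A can be sketched in Python as a check on hand work. The sketch treats the reduction as a positive mean difference of 20 mg/dL, so power is the probability that the sample mean difference lands above the boundary at +1.96 × SE. The tiny chance of rejecting in the opposite tail is ignored, and a normal approximation is assumed throughout.

```python
import math
from statistics import NormalDist

sigma = 25.0    # SD of triglyceride changes (mg/dL)
delta = 20.0    # true mean reduction under the alternative (mg/dL)
n = 30          # planned number of participants
z_crit = 1.96   # two-sided critical value at alpha = 0.05

# (a) standard error of the mean difference
se = sigma / math.sqrt(n)

# (b) rejection region boundaries under H0 (centered at zero)
upper = z_crit * se
lower = -upper

# (c) z-score of the relevant boundary when the true mean is delta
z_under_alt = (upper - delta) / se

# (d) power: probability the sample mean falls beyond that boundary
power = 1 - NormalDist().cdf(z_under_alt)

print(f"SE = {se:.3f}")
print(f"boundaries = ({lower:.3f}, {upper:.3f})")
print(f"z under alternative = {z_under_alt:.3f}")
print(f"power = {power:.4f}")
```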

Part B: Finding the Right Sample Size

e. The team decides they want 85% power. Use the relationship:

\[(z_{1-\alpha/2} + z_{1-\beta}) \times SE = \Delta\]

where \(z_{1-\alpha/2} = 1.96\) and \(z_{1-\beta} = 1.04\) (for 85% power). Calculate the required standard error.

f. Using \(SE = \frac{\sigma}{\sqrt{n}}\) with \(\sigma = 25\), solve for the required sample size. Remember to round up!

g. In plain language, what does “85% power” mean for this study?
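Parts (e) and (f) amount to two lines of algebra, sketched below as a check. The variable names are illustrative, and the z-values are the ones given in the problem.

```python
import math

sigma = 25.0     # SD of triglyceride changes (mg/dL)
delta = 20.0     # mean reduction to detect (mg/dL)
z_alpha = 1.96   # z_{1 - alpha/2} for a two-sided test at alpha = 0.05
z_beta = 1.04    # z_{1 - beta} for 85% power

# (e) solve (z_alpha + z_beta) * SE = delta for SE
required_se = delta / (z_alpha + z_beta)

# (f) solve SE = sigma / sqrt(n) for n, then round up to a whole participant
n_exact = (sigma / required_se) ** 2
n_required = math.ceil(n_exact)

print(f"required SE = {required_se:.3f}")
print(f"n = {n_exact:.2f}, rounded up to {n_required}")
```

Rounding up rather than to the nearest integer keeps the achieved power at or above the target.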

Question 6: Paired vs. Independent — Why Design Matters

Two research teams study the same question — does exercise reduce depression? — but use different designs.

Study A (Paired Design):

25 participants with mild depression
Depression inventory administered before and after an 8-week exercise program
Mean reduction: 8 points (0–50 scale), SD of differences: 6 points
\(t = \frac{8}{6/\sqrt{25}} = 6.67\), df = 24, p < 0.001

Study B (Independent Samples Design):

25 participants in an exercise group, 25 different participants on a waitlist (no treatment)
Exercise group mean: 28 points (SD = 9); Waitlist group mean: 35 points (SD = 10)
Mean difference: 7 points
\(t = \frac{7}{\sqrt{9^2/25 + 10^2/25}} = 2.60\), df = 48, p = 0.012

a. Both studies found similar effect sizes (7–8 point reduction). Why did Study A produce a much smaller p-value even though it enrolled fewer participants (25 vs. 50 in Study B)?

b. What specific advantage does the paired design offer in this context?

c. Describe a research question about depression treatment where an independent samples design would be necessary rather than a paired design. Be specific.

d. If you were planning a follow-up study, which design would you choose and why? Consider both statistical power and practical constraints.
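Recomputing both test statistics side by side can make the comparison in part (a) concrete. The sketch below uses only the summary numbers reported for Study A and Study B, with the unpooled standard error for Study B. Because the two groups have equal sizes, the pooled version gives the same value.

```python
import math

# Study A: paired design, one set of within-person differences
n_a = 25
mean_diff_a = 8.0    # mean reduction in depression score
sd_diff_a = 6.0      # SD of the within-person differences
se_a = sd_diff_a / math.sqrt(n_a)
t_a = mean_diff_a / se_a

# Study B: two independent groups
n_ex, mean_ex, sd_ex = 25, 28.0, 9.0     # exercise group
n_wl, mean_wl, sd_wl = 25, 35.0, 10.0    # waitlist group
se_b = math.sqrt(sd_ex**2 / n_ex + sd_wl**2 / n_wl)
t_b = (mean_wl - mean_ex) / se_b

print(f"Study A: SE = {se_a:.2f}, t = {t_a:.2f}")
print(f"Study B: SE = {se_b:.2f}, t = {t_b:.2f}")
```

The numerators are nearly equal, so the difference between the two t statistics comes almost entirely from the standard errors.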

Question 7: Checking Conditions and Study Validity

Good statistics requires checking conditions before drawing conclusions. For each scenario, identify which condition(s) for inference might be violated, explain the potential consequence, and suggest a remedy.

a. A researcher measures anxiety scores in 10 patients before and after meditation training. The differences (after − before) are: −5, −3, −4, −6, −2, +15, −3, −4, −5, −3.

b. A study compares vitamin D levels between outdoor workers (recruited from a single construction site) and indoor workers (recruited from a single office building). The goal is to determine if occupation affects vitamin D.

c. A researcher compares morning vs. evening blood pressure in 8 participants by measuring each person once in the morning and once in the evening on the same day. She plans to use an independent samples t-test.
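For scenario (a), a quick numerical look at the differences can support the condition check. The sketch below compares summary statistics with and without the +15 value. Dropping an observation here is purely a sensitivity check for illustration, not a recommended remedy.

```python
import statistics

# Differences (after - before) from scenario (a)
diffs = [-5, -3, -4, -6, -2, 15, -3, -4, -5, -3]

# Same data with the single +15 value set aside, for comparison only
without_extreme = [d for d in diffs if d != 15]

for label, data in (("all 10 values", diffs), ("without +15", without_extreme)):
    print(
        f"{label}: mean = {statistics.mean(data):.2f}, "
        f"median = {statistics.median(data):.2f}, "
        f"SD = {statistics.stdev(data):.2f}"
    )
```

With n = 10, one extreme value can move the mean and inflate the standard deviation considerably, which matters for a t-procedure that relies on approximate normality of the differences.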

Question 8: Planning a Nutrition Study — Power and Tradeoffs

Dr. Patel wants to study whether increasing vegetable intake improves gut health, measured by a microbiome diversity score. Previous studies suggest:

Standard deviation: \(\sigma = 12\) points
Minimum important difference: \(\Delta = 8\) points
Budget allows for a maximum of 40 participants total

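a. Calculate the sample size per group needed for 80% power in a two-group (independent samples) comparison, using a two-sided test at \(\alpha = 0.05\). The factor of 2 appears because the difference of two independent group means has variance \(2\sigma^2/n\):

\[n = \frac{2\sigma^2(z_{1-\alpha/2} + z_{1-\beta})^2}{\Delta^2}\]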

where \(z_{1-\alpha/2} = 1.96\) and \(z_{1-\beta} = 0.84\).
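As a check on part (a), the calculation can be sketched as below. It assumes the standard two-independent-group formula, in which the variance term carries a factor of 2 because the difference of two group means has variance 2σ²/n. The budget comparison uses the 40-participant limit stated above.

```python
import math

sigma = 12.0     # SD of the microbiome diversity score
delta = 8.0      # minimum important difference
z_alpha = 1.96   # z_{1 - alpha/2}, two-sided alpha = 0.05
z_beta = 0.84    # z_{1 - beta}, 80% power
BUDGET_TOTAL = 40

# Per-group sample size for comparing two independent means
n_exact = 2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2
n_per_group = math.ceil(n_exact)    # always round up
n_total = 2 * n_per_group

print(f"n per group = {n_exact:.2f}, rounded up to {n_per_group}")
print(f"total = {n_total}, budget = {BUDGET_TOTAL}, "
      f"over budget: {n_total > BUDGET_TOTAL}")
```

Changing `sigma`, `delta`, or `z_beta` in this sketch is a quick way to explore the tradeoffs listed in part (b).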

b. The required sample size exceeds the budget. Suggest three ways Dr. Patel could modify the study design to make it feasible. For each, describe the tradeoff involved.

Options to consider: accepting lower power; targeting a larger effect size; switching to a paired design; reducing variability through stricter inclusion criteria; seeking additional funding; changing the significance level.

c. Which modification would you recommend and why?

Question 9: The DASH Study’s Legacy

Read this excerpt:

“The DASH trial enrolled 459 adults across three diet groups. It detected a 5.5 mmHg reduction in systolic blood pressure for the DASH diet compared to a control diet. This finding, published in the New England Journal of Medicine in 1997, led to national dietary guidelines and has influenced cardiovascular disease prevention for over 25 years.”

a. Why was detecting a 5.5 mmHg reduction — rather than, say, a 1 mmHg reduction — so important for the study’s impact on public health policy?

b. The study compared three diet groups rather than just DASH vs. control. What statistical and scientific advantage does including multiple groups provide?

c. In 2–3 sentences, what does the DASH study’s lasting impact teach us about the relationship between rigorous statistical design and scientific influence?

Question 10: Synthesis and Critical Thinking

a. A colleague presents their results at a lab meeting and says: “I got p < 0.05, so my intervention definitely works!” What three important follow-up questions would you ask before accepting this conclusion?

b. You are planning a study and discover that, given your budget, you can only achieve 65% power. Should you proceed? Under what circumstances might this be acceptable — and when would it not be?

💭 Question 11: Detective’s Reflection

Reflect on what you’ve learned this week (6–8 sentences):

Why is it critical to match your statistical test to your study design (paired vs. independent)?
Explain the difference between statistical and practical significance using an example from this homework.
Why do researchers calculate power before collecting data rather than after?
What would happen if every researcher used 50% power for their studies? What would the published scientific literature look like?
How does the DASH study illustrate the connection between careful statistical planning and real-world impact?

🎉 Excellent work, Statistical Detective! The DASH study didn’t change the world by accident — it succeeded because researchers asked the right questions, chose the right design, ensured they had enough power to detect a meaningful effect, and interpreted their results with both statistical and clinical rigor.

Remember: Statistical significance tells us whether an effect is unlikely to be due to chance alone. Practical significance tells us whether it is large enough to matter. Great science requires both!