HW2: Public Health Data Investigation

Understanding Study Design, Data Collection, and Descriptive Statistics

🏥 The Case

Welcome back, Statistical Detective!

Your work has caught the attention of the County Public Health Department. Dr. Sarah Martinez, the Director of Community Health, needs your help analyzing health data and evaluating research studies.

“We’re facing multiple public health challenges,” Dr. Martinez explains. “Childhood obesity rates are climbing, vaccination coverage varies across neighborhoods, and we’re seeing concerning patterns in chronic disease prevalence. We have data from various sources, but we need someone who understands both study design and data analysis to help us make evidence-based decisions. We also need to evaluate published health studies to inform our policies.”

Your mission: Use your knowledge of study design, data collection methods, and descriptive statistics to help the Public Health Department make informed decisions!

Question 1: Recognizing Well-Designed Studies

Dr. Martinez shares three different approaches the department is considering for studying the effectiveness of a new diabetes prevention program.

Study A: The department plans to recruit 200 volunteers who are interested in the program. Half will participate in the 12-week program, and half will be asked to wait. Participants will know which group they’re in, and researchers will measure blood glucose levels before and after.

Study B: The department will randomly assign 300 eligible participants to either the new prevention program or the standard health education program currently offered. Neither participants nor the health educators measuring outcomes will know which program is the “new” one. The study will be conducted at three different community health centers.

Study C: The department will offer the new program at one community center and compare outcomes to another community center that continues with standard care. The same nurse will measure blood glucose for all participants.

a. Key features of study design

For each study (A, B, and C), identify whether the following features are present. Fill in the table:

Feature	Study A	Study B	Study C
Random assignment	Yes / No	Yes / No	Yes / No
Control group	Yes / No	Yes / No	Yes / No
Blinding (participants)	Yes / No	Yes / No	Yes / No
Blinding (outcome assessors)	Yes / No	Yes / No	Yes / No
Replication (multiple sites/groups)	Yes / No	Yes / No	Yes / No

b. Best study design

Which study (A, B, or C) has the strongest design? Explain why, referencing at least THREE specific design features that make it superior.

c. Potential problems

For each study you didn’t choose as strongest:

- Identify at least TWO design weaknesses
- Explain how each weakness could affect the study’s conclusions

Question 2: Identifying Biases in Data Collection

The Public Health Department wants to understand vaccination rates in the community. They’re considering different data collection approaches.

Scenario 1: Post a survey on the department’s Facebook page asking parents about their children’s vaccination status.

Scenario 2: Send a survey to a random sample of households selected from the county’s registered voter list.

Scenario 3: Call households randomly selected from county birth records of children born 2-5 years ago.

Scenario 4: Ask pediatricians to have parents in their waiting rooms fill out surveys.

a. Define bias types

In your own words, define:

Convenience sampling:
Voluntary response bias:
Confounding:

b. Identify biases

For each scenario (1-4), identify what type(s) of bias might be present and explain how this bias could affect the results:

Scenario	Type(s) of Bias	Impact on Conclusions
1
2
3
4

c. Best approach

Which scenario would give the most reliable estimate of vaccination rates in the community? Justify your answer.

d. Real-world confounding

Suppose the department finds that neighborhoods with lower vaccination rates also have higher rates of childhood illness. A researcher concludes: “Low vaccination rates cause higher illness rates.”

Identify at least TWO potential confounding variables that might explain this relationship
Explain how each confounding variable could affect the conclusion

Question 3: Google Sheets - Data Entry and Organization

Dr. Martinez provides you with raw data on childhood BMI (Body Mass Index) measurements from school health screenings. You need to organize this data in Google Sheets.

The Data:

Child 1: Age 8, Height 127cm, Weight 28kg, BMI 17.4
Child 2: Age 9, Height 134cm, Weight 35kg, BMI 19.5
Child 3: Age 7, Height 119cm, Weight 24kg, BMI 17.0
Child 4: Age 8, Height 130cm, Weight 42kg, BMI 24.9
Child 5: Age 9, Height 138cm, Weight 38kg, BMI 19.9
Child 6: Age 7, Height 122cm, Weight 26kg, BMI 17.5
Child 7: Age 8, Height 125cm, Weight 31kg, BMI 19.8
Child 8: Age 9, Height 141cm, Weight 45kg, BMI 22.6
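
The BMI column can be checked against the height and weight columns, since BMI is weight in kilograms divided by the square of height in meters. Below is a minimal Python sketch of that check using Child 1 from the list above; the function name `bmi` is only an illustrative choice and is not part of the assignment.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body Mass Index: weight (kg) divided by height (m) squared."""
    height_m = height_cm / 100  # convert centimeters to meters
    return weight_kg / height_m ** 2

# Child 1 from the data above: height 127 cm, weight 28 kg
child_1_bmi = round(bmi(28, 127), 1)
print(child_1_bmi)  # 17.4
```

The same check in a spreadsheet would divide the weight cell by the squared height cell after converting centimeters to meters.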

a. Create a spreadsheet

Create a Google Sheet with this data. Your spreadsheet should have:

Appropriate column headers
One row per child
Proper data organization (one variable per column)

Take a screenshot of your organized spreadsheet and paste it here.

b. Data entry best practices

List THREE best practices for entering data in a spreadsheet that you followed:

Question 4: Calculating Summary Statistics

Using the BMI data from Question 3, calculate the following using Google Sheets functions. Show both the formula you used and the result.

a. Measures of center

Statistic	Formula in Google Sheets	Result
Mean BMI
Median BMI
Mode (if any)

b. Measures of spread

Statistic	Formula in Google Sheets	Result
Range
Standard Deviation
Q1 (First Quartile)
Q3 (Third Quartile)
IQR

c. Interpretation

Write 2-3 sentences interpreting what these statistics tell Dr. Martinez about BMI in this group of children. Include comments about both the typical BMI and the variation.
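
If you want to cross-check your spreadsheet formulas outside Google Sheets, the sketch below computes the same statistics with Python's standard `statistics` module. The values are illustrative only (not the BMI data), and the Sheets function named in each comment is the assumed counterpart; in particular, `method="inclusive"` is assumed to match how the spreadsheet `QUARTILE` function interpolates.

```python
import statistics

# Illustrative values only (not the assignment's BMI data).
values = [2, 4, 4, 5, 7, 9, 11, 14]

mean = statistics.mean(values)            # Sheets: =AVERAGE(range)
median = statistics.median(values)        # Sheets: =MEDIAN(range)
mode = statistics.mode(values)            # Sheets: =MODE(range)
data_range = max(values) - min(values)    # Sheets: =MAX(range)-MIN(range)
sample_sd = statistics.stdev(values)      # Sheets: =STDEV(range), sample SD

# Quartiles with linear interpolation between data points
# (assumed to correspond to =QUARTILE(range, 1) and =QUARTILE(range, 3)).
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, mode, data_range, round(sample_sd, 2), q1, q3, iqr)
```

Different software can use slightly different quartile conventions, so small differences in Q1 and Q3 between tools are not necessarily errors.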

Question 5: Describing Distributions

a. Blood pressure data

Dr. Martinez shares data on systolic blood pressure (mmHg) from 50 adult patients:

118, 122, 125, 128, 130, 132, 134, 136, 138, 140,
120, 124, 126, 129, 131, 133, 135, 137, 139, 142,
119, 123, 127, 128, 130, 132, 134, 136, 138, 141,
121, 125, 126, 129, 131, 133, 135, 137, 139, 143,
122, 124, 127, 130, 132, 134, 136, 138, 140, 165

Create a histogram of this data in Google Sheets (you can enter the bins and counts manually or chart the raw data directly).

Based on your histogram and the data, describe the distribution using:

- Shape: (symmetric, right-skewed, left-skewed, bimodal, etc.)
- Center: (estimate using mean or median, whichever is more appropriate)
- Spread: (use range and/or standard deviation)
- Unusual features: (gaps, outliers, multiple peaks)

b. Identifying outliers

Using the IQR method:

1. Calculate Q1, Q3, and IQR for the blood pressure data
2. Calculate the lower fence: Q1 - 1.5 × IQR
3. Calculate the upper fence: Q3 + 1.5 × IQR
4. Identify any outliers
5. Discuss: What might this outlier represent in a public health context? Should it be removed from analysis? Why or why not?
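
The fence calculation in the IQR method can be sketched in a few lines of Python. The readings below are made up for illustration (they are not the blood pressure data), and the quartile method (`"inclusive"`) is an assumption; your spreadsheet may use a slightly different convention.

```python
import statistics

# Illustrative readings only (not the assignment's blood pressure data).
readings = [10, 12, 13, 14, 15, 16, 17, 18, 19, 40]

# Quartiles via linear interpolation between data points.
q1, _, q3 = statistics.quantiles(readings, n=4, method="inclusive")
iqr = q3 - q1

# Fences: values outside [lower_fence, upper_fence] are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in readings if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence, outliers)  # 6.5 24.5 [40]
```

Flagging a value this way only marks it as unusual; whether to keep it is a judgment about what the value represents.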

Question 6: Creating Appropriate Visualizations

Dr. Martinez has data from a community health survey. You need to create appropriate visualizations.

Dataset: Use Google Sheets to enter and visualize this data about 30 survey respondents:

Age group: 18-30 (n=8), 31-50 (n=12), 51+ (n=10)
Exercise frequency: Daily (n=7), Weekly (n=15), Rarely (n=8)
Hours of sleep: 5, 6, 6.5, 7, 7, 7, 7.5, 7.5, 8, 8, 8, 8, 8.5, 8.5, 9, 9, 5.5, 6, 6.5, 7, 7.5, 7.5, 8, 8, 8.5, 9, 9.5, 6, 7, 9

a. Choose appropriate visualizations

For each variable, state what type of chart is most appropriate and why:

Variable	Chart Type	Reasoning
Age group (categorical)
Exercise frequency (categorical)
Hours of sleep (numerical)

b. Create visualizations

Create the following in Google Sheets:

1. A bar chart showing the distribution of exercise frequency
2. A histogram showing the distribution of hours of sleep
3. A box plot of hours of sleep

Paste screenshots of your three visualizations here.

c. Interpretation

For each visualization, write one sentence describing what it reveals about the data.

Question 7: Exploring Relationships Between Variables

Dr. Martinez wants to understand if there’s a relationship between hours of physical activity per week and resting heart rate.

Data: 15 community members

Person	Hours of Activity/Week	Resting Heart Rate (bpm)
1	0	78
2	1	76
3	2	74
4	2.5	72
5	3	70
6	3.5	69
7	4	68
8	4.5	66
9	5	65
10	5.5	64
11	6	63
12	7	61
13	8	60
14	9	58
15	10	56

a. Scatterplot (numerical-numerical relationship)

Create a scatterplot in Google Sheets with:

- Hours of activity on the x-axis
- Resting heart rate on the y-axis

Paste a screenshot here.

b. Describe the relationship

Looking at your scatterplot, describe:

- Direction: Positive or negative association?
- Strength: Strong, moderate, or weak?
- Form: Linear or non-linear?
- Outliers: Any unusual points?
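
Direction and strength are judged by eye here, but they can also be summarized with a correlation coefficient, which is optional for this question. The sketch below computes Pearson's r from its definition on made-up values (not the activity data); the names `pearson_r`, `hours`, and `rate` are illustrative. In Google Sheets, the `CORREL` function is assumed to give the same quantity.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled to [-1, 1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Illustrative values only: y falls as x rises.
hours = [1, 2, 3, 4, 5]
rate = [10, 8, 7, 5, 2]

r = pearson_r(hours, rate)
print(round(r, 3))  # negative sign = negative association; near -1 = strong
```

A value near -1 or +1 indicates a strong linear association, while a value near 0 indicates a weak one; r says nothing about whether the form is linear, so the scatterplot is still needed.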

c. Public health interpretation

What does this relationship suggest about the connection between physical activity and heart health? Write 2-3 sentences.

Question 8: Categorical-Numerical Relationships

Dr. Martinez has data on average daily sugar intake (grams) for people in three different diet categories:

Standard Diet: 85, 92, 88, 95, 90, 87, 93, 89, 91, 94
Reduced Sugar Diet: 45, 52, 48, 55, 50, 47, 53, 49, 51, 54
Low Sugar Diet: 25, 32, 28, 35, 30, 27, 33, 29, 31, 34

a. Side-by-side box plots

Create side-by-side box plots in Google Sheets comparing sugar intake across the three diet groups.

Paste a screenshot here.

b. Compare distributions

Compare the three groups using:

- Centers: Which group has the highest/lowest median sugar intake?
- Spreads: Which group has the most/least variability?
- Overall pattern: What do these box plots tell you about sugar intake under the different diets?

c. Recommendation

Based on this data, what would you recommend to someone trying to reduce their sugar intake to below 40 grams per day? Use specific evidence from the data.

Question 9: Reading and Interpreting Published Health Studies

Dr. Martinez shows you a graph from a published study on the effectiveness of a smoking cessation program:

Study Details:

- 200 participants enrolled in a 12-week smoking cessation program
- Participants tracked as: “Quit successfully” (45%), “Reduced smoking” (30%), “No change” (25%)
- A bar chart shows these three outcomes

[Figure: hypothetical bar chart showing Quit successfully 45%, Reduced smoking 30%, No change 25%]

a. Interpret the visualization

What type of variable is “outcome” (quit successfully, reduced, no change)?
Is a bar chart appropriate for this data? Why or why not?
What percentage of participants showed improvement (quit or reduced)?

b. Critical evaluation

The study concluded: “Our program is highly effective, with 75% of participants showing improvement.”

List THREE questions you should ask about the study design before accepting this conclusion:

c. Study table interpretation

The published paper includes this summary table:

Outcome	Men (n=100)	Women (n=100)
Quit	40 (40%)	50 (50%)
Reduced	35 (35%)	25 (25%)
No Change	25 (25%)	25 (25%)

Based on this table:

- Calculate the success rate (quit + reduced) for men and women separately
- Do men and women appear to respond differently to the program?
- What additional analysis might you want to see?

Question 10: Putting It All Together - Mini Research Proposal

Dr. Martinez asks you to design a small study to investigate whether a new school-based nutrition education program improves children’s fruit and vegetable consumption.

Write a brief research proposal (300-400 words total) that includes:

a. Research question

State a clear, specific research question.

b. Study design

Describe your study design, making sure to include:

- How you will use random assignment
- What your control group will be
- Any blinding procedures
- How you will ensure replication

c. Data collection

What data will you collect?
How will you avoid bias in data collection?
What potential confounding variables should you consider?

d. Data analysis plan

What summary statistics will you calculate?
What visualizations will you create?
How will you compare the treatment and control groups?

💭 Question 11: Detective’s Reflection

Reflect on your public health investigation (5-7 sentences):

Why is random assignment so important in health studies?
How can bias in data collection lead to incorrect public health conclusions?
What’s the difference between describing a single variable (like blood pressure) versus exploring relationships between two variables (like exercise and heart rate)?
Which summary statistic (mean, median, mode, standard deviation, IQR) do you think is most useful for public health decisions? Why?
Name one specific way that understanding data visualization could help improve community health programs.
What surprised you most about evaluating study designs?

📊 Submission Instructions

Submit your assignment as a PDF or Google Doc containing all of your answers, written by hand. You do not need to include the screenshots; if needed, you can sketch the charts you obtained in Google Sheets.

🎉 Excellent work, Statistical Detective! The Public Health Department thanks you for your thorough analysis!

Remember: Good public health decisions require good data, collected carefully, and analyzed thoughtfully. Your work matters!