HW2: Public Health Data Investigation
Understanding Study Design, Data Collection, and Descriptive Statistics
🏥 The Case
Welcome back, Statistical Detective!
Your work has caught the attention of the County Public Health Department. Dr. Sarah Martinez, the Director of Community Health, needs your help analyzing health data and evaluating research studies.
“We’re facing multiple public health challenges,” Dr. Martinez explains. “Childhood obesity rates are climbing, vaccination coverage varies across neighborhoods, and we’re seeing concerning patterns in chronic disease prevalence. We have data from various sources, but we need someone who understands both study design and data analysis to help us make evidence-based decisions. We also need to evaluate published health studies to inform our policies.”
Your mission: Use your knowledge of study design, data collection methods, and descriptive statistics to help the Public Health Department make informed decisions!
Question 1: Recognizing Well-Designed Studies
Dr. Martinez shares three different approaches the department is considering for studying the effectiveness of a new diabetes prevention program.
Study A: The department plans to recruit 200 volunteers who are interested in the program. Half will participate in the 12-week program, and half will be asked to wait. Participants will know which group they’re in, and researchers will measure blood glucose levels before and after.
Study B: The department will randomly assign 300 eligible participants to either the new prevention program or the standard health education program currently offered. Neither participants nor the health educators measuring outcomes will know which program is the “new” one. The study will be conducted at three different community health centers.
Study C: The department will offer the new program at one community center and compare outcomes to another community center that continues with standard care. The same nurse will measure blood glucose for all participants.
a. Key features of study design
For each study (A, B, and C), identify whether the following features are present. Fill in the table:
| Feature | Study A | Study B | Study C |
|---|---|---|---|
| Random assignment | Yes / No | Yes / No | Yes / No |
| Control group | Yes / No | Yes / No | Yes / No |
| Blinding (participants) | Yes / No | Yes / No | Yes / No |
| Blinding (outcome assessors) | Yes / No | Yes / No | Yes / No |
| Replication (multiple sites/groups) | Yes / No | Yes / No | Yes / No |
b. Best study design
Which study (A, B, or C) has the strongest design? Explain why, referencing at least THREE specific design features that make it superior.
c. Potential problems
For the study you didn’t choose as strongest:
- Identify at least TWO design weaknesses
- Explain how each weakness could affect the study’s conclusions
Question 2: Identifying Biases in Data Collection
The Public Health Department wants to understand vaccination rates in the community. They’re considering different data collection approaches.
Scenario 1: Post a survey on the department’s Facebook page asking parents about their children’s vaccination status.
Scenario 2: Send a survey to a random sample of households selected from the county’s registered voter list.
Scenario 3: Call households randomly selected from county birth records of children born 2-5 years ago.
Scenario 4: Ask pediatricians to have parents in their waiting rooms fill out surveys.
a. Define bias types
In your own words, define:
- Convenience sampling:
- Voluntary response bias:
- Confounding:
b. Identify biases
For each scenario (1-4), identify what type(s) of bias might be present and explain how this bias could affect the results:
| Scenario | Type(s) of Bias | Impact on Conclusions |
|---|---|---|
| 1 | | |
| 2 | | |
| 3 | | |
| 4 | | |
c. Best approach
Which scenario would give the most reliable estimate of vaccination rates in the community? Justify your answer.
d. Real-world confounding
Suppose the department finds that neighborhoods with lower vaccination rates also have higher rates of childhood illness. A researcher concludes: “Low vaccination rates cause higher illness rates.”
- Identify at least TWO potential confounding variables that might explain this relationship
- Explain how each confounding variable could affect the conclusion
Question 3: Google Sheets - Data Entry and Organization
Dr. Martinez provides you with raw data on childhood BMI (Body Mass Index) measurements from school health screenings. You need to organize this data in Google Sheets.
The Data:
Child 1: Age 8, Height 127cm, Weight 28kg, BMI 17.4
Child 2: Age 9, Height 134cm, Weight 35kg, BMI 19.5
Child 3: Age 7, Height 119cm, Weight 24kg, BMI 17.0
Child 4: Age 8, Height 130cm, Weight 42kg, BMI 24.9
Child 5: Age 9, Height 138cm, Weight 38kg, BMI 19.9
Child 6: Age 7, Height 122cm, Weight 26kg, BMI 17.5
Child 7: Age 8, Height 125cm, Weight 31kg, BMI 19.8
Child 8: Age 9, Height 141cm, Weight 45kg, BMI 22.6
a. Create a spreadsheet
Create a Google Sheet with this data. Your spreadsheet should have:
- Appropriate column headers
- One row per child
- Proper data organization (one variable per column)
Take a screenshot of your organized spreadsheet and paste it here.
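If you want to double-check your data entry, the sketch below lays out the same eight children with one row per child and one variable per column, then recomputes BMI from height and weight. It assumes the standard formula (weight in kg divided by height in metres squared); the column names and the `bmi` helper are illustrative choices, not part of the assignment. Small differences from the listed values are expected because the listed BMIs are rounded to one decimal place.

```python
# One row per child, one variable per column, mirroring the spreadsheet layout.
# Column order (illustrative names): child, age_years, height_cm, weight_kg, bmi_listed
rows = [
    (1, 8, 127, 28, 17.4),
    (2, 9, 134, 35, 19.5),
    (3, 7, 119, 24, 17.0),
    (4, 8, 130, 42, 24.9),
    (5, 9, 138, 38, 19.9),
    (6, 7, 122, 26, 17.5),
    (7, 8, 125, 31, 19.8),
    (8, 9, 141, 45, 22.6),
]


def bmi(height_cm, weight_kg):
    """Standard BMI formula: weight (kg) / height (m) squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


# Compare each listed BMI with the value recomputed from height and weight.
checks = []
for child, age_years, height_cm, weight_kg, bmi_listed in rows:
    recomputed = bmi(height_cm, weight_kg)
    checks.append((child, bmi_listed, recomputed))
    print(f"Child {child}: listed {bmi_listed}, recomputed {recomputed:.2f}")
```

A check like this is one way to catch typing mistakes (for example, a height entered in metres instead of centimetres) before you calculate any statistics.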
b. Data entry best practices
List THREE best practices for entering data in a spreadsheet that you followed:
Question 4: Calculating Summary Statistics
Using the BMI data from Question 3, calculate the following using Google Sheets functions. Show both the formula you used and the result.
a. Measures of center
| Statistic | Formula in Google Sheets | Result |
|---|---|---|
| Mean BMI | | |
| Median BMI | | |
| Mode (if any) | | |
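To cross-check the values you get in Google Sheets, here is a minimal Python sketch of the same measures of center using the standard `statistics` module. The mapping to `=AVERAGE` and `=MEDIAN` in the comments is an assumption about which Sheets functions you choose; the variable names are illustrative.

```python
import statistics

# BMI values from Question 3, in child order.
bmi_values = [17.4, 19.5, 17.0, 24.9, 19.9, 17.5, 19.8, 22.6]

mean_bmi = statistics.mean(bmi_values)      # comparable to =AVERAGE(range)
median_bmi = statistics.median(bmi_values)  # comparable to =MEDIAN(range)

# A mode is only meaningful if at least one value repeats.
has_repeated_value = len(set(bmi_values)) < len(bmi_values)

print(f"mean = {mean_bmi:.3f}, median = {median_bmi:.2f}")
print("repeated value present:", has_repeated_value)
```

If no value repeats, it is reasonable to report that the data set has no mode rather than forcing a number into the table.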
b. Measures of spread
| Statistic | Formula in Google Sheets | Result |
|---|---|---|
| Range | | |
| Standard Deviation | | |
| Q1 (First Quartile) | | |
| Q3 (Third Quartile) | | |
| IQR | | |
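The measures of spread can be cross-checked the same way. This sketch assumes you want the sample standard deviation and quartiles computed with linear interpolation that includes both endpoints (`method="inclusive"`), which should be comparable to `=STDEV` and `=QUARTILE` in Sheets; if your results differ slightly, a different quartile convention is the most likely reason.

```python
import statistics

# BMI values from Question 3, in child order.
bmi_values = [17.4, 19.5, 17.0, 24.9, 19.9, 17.5, 19.8, 22.6]

value_range = max(bmi_values) - min(bmi_values)

# Sample standard deviation (divides by n - 1).
sample_sd = statistics.stdev(bmi_values)

# Quartiles with inclusive linear interpolation; the middle cut point is the median.
q1, q2, q3 = statistics.quantiles(bmi_values, n=4, method="inclusive")
iqr = q3 - q1

print(f"range = {value_range:.1f}")
print(f"sample SD = {sample_sd:.2f}")
print(f"Q1 = {q1:.3f}, Q3 = {q3:.3f}, IQR = {iqr:.3f}")
```

Because one child has a noticeably higher BMI than the rest, comparing the IQR with the range is a quick way to see how much a single value stretches the spread.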
c. Interpretation
Write 2-3 sentences interpreting what these statistics tell Dr. Martinez about BMI in this group of children. Include comments about both the typical BMI and the variation.
Question 5: Describing Distributions
a. Blood pressure data
Dr. Martinez shares data on systolic blood pressure (mmHg) from 50 adult patients:
118, 122, 125, 128, 130, 132, 134, 136, 138, 140,
120, 124, 126, 129, 131, 133, 135, 137, 139, 142,
119, 123, 127, 128, 130, 132, 134, 136, 138, 141,
121, 125, 126, 129, 131, 133, 135, 137, 139, 143,
122, 124, 127, 130, 132, 134, 136, 138, 140, 165
Create a histogram of this data (you can draw it by hand or build it in Google Sheets from the data).
Based on your histogram and the data, describe the distribution using:
- Shape: (symmetric, right-skewed, left-skewed, bimodal, etc.)
- Center: (estimate using mean or median, whichever is more appropriate)
- Spread: (use range and/or standard deviation)
- Unusual features: (gaps, outliers, multiple peaks)
b. Identifying outliers
Using the IQR method:
1. Calculate Q1, Q3, and IQR for the blood pressure data
2. Calculate the lower fence: Q1 - 1.5 × IQR
3. Calculate the upper fence: Q3 + 1.5 × IQR
4. Identify any outliers
5. Discuss: What might this outlier represent in a public health context? Should it be removed from analysis? Why or why not?
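The fence calculation above can be sketched in Python as a way to check your spreadsheet work. This is a sketch under one assumption: quartiles are computed with inclusive linear interpolation, which should be comparable to `=QUARTILE` in Sheets. Other quartile conventions shift the fences slightly, so treat the exact fence values as convention-dependent.

```python
import statistics

# Systolic blood pressure (mmHg) for the 50 adult patients in part (a).
systolic = [
    118, 122, 125, 128, 130, 132, 134, 136, 138, 140,
    120, 124, 126, 129, 131, 133, 135, 137, 139, 142,
    119, 123, 127, 128, 130, 132, 134, 136, 138, 141,
    121, 125, 126, 129, 131, 133, 135, 137, 139, 143,
    122, 124, 127, 130, 132, 134, 136, 138, 140, 165,
]

# Steps 1-3: quartiles, IQR, and the two fences.
q1, _median, q3 = statistics.quantiles(systolic, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Step 4: any value outside the fences is flagged as an outlier.
outliers = [x for x in systolic if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences = ({lower_fence}, {upper_fence})")
print("flagged values:", outliers)
```

Flagging a value is only the start of step 5: the code cannot tell you whether the reading is a measurement error or a real patient who needs follow-up.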
Question 6: Creating Appropriate Visualizations
Dr. Martinez has data from a community health survey. You need to create appropriate visualizations.
Dataset: Use Google Sheets to enter and visualize this data about 30 survey respondents:
- Age group: 18-30 (n=8), 31-50 (n=12), 51+ (n=10)
- Exercise frequency: Daily (n=7), Weekly (n=15), Rarely (n=8)
- Hours of sleep: 5, 6, 6.5, 7, 7, 7, 7.5, 7.5, 8, 8, 8, 8, 8.5, 8.5, 9, 9, 5.5, 6, 6.5, 7, 7.5, 7.5, 8, 8, 8.5, 9, 9.5, 6, 7, 9
a. Choose appropriate visualizations
For each variable, state what type of chart is most appropriate and why:
| Variable | Chart Type | Reasoning |
|---|---|---|
| Age group (categorical) | | |
| Exercise frequency (categorical) | | |
| Hours of sleep (numerical) | | |
b. Create visualizations
Create the following in Google Sheets:
1. A bar chart showing the distribution of exercise frequency
2. A histogram showing the distribution of hours of sleep
3. A box plot of hours of sleep
Paste screenshots of your three visualizations here.
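To see what a histogram does behind the scenes, the sketch below groups the hours-of-sleep values into bins and prints a text version of the chart. The one-hour bin width is an assumption made for illustration; Google Sheets may pick a different bucket size, so the bar heights in your chart can differ.

```python
import math
from collections import Counter

# Hours of sleep for the 30 survey respondents.
sleep_hours = [
    5, 6, 6.5, 7, 7, 7, 7.5, 7.5, 8, 8, 8, 8, 8.5, 8.5, 9, 9,
    5.5, 6, 6.5, 7, 7.5, 7.5, 8, 8, 8.5, 9, 9.5, 6, 7, 9,
]

# Assign each value to a one-hour bin [start, start + 1) and count per bin.
bin_counts = Counter(math.floor(hours) for hours in sleep_hours)

# Print one row of '#' marks per bin, like a sideways histogram.
for start in sorted(bin_counts):
    print(f"{start}-{start + 1} h | {'#' * bin_counts[start]}")
```

Changing the bin width changes how the shape looks, which is why it helps to state the bins you used when you describe a histogram.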
c. Interpretation
For each visualization, write one sentence describing what it reveals about the data.
Question 7: Exploring Relationships Between Variables
Dr. Martinez wants to understand if there’s a relationship between hours of physical activity per week and resting heart rate.
Data: 15 community members
| Person | Hours of Activity/Week | Resting Heart Rate (bpm) |
|---|---|---|
| 1 | 0 | 78 |
| 2 | 1 | 76 |
| 3 | 2 | 74 |
| 4 | 2.5 | 72 |
| 5 | 3 | 70 |
| 6 | 3.5 | 69 |
| 7 | 4 | 68 |
| 8 | 4.5 | 66 |
| 9 | 5 | 65 |
| 10 | 5.5 | 64 |
| 11 | 6 | 63 |
| 12 | 7 | 61 |
| 13 | 8 | 60 |
| 14 | 9 | 58 |
| 15 | 10 | 56 |
a. Scatterplot (numerical-numerical relationship)
Create a scatterplot in Google Sheets with:
- Hours of activity on the x-axis
- Resting heart rate on the y-axis
Paste a screenshot here.
b. Describe the relationship
Looking at your scatterplot, describe:
- Direction: Positive or negative association?
- Strength: Strong, moderate, or weak?
- Form: Linear or non-linear?
- Outliers: Any unusual points?
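A number can back up what you see in the scatterplot. The sketch below computes the Pearson correlation coefficient r for the 15 community members; this goes a step beyond what the question asks, and the `pearson_r` helper is an illustrative name. The result should be comparable to `=CORREL` in Sheets. The sign of r gives the direction, and a value near -1 or +1 suggests a strong linear pattern.

```python
import math

# Data for the 15 community members, in person order.
hours_of_activity = [0, 1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 7, 8, 9, 10]
resting_heart_rate = [78, 76, 74, 72, 70, 69, 68, 66, 65, 64, 63, 61, 60, 58, 56]


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y, scaled to lie in [-1, 1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(hours_of_activity, resting_heart_rate)
print(f"r = {r:.3f}")
```

Keep in mind that r only measures linear association in this sample; it does not show that activity causes a lower heart rate.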
c. Public health interpretation
What does this relationship suggest about the connection between physical activity and heart health? Write 2-3 sentences.
Question 8: Categorical-Numerical Relationships
Dr. Martinez has data on average daily sugar intake (grams) for people in three different diet categories:
- Standard Diet: 85, 92, 88, 95, 90, 87, 93, 89, 91, 94
- Reduced Sugar Diet: 45, 52, 48, 55, 50, 47, 53, 49, 51, 54
- Low Sugar Diet: 25, 32, 28, 35, 30, 27, 33, 29, 31, 34
a. Side-by-side box plots
Create side-by-side box plots in Google Sheets comparing sugar intake across the three diet groups.
Paste a screenshot here.
b. Compare distributions
Compare the three groups using:
- Centers: Which group has the highest/lowest median sugar intake?
- Spreads: Which group has the most/least variability?
- Overall pattern: What do these box plots tell you about sugar intake under the different diet interventions?
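The comparison above can be supported with numbers. This minimal sketch computes the median and range for each diet group; using the range as the measure of spread is an illustrative choice (the IQR from your box plots works just as well), and the dictionary layout is an assumption made for the example.

```python
import statistics

# Average daily sugar intake (grams) by diet group.
sugar_intake = {
    "Standard Diet": [85, 92, 88, 95, 90, 87, 93, 89, 91, 94],
    "Reduced Sugar Diet": [45, 52, 48, 55, 50, 47, 53, 49, 51, 54],
    "Low Sugar Diet": [25, 32, 28, 35, 30, 27, 33, 29, 31, 34],
}

# Center (median) and spread (range) for each group.
medians = {group: statistics.median(values) for group, values in sugar_intake.items()}
ranges = {group: max(values) - min(values) for group, values in sugar_intake.items()}

for group in sugar_intake:
    print(f"{group}: median = {medians[group]} g, range = {ranges[group]} g")
```

Printing the groups side by side mirrors what the side-by-side box plots show visually: where each box sits and how wide it is.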
c. Recommendation
Based on this data, what would you recommend to someone trying to reduce their sugar intake to below 40 grams per day? Use specific evidence from the data.
Question 9: Reading and Interpreting Published Health Studies
Dr. Martinez shows you a graph from a published study on the effectiveness of a smoking cessation program:
Study Details:
- 200 participants enrolled in a 12-week smoking cessation program
- Participants tracked as: “Quit successfully” (45%), “Reduced smoking” (30%), “No change” (25%)
- A bar chart shows these three outcomes

[Figure: hypothetical bar chart showing Quit successfully 45%, Reduced smoking 30%, No change 25%]
a. Interpret the visualization
- What type of variable is “outcome” (quit successfully, reduced, no change)?
- Is a bar chart appropriate for this data? Why or why not?
- What percentage of participants showed improvement (quit or reduced)?
b. Critical evaluation
The study concluded: “Our program is highly effective, with 75% of participants showing improvement.”
List THREE questions you should ask about the study design before accepting this conclusion:
c. Study table interpretation
The published paper includes this summary table:
| Outcome | Men (n=100) | Women (n=100) |
|---|---|---|
| Quit | 40 (40%) | 50 (50%) |
| Reduced | 35 (35%) | 25 (25%) |
| No Change | 25 (25%) | 25 (25%) |
Based on this table:
- Calculate the success rate (quit + reduced) for men and women separately
- Do men and women appear to respond differently to the program?
- What additional analysis might you want to see?
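The arithmetic for this table can be sketched as follows. The code computes the success rate for each group and also checks that the table's counts agree with the overall quit percentage reported in the study details; the dictionary layout and variable names are illustrative.

```python
# Counts from the published summary table (n = 100 per group).
counts = {
    "Men": {"Quit": 40, "Reduced": 35, "No Change": 25},
    "Women": {"Quit": 50, "Reduced": 25, "No Change": 25},
}

# Success rate = (quit + reduced) / group size, as a percentage.
success_rate = {}
for group, outcomes in counts.items():
    group_size = sum(outcomes.values())
    success_rate[group] = 100 * (outcomes["Quit"] + outcomes["Reduced"]) / group_size

# Consistency check against the overall "Quit successfully" figure.
total_participants = sum(sum(outcomes.values()) for outcomes in counts.values())
total_quit = sum(outcomes["Quit"] for outcomes in counts.values())
overall_quit_rate = 100 * total_quit / total_participants

print(success_rate)
print(f"overall quit rate = {overall_quit_rate}%")
```

Notice that a combined "success" rate can hide differences in the individual outcome categories, which is worth keeping in mind when you answer the second bullet.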
Question 10: Putting It All Together - Mini Research Proposal
Dr. Martinez asks you to design a small study to investigate whether a new school-based nutrition education program improves children’s fruit and vegetable consumption.
Write a brief research proposal (300-400 words total) that includes:
a. Research question
State a clear, specific research question.
b. Study design
Describe your study design, making sure to include:
- How you will use random assignment
- What your control group will be
- Any blinding procedures
- How you will ensure replication
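If you are unsure what random assignment looks like in practice, here is a minimal sketch. Everything specific in it is hypothetical: the twelve classroom labels, the choice to randomize whole classrooms rather than individual children, and the seed value. The idea is simply that chance, not the researcher or the participants, decides who gets the new program.

```python
import random

# Hypothetical units to assign: 12 classrooms (labels invented for illustration).
classrooms = [f"Classroom {i}" for i in range(1, 13)]

# A fixed seed (arbitrary value) makes the assignment reproducible for auditing.
rng = random.Random(2024)

# Shuffle a copy, then split it in half: first half treatment, second half control.
shuffled = classrooms[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
treatment_group = sorted(shuffled[:half])  # receives the nutrition education program
control_group = sorted(shuffled[half:])    # continues with the usual curriculum

print("Treatment:", treatment_group)
print("Control:  ", control_group)
```

In your proposal, describe the procedure in words (for example, drawing names or using a random number generator); you do not need to include code.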
c. Data collection
- What data will you collect?
- How will you avoid bias in data collection?
- What potential confounding variables should you consider?
d. Data analysis plan
- What summary statistics will you calculate?
- What visualizations will you create?
- How will you compare the treatment and control groups?
💭 Question 11: Detective’s Reflection
Reflect on your public health investigation (5-7 sentences):
- Why is random assignment so important in health studies?
- How can bias in data collection lead to incorrect public health conclusions?
- What’s the difference between describing a single variable (like blood pressure) versus exploring relationships between two variables (like exercise and heart rate)?
- Which summary statistic (mean, median, mode, standard deviation, IQR) do you think is most useful for public health decisions? Why?
- Name one specific way that understanding data visualization could help improve community health programs.
- What surprised you most about evaluating study designs?
📊 Submission Instructions
Submit your assignment as a PDF or Google Doc containing all of your written (by hand) answers. You do not need to include the screenshots; if needed, you can sketch the chart you obtained in Google Sheets instead.
🎉 Excellent work, Statistical Detective! The Public Health Department thanks you for your thorough analysis!
Remember: Good public health decisions require good data, collected carefully and analyzed thoughtfully. Your work matters!
