Lecture 2: Data Collection and Statistical Thinking

STAT 7 - Statistical Methods for the Biological, Environmental & Health Sciences

09 Jan 2026

Before We Start

📋 Quick Reminders

Syllabus Acknowledgement and Survey 1 close tomorrow; please complete them before then.
HW 1 will be posted tomorrow (Friday)
Discussion sections start next week
Make sure you can access Poll Everywhere and Ed Discussion! Talk to me after class if you can’t.

Today’s Plan:

We’ll explore how to think statistically and learn about data collection methods that impact everything we do in this course.

Learning Objectives 🎯

What You’ll Learn Today

By the end of today’s class, you will be able to:

Explain what statistical thinking is and why it matters for scientific research
Identify the key components of a statistical study (population, sample, variables, parameters, statistics)
Distinguish between different types of variables (categorical vs. numerical; nominal, ordinal, binary, discrete, continuous)
Recognize common sources of bias in data collection
Evaluate the quality and reliability of data based on collection methods
Apply these concepts to real-world case studies in biological and health sciences

Motivating Case Study 🥜

Case Study: Preventing Peanut Allergies

The Problem:

Peanut allergies in children were increasing dramatically
Traditional advice: avoid peanuts in early childhood. But was this advice actually helping?

The Study: LEAP (2006-2009)

640 UK infants with eczema or egg allergy
Randomly assigned to two groups:
- Peanut consumption group
- Peanut avoidance group
Tested at age 5 with oral food challenge

Key Question: Does early peanut consumption prevent allergies?

The Data: Raw Form

participant.ID	treatment.group	overall.V60.outcome
LEAP_100522	Peanut Consumption	PASS OFC
LEAP_103358	Peanut Consumption	PASS OFC
LEAP_105069	Peanut Avoidance	PASS OFC
LEAP_994047	Peanut Avoidance	PASS OFC
LEAP_997608	Peanut Consumption	PASS OFC

Individual-level data for five children shows their treatment group and outcome.

But 640 children are too many to look at individually…

The Data: Summarized

Summary Table

	FAIL OFC	PASS OFC	Total
Peanut Avoidance	36	227	263
Peanut Consumption	5	262	267
Total	41	489	530

Avoidance: 36/263 = 13.7% failed
Consumption: 5/267 = 1.9% failed
Difference: 13.7% - 1.9% = 11.8 percentage points
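
The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch that uses only the counts from the summary table; the names `counts` and `failure_rate` are illustrative and are not part of the LEAP data set.

```python
# Counts copied from the summary table (FAIL OFC and group total).
counts = {
    "Peanut Avoidance": {"fail": 36, "total": 263},
    "Peanut Consumption": {"fail": 5, "total": 267},
}

def failure_rate(group):
    """Proportion of children in a group who failed the oral food challenge."""
    return counts[group]["fail"] / counts[group]["total"]

avoidance = failure_rate("Peanut Avoidance")
consumption = failure_rate("Peanut Consumption")
difference = avoidance - consumption

print(f"Avoidance:   {avoidance:.1%}")
print(f"Consumption: {consumption:.1%}")
print(f"Difference:  {difference * 100:.1f} percentage points")
```

The difference is reported in percentage points because it is a subtraction of two percentages, not a percent change.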

Visual Summary

The Statistical Question

Is the 11.8-percentage-point difference real or just chance variation?

This is where statistical thinking comes in!

We need tools to:
- Understand if this difference is meaningful
- Account for uncertainty
- Make reliable conclusions

Later in the course, you’ll learn how to answer this question using hypothesis testing
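
As a preview only, one way to get a feel for "chance variation" is to shuffle the outcomes between the two groups many times and see how often a difference as large as the observed one appears. The sketch below assumes the counts from the summary table; the seed, the number of shuffles, and all variable names are arbitrary choices for illustration, and this is not necessarily the method the course will use.

```python
import random

# Outcomes from the summary table: 1 = failed the OFC, 0 = passed.
n_avoid, n_consume = 263, 267
outcomes = [1] * 41 + [0] * 489          # 41 failures among 530 children
observed_diff = 36 / n_avoid - 5 / n_consume

def shuffled_difference(rng):
    """Reassign outcomes to groups at random and recompute the difference."""
    shuffled = outcomes[:]
    rng.shuffle(shuffled)
    avoid = shuffled[:n_avoid]
    consume = shuffled[n_avoid:]
    return sum(avoid) / n_avoid - sum(consume) / n_consume

rng = random.Random(7)                    # arbitrary seed, for repeatability
n_shuffles = 2000
diffs = [shuffled_difference(rng) for _ in range(n_shuffles)]

# Share of shuffles that produce a difference at least as large as observed.
share_as_extreme = sum(d >= observed_diff for d in diffs) / n_shuffles
print(f"Observed difference: {observed_diff:.3f}")
print(f"Share of shuffles at least that large: {share_as_extreme:.4f}")
```

If group labels did not matter, random shuffles would often produce a gap this large. A very small share suggests that chance alone is an unlikely explanation.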

What is Statistical Thinking? 📊

Why Do We Need Statistics?

The Belief Bias Effect

When asked if an argument is logically valid, people tend to be influenced by whether the conclusion seems believable, even when they shouldn’t be.

We naturally let our biases affect our reasoning!

Statistics helps us:

Overcome cognitive biases
Make objective decisions based on evidence
Quantify uncertainty
Distinguish signal from noise

Recommended reading: Chapter 1 from Learning Statistics with R

Activity 1: Belief Bias 💭

What is Statistics?

American Statistical Association:

“Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty.”

Data Science:

An interdisciplinary field that uses various methods and processes to extract insight from data and apply actionable insight across application domains.

Key Ideas:

Statistics is about learning from evidence
Every conclusion comes with uncertainty
We need systematic methods to draw reliable conclusions

The Statistical Investigation Cycle

PPDAC: Problem → Plan → Data → Analysis → Conclusion

Every statistical study follows this cycle!

From Chris Wild: What is Statistics?

Two Types of Statistics

📊 Descriptive Statistics

Use numerical and graphical methods to:

Explore data
Look for patterns
Summarize information
Present in convenient form

Example: “In our sample, 13.7% of children in the avoidance group developed allergies”

🔮 Inferential Statistics

Use sample data to:

Make estimates
Make decisions
Make predictions
Generalize to larger populations

Example: “We conclude that peanut consumption reduces allergy risk in the broader population of at-risk children”

This course covers both! We start with description, then move to inference.

Key Statistical Concepts 🔑

The Building Blocks

Unit (Subject): An object we collect data about

Examples: People, animals, batteries, transactions

Population: The full set of units we’re interested in

Examples: All CA residents, entire whale species, all Amazon transactions

Sample: A subset of the population we actually observe

Examples: 1,000 CA residents, 25 blue whales, one hour of transactions

Variable: A characteristic we measure for each unit

Examples: Education level, whale mass, transaction size

Note: A census collects data from every member of a population (rare and expensive!)

Visualizing Population vs. Sample

Population

All units of interest (often unobserved)

Sample

Observed units (subset we measure)

Key insight: We use the sample (what we can observe) to learn about the population (what we want to know about)

Parameters vs. Statistics

📏 Parameter

A numerical measurement describing a population

True but usually unknown
Theoretical value
What we want to know

Examples:

True proportion of CA residents with college degrees
Average mass of all blue whales
Population mean battery lifespan

📊 Statistic

A numerical measurement describing a sample

Observed and known
Empirical value
What we actually calculate

Examples:

Proportion of college degrees in our 1,000-person sample
Average mass of our 25 sampled whales
Mean lifespan of 100 tested batteries

Statistics estimate parameters! We calculate statistics from our sample to estimate the unknown parameter values of the population.
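
The distinction can be illustrated with a small simulation. This is a sketch under made-up assumptions: the population of 100,000 residents and the 35% degree rate are hypothetical, chosen only so that the parameter is known and can be compared with the statistics.

```python
import random

rng = random.Random(42)  # arbitrary seed so the run is repeatable

# Hypothetical population: True = resident has a college degree.
# The 35% rate is invented for illustration.
population = [rng.random() < 0.35 for _ in range(100_000)]

# The parameter is known here only because we built the population ourselves.
parameter = sum(population) / len(population)

# Draw three separate samples of 1,000 residents; compute the statistic for each.
sample_proportions = []
for _ in range(3):
    sample = rng.sample(population, 1000)
    sample_proportions.append(sum(sample) / len(sample))

print(f"Parameter (population proportion): {parameter:.3f}")
for i, stat in enumerate(sample_proportions, start=1):
    print(f"Statistic from sample {i}: {stat:.3f}")
```

Each sample gives a slightly different statistic, while the parameter stays fixed. In a real study we see only one sample and never the parameter itself.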

Activity 2: LEAP Study Components 🔍

Poll Everywhere: Identify the Components

For the LEAP peanut allergy study, identify each component:

Question 1: What is the unit (subject)?

A. Peanuts
B. Children
C. Allergy tests
D. Treatment groups

Question 2: What is the population?

A. All peanuts produced in the UK
B. UK children with eczema/egg allergy in 2006
C. All allergy tests in 2009
D. The 640 children enrolled

LEAP Study: Answers

Unit: Children (individuals with eczema/egg allergy)

Population: UK children aged 4-11 months with eczema/egg allergy in 2006

Sample: The 530 children with negative skin test who completed the study

Variables:
- Treatment group (consumption/avoidance)
- OFC result (pass/fail)
- Age, severity of eczema, etc.

Parameter:
- True proportion of the population who would develop allergies under avoidance: p₁
- True proportion under consumption: p₂
- We want to know: p₁ - p₂

Statistic:
- Sample proportion with avoidance: 36/263 = 0.137
- Sample proportion with consumption: 5/267 = 0.019
- Observed difference: 0.137 - 0.019 = 0.118 (11.8 percentage points)

What About Infinite Populations?

Sometimes populations are conceptually infinite:

All possible coin flips
All possible times in a day to measure something
All future patients who might receive a treatment

In these cases, the parameter represents the theoretical property of the process generating the data.

Break Time! ☕ 5-minute break

Stretch, grab water, chat with neighbors!

We’ll resume with types of variables and data collection.

Types of Variables 📋

Two Main Categories

Categorical Variables

Variables that place individuals into groups or categories

Subtypes:

Nominal: Unordered categories
- Blood type (A, B, AB, O)
- Species of animal
- Brand of detector

Ordinal: Ordered categories
- Education level (HS, BS, MS, PhD)
- Disease severity (mild, moderate, severe)
- Course grade (A, B, C, D, F)

Binary: Only two outcomes
- Yes/No, Pass/Fail, Alive/Dead
- Gender (in some contexts)

Numerical Variables

Variables that take on numerical values where arithmetic makes sense

Subtypes:

Discrete: Countable values, often integers
- Number of children
- Number of mutations
- Number of emergency room visits

Continuous: Any value in a range
- Height, weight, temperature
- Blood pressure
- Reaction time

Visual Guide to Categorical Variable Types

Art by Allison Horst

Visual Guide to Numerical Variable Types

Art by Allison Horst

Why Do Variable Types Matter?

Different variable types require different methods!

Categorical variables:

Summarize with counts and proportions
Visualize with bar charts, pie charts
Analyze with chi-square tests, logistic regression

Numerical variables:

Summarize with means, medians, standard deviations
Visualize with histograms, boxplots, scatterplots
Analyze with t-tests, ANOVA, linear regression

Bottom line: Correctly identifying variable types is the first step in any data analysis!
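
To make the contrast concrete, here is a minimal sketch that summarizes one categorical and one numerical variable using Python's standard library. The data values are made up for illustration.

```python
from collections import Counter
import statistics

# Made-up values for illustration only.
blood_types = ["A", "O", "B", "O", "AB", "O", "A", "O"]    # categorical (nominal)
glucose_mg_dl = [92.0, 105.5, 88.0, 110.0, 97.5, 101.0]    # numerical (continuous)

# Categorical: summarize with counts and proportions.
counts = Counter(blood_types)
proportions = {k: v / len(blood_types) for k, v in counts.items()}

# Numerical: summarize with mean, median, and standard deviation.
mean = statistics.mean(glucose_mg_dl)
median = statistics.median(glucose_mg_dl)
sd = statistics.stdev(glucose_mg_dl)   # sample standard deviation

print("Counts:", dict(counts))
print("Proportions:", proportions)
print(f"Mean: {mean:.1f}, Median: {median:.2f}, SD: {sd:.1f}")
```

Taking the mean of blood types would be meaningless, which is why the variable type has to be identified before choosing a summary.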

Activity 3: Variable Type Practice 🏋️

Classify These Variables

For each variable, identify the type (nominal, ordinal, binary, discrete, or continuous):

Number of doctor visits in the past year
- Discrete (countable)
Patient satisfaction rating (Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied)
- Ordinal (ordered categories)
Blood glucose level (mg/dL)
- Continuous (measurable)
Type of cancer (breast, lung, colon, etc.)
- Nominal (unordered categories)
Did the patient survive? (Yes/No)
- Binary (two outcomes)
Body temperature (°F)
- Continuous (measurable)

CO Detector Study Variables

Recall the carbon monoxide detector study. Classify these variables:

Does the alarm go off under hazardous conditions? (Yes/No)
- Binary categorical
Functionality timing (too late, just in time, too early)
- Ordinal categorical
Distance from CO source (in meters)
- Continuous numerical
Brand (Kidde, First Alert, etc.)
- Nominal categorical
Location in building (basement, first floor, second floor)
- Nominal categorical (or ordinal if you consider height)
Number of CO detectors in building
- Discrete numerical
Age of detector (in months)
- Discrete numerical (or continuous, depending on precision)

Data Collection & Bias ⚠️

Why Data Quality Matters

“The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

— John Tukey, legendary statistician

Key principle:

Your analysis is only as good as your data collection!

Bad data → bad conclusions
Biased sample → biased results
Poorly measured variables → unreliable estimates

Common Sources of Bias

Undercoverage: Some groups are systematically excluded from the sample

Phone surveys miss people without phones

Nonresponse bias: Those who respond differ from those who don’t

Voluntary surveys attract people with strong opinions

Voluntary response bias: People self-select into the sample

Online polls, call-in surveys

Common Sources of Bias (continued)

Measurement error: Measurements are systematically wrong

Self-reported weight vs. actual weight
Malfunctioning instruments

Loaded/ambiguous questions: Question wording influences response

“Should the president have line-item veto to eliminate waste?” (97% yes)
“Should the president have line-item veto, or not?” (57% yes)

Question order effects: Earlier questions influence later responses

Example: Survivor Bias

WWII Bomber Study

During WWII, military officials wanted to determine where to add extra armor to bombers.

They recorded damage on planes that returned from missions.

Initial thought: Add armor where we see the most damage (wings, fuselage)

Problem: What about the planes that didn’t return?

Correct answer: Add armor where returning planes show little damage (engines, cockpit) — because planes hit there didn’t make it back!

Survivor bias: Only observing “survivors” gives a misleading picture of the full population
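
The logic can be sketched with a small simulation. All numbers here are hypothetical and chosen only to illustrate the idea; they are not historical data. Every plane takes one hit, equally likely in the engine or the fuselage, and engine hits are assumed to be far more likely to bring the plane down.

```python
import random

rng = random.Random(1)   # arbitrary seed so the run is repeatable

# Hypothetical probability that a plane is lost, given where it was hit.
P_LOST = {"engine": 0.8, "fuselage": 0.1}

all_hits = []        # hit location for every plane that flew
returned_hits = []   # hit location only for planes that made it back

for _ in range(10_000):
    location = rng.choice(["engine", "fuselage"])   # hits equally likely in both areas
    all_hits.append(location)
    if rng.random() >= P_LOST[location]:            # plane survives the hit
        returned_hits.append(location)

share_all = all_hits.count("engine") / len(all_hits)
share_returned = returned_hits.count("engine") / len(returned_hits)

print(f"Engine hits among all planes:       {share_all:.1%}")
print(f"Engine hits among returning planes: {share_returned:.1%}")
```

Engine hits look rare among the returning planes even though they are just as common overall, because most planes hit there never return to be counted.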

Example: Voluntary Response Bias

TikTok Poll

An influencer posted: “Have you ever been bitten by an animal?”

TikTok users choose whether to respond
2,361 responses received
65% said “yes”, 35% said “no”

Questions:

Are there issues with this data collection?
Is 65% a good estimate for all people?

Problems:

People bitten are more likely to respond
TikTok users ≠ general population
No random sampling
Voluntary participation

Likely truth: Far less than 65% of all people have been bitten!
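
A short simulation shows how self-selection alone can inflate a poll result. The true rate and the response probabilities below are hypothetical, picked only for illustration; nothing here is estimated from the actual TikTok poll.

```python
import random

rng = random.Random(3)       # arbitrary seed so the run is repeatable

TRUE_RATE = 0.30             # hypothetical share of viewers ever bitten
P_RESPOND_BITTEN = 0.20      # hypothetical: bitten viewers respond more often
P_RESPOND_NOT = 0.05         # hypothetical: other viewers rarely respond

responses = []               # True = responder says "yes, bitten"
for _ in range(50_000):      # each iteration is one viewer who sees the poll
    bitten = rng.random() < TRUE_RATE
    p_respond = P_RESPOND_BITTEN if bitten else P_RESPOND_NOT
    if rng.random() < p_respond:
        responses.append(bitten)

poll_estimate = sum(responses) / len(responses)
print(f"True rate among viewers: {TRUE_RATE:.0%}")
print(f"Poll estimate:           {poll_estimate:.0%}")
```

Under these assumptions the poll reports roughly double the true rate, even though every responder answers honestly.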

Activity 4: Identifying Bias 🔎

Bias in the Weight-Loss Study

Biases Present:

Volunteer bias: People who respond to ads may be more motivated than the general population
Nonresponse bias: Only those who complete 6 months are measured (people who quit or had bad reactions are excluded)
Measurement error: Self-reported weight
- People underestimate weight
- May be especially true if they want the $500!
Financial incentive: $500 may encourage favorable reporting

Better Design:

Random sampling from target population
Measured weight rather than self-reported
Track all participants, including dropouts
Blind measurement (staff taking measurements don’t know whether a participant got the drug or a placebo)
Include control group for comparison

What to Watch Out For

Reported vs. Measured

Better to take the measurements yourself when possible!

People round weight (167 → 160 lb)
People overestimate height
Memory is unreliable for quantities

Misleading Percentages

“Medication reduces migraines by 150%”

Problem: Can’t reduce by > 100%!

Example: 12% of 500 = (12/100) × 500 = 60
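
Both calculations can be checked in code. This is a minimal sketch; the function names `percent_of` and `percent_reduction` and the migraine counts are illustrative assumptions.

```python
def percent_of(percent, whole):
    """Return `percent` percent of `whole`."""
    return percent * whole / 100

def percent_reduction(before, after):
    """Percent by which a quantity fell from `before` to `after`."""
    return (before - after) * 100 / before

# 12% of 500, as in the example above.
print(percent_of(12, 500))

# Hypothetical patient: 10 migraines a month before treatment.
print(percent_reduction(10, 4))   # down to 4 a month
print(percent_reduction(10, 0))   # down to 0 a month: the largest possible reduction
```

A count cannot drop below zero, so the reduction tops out at 100%, which is why a claimed 150% reduction cannot be right.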

Correlation ≠ Causation

Just because two things are related doesn’t mean one causes the other!

Classic example:

Ice cream sales correlate with shark attacks
Both increase in summer!
Ice cream doesn’t cause shark attacks

Hidden variable: Temperature/season
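
The role of a hidden variable can be illustrated with simulated data. In this sketch, temperature drives both series and neither series affects the other. All numbers are made up, and `shark_attacks` is a hypothetical daily index rather than real attack counts.

```python
import random

rng = random.Random(11)   # arbitrary seed so the run is repeatable

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical daily data for one year.
temps = [rng.uniform(10, 35) for _ in range(365)]            # degrees Celsius
ice_cream = [20 * t + rng.gauss(0, 50) for t in temps]       # depends on temperature only
shark_attacks = [0.2 * t + rng.gauss(0, 1) for t in temps]   # depends on temperature only

r = pearson(ice_cream, shark_attacks)
print(f"Correlation between ice cream sales and shark attacks: {r:.2f}")
```

The two series are strongly correlated even though the code never lets one influence the other. The shared dependence on temperature produces the whole relationship.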

We’ll study causation more when we cover experimental design next week!

Wrap-Up & Looking Ahead 🎯

What We Covered Today

✅ Statistical thinking and the PPDAC cycle

✅ Key concepts: Population, sample, variable, parameter, statistic

✅ Variable types: Categorical (nominal, ordinal, binary) and numerical (discrete, continuous)

✅ Bias in data collection: Survivor bias, volunteer bias, nonresponse bias, measurement error

✅ Real-world applications through the LEAP study and CO detector example

Key Takeaway: Good statistics starts with good data. Always ask: How was this data collected?

Quick Knowledge Check ✅

Rate your confidence on each item (1-5 ⭐s) on Ed Discussion:

Can you now:

Explain what statistical thinking is and why we need it? ⭐⭐⭐⭐⭐
Identify the population, sample, variables, parameters, and statistics in a study? ⭐⭐⭐⭐⭐
Classify variables as categorical or numerical, and identify subtypes? ⭐⭐⭐⭐⭐
Recognize common sources of bias in data collection? ⭐⭐⭐⭐⭐
Evaluate whether a study’s conclusions are trustworthy based on how data was collected? ⭐⭐⭐⭐⭐

If your stars add up to more than 16 (out of 25), you’re ready to move forward! 🎉

If not, review Chapter 1 from the textbook and come to office hours.

Exit Ticket & Next Steps

📝 Before You Leave

Exit ticket: Complete today’s lecture summary
Check attendance: Did you complete Poll Everywhere activities?
Complete assignments:
- Survey 1 and syllabus acknowledgment → due tomorrow (Friday)

Coming Next Class:

Data Visualization & Summary Statistics
- How to create effective graphs
- Summarizing categorical data
- Introduction to distributions

Read: Textbook Sections 1.6, 1.7, 2.1

Great work today!

See you next class! 📊✨

Questions? Catch me after class or on Ed Discussion
