Lecture 2: Data Collection and Statistical Thinking

STAT 7 - Statistical Methods for the Biological, Environmental & Health Sciences

09 Jan 2026

Before We Start

📋 Quick Reminders

  • Syllabus Acknowledgement and Survey 1 close tomorrow. Please complete them before then.
  • HW 1 will be posted tomorrow (Friday)
  • Discussion sections start next week
  • Make sure you can access Poll Everywhere and Ed Discussion! Talk to me after class if you can’t.

Today’s Plan:

We’ll explore how to think statistically and learn about data collection methods that impact everything we do in this course.

Learning Objectives 🎯

What You’ll Learn Today

By the end of today’s class, you will be able to:

  1. Explain what statistical thinking is and why it matters for scientific research

  2. Identify the key components of a statistical study (population, sample, variables, parameters, statistics)

  3. Distinguish between different types of variables (categorical vs. numerical; nominal, ordinal, binary, discrete, continuous)

  4. Recognize common sources of bias in data collection

  5. Evaluate the quality and reliability of data based on collection methods

  6. Apply these concepts to real-world case studies in biological and health sciences

Motivating Case Study 🥜

Case Study: Preventing Peanut Allergies

The Problem:

  • Peanut allergies in children were increasing dramatically
  • Traditional advice: avoid peanuts in early childhood. But was this advice actually helping?

The Study: LEAP (2006-2009)

  • 640 UK infants with eczema or egg allergy
  • Randomly assigned to two groups:
    • Peanut consumption group
    • Peanut avoidance group
  • Tested at age 5 with oral food challenge

Key Question: Does early peanut consumption prevent allergies?

The Data: Raw Form

participant.ID  treatment.group     overall.V60.outcome
LEAP_100522     Peanut Consumption  PASS OFC
LEAP_103358     Peanut Consumption  PASS OFC
LEAP_105069     Peanut Avoidance    PASS OFC
LEAP_994047     Peanut Avoidance    PASS OFC
LEAP_997608     Peanut Consumption  PASS OFC

Individual-level data for five children shows their treatment group and outcome.

But 640 children are too many to look at individually…

The Data: Summarized

Summary Table

                    FAIL OFC  PASS OFC  Total
Peanut Avoidance          36       227    263
Peanut Consumption         5       262    267
Total                     41       489    530

  • Avoidance: 36/263 = 13.7% failed
  • Consumption: 5/267 = 1.9% failed
  • Difference: 11.8 percentage points
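
The arithmetic behind these summary numbers can be reproduced in a few lines. This is a minimal sketch that copies the counts from the summary table. The dictionary layout and the helper name failure_rate are illustrative choices, not part of the LEAP data files.

```python
# Counts copied from the summary table (rows: group, columns: OFC result).
table = {
    "Peanut Avoidance": {"fail": 36, "pass": 227},
    "Peanut Consumption": {"fail": 5, "pass": 262},
}


def failure_rate(counts):
    """Proportion of a group that failed the oral food challenge (OFC)."""
    total = counts["fail"] + counts["pass"]
    return counts["fail"] / total


rates = {group: failure_rate(counts) for group, counts in table.items()}
difference = rates["Peanut Avoidance"] - rates["Peanut Consumption"]

for group, rate in rates.items():
    print(f"{group}: {rate:.1%} failed")
print(f"Difference: {difference * 100:.1f} percentage points")
```

The difference is stated in percentage points because it is a subtraction of two percentages, not a percent change of one relative to the other.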

Visual Summary

The Statistical Question

Is the 11.8 percentage point difference real or just chance variation?

This is where statistical thinking comes in!

We need tools to:

  • Understand if this difference is meaningful
  • Account for uncertainty
  • Make reliable conclusions

Later in the course, you’ll learn how to answer this question using hypothesis testing.

What is Statistical Thinking? 📊

Why Do We Need Statistics?

The Belief Bias Effect

When asked if an argument is logically valid, people tend to be influenced by whether the conclusion seems believable, even when they shouldn’t be.

We naturally let our biases affect our reasoning!

Statistics helps us:

  • Overcome cognitive biases
  • Make objective decisions based on evidence
  • Quantify uncertainty
  • Distinguish signal from noise

Activity 1: Belief Bias 💭

Think-Pair-Share: Belief Bias

📝 Your Task (5 minutes total)

  1. Think individually (2 min): Can you think of one example of belief bias you’ve encountered? Consider:

    • News headlines you’ve seen
    • Health claims you’ve heard
    • Scientific “facts” people believe
  2. Pair with neighbor (2 min): Share and discuss both examples

  3. Share (1 min): Decide which example best illustrates belief bias and submit to Poll Everywhere (both partners submit the same answer)

PollEv.com/slugstats

What is Statistics?

American Statistical Association:

“Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty.”

Data Science:

An interdisciplinary field that uses various methods and processes to extract insight from data and apply actionable insight across application domains.

Key Ideas:

  • Statistics is about learning from evidence
  • Every conclusion comes with uncertainty
  • We need systematic methods to draw reliable conclusions

The Statistical Investigation Cycle

PPDAC: Problem → Plan → Data → Analysis → Conclusion

Every statistical study follows this cycle!

From Chris Wild: What is Statistics?

Two Types of Statistics

📊 Descriptive Statistics

Use numerical and graphical methods to:

  • Explore data
  • Look for patterns
  • Summarize information
  • Present in convenient form

Example: “In our sample, 13.7% of children in the avoidance group developed allergies”

🔮 Inferential Statistics

Use sample data to:

  • Make estimates
  • Make decisions
  • Make predictions
  • Generalize to larger populations

Example: “We conclude that peanut consumption reduces allergy risk in the broader population of at-risk children”

This course covers both! We start with description, then move to inference.

Key Statistical Concepts 🔑

The Building Blocks

Unit (Subject): An object we collect data about

  • Examples: People, animals, batteries, transactions

Population: The full set of units we’re interested in

  • Examples: All CA residents, entire whale species, all Amazon transactions

Sample: A subset of the population we actually observe

  • Examples: 1,000 CA residents, 25 blue whales, one hour of transactions

Variable: A characteristic we measure for each unit

  • Examples: Education level, whale mass, transaction size

Note: A census collects data from every member of a population (rare and expensive!)

Visualizing Population vs. Sample

Population

All units of interest (often unobserved)

Sample

Observed units (subset we measure)

Key insight: We use the sample (what we can observe) to learn about the population (what we want to know about)

Parameters vs. Statistics

📏 Parameter

A numerical measurement describing a population

  • True but usually unknown
  • Theoretical value
  • What we want to know

Examples:

  • True proportion of CA residents with college degrees
  • Average mass of all blue whales
  • Population mean battery lifespan

📊 Statistic

A numerical measurement describing a sample

  • Observed and known
  • Empirical value
  • What we actually calculate

Examples:

  • Proportion of college degrees in our 1,000-person sample
  • Average mass of our 25 sampled whales
  • Mean lifespan of 100 tested batteries

Statistics estimate parameters! We calculate statistics from our sample to guess the parameter values for the population.
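
To make the parameter/statistic distinction concrete, here is a small simulation sketch. Everything in it is hypothetical: a made-up population of 100,000 residents in which exactly 35% hold a college degree. In a real study the parameter would be unknown. Here we set it ourselves so we can see how close the statistic lands.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical population (assumed numbers, for illustration only):
# 1 = has a college degree, 0 = does not.
POPULATION_SIZE = 100_000
N_WITH_DEGREE = 35_000
population = [1] * N_WITH_DEGREE + [0] * (POPULATION_SIZE - N_WITH_DEGREE)

# Parameter: describes the whole population (normally unknown).
parameter = sum(population) / len(population)

# Statistic: describes the 1,000 units we actually observe.
sample = random.sample(population, 1_000)  # simple random sample, no repeats
statistic = sum(sample) / len(sample)

print(f"Parameter (population proportion): {parameter:.3f}")
print(f"Statistic (sample proportion):     {statistic:.3f}")
```

Changing the seed gives a different sample and a slightly different statistic, while the parameter stays fixed. That sample-to-sample variation is the uncertainty that inference has to account for.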

Activity 2: LEAP Study Components 🔍

Poll Everywhere: Identify the Components

For the LEAP peanut allergy study, identify each component:

Question 1: What is the unit (subject)?

A. Peanuts
B. Children
C. Allergy tests
D. Treatment groups

Question 2: What is the population?

A. All peanuts produced in the UK
B. UK children with eczema/egg allergy in 2006
C. All allergy tests in 2009
D. The 640 children enrolled

More LEAP Questions

Question 3: Is this a census or sample?

A. Census - measured everyone in the population
B. Sample - measured a subset of the population

Question 4: What is a variable in this study?

A. Children
B. Treatment assignment (consumption vs. avoidance)
C. The United Kingdom
D. 2006-2009

Question 5: The parameter of interest is:

A. The true proportion of all at-risk UK children who would develop allergies under each treatment
B. The 13.7% failure rate in our avoidance group
C. The 530 children in the study
D. Whether a child passed or failed the OFC

LEAP Study: Answers

Unit: Children (individuals with eczema/egg allergy)

Population: UK children aged 4-11 months with eczema/egg allergy in 2006

Sample: The 530 children with negative skin test who completed the study

Variables:

  • Treatment group (consumption/avoidance)
  • OFC result (pass/fail)
  • Age, severity of eczema, etc.

Parameter:

  • True proportion of the population who would develop allergies under avoidance: p₁
  • True proportion under consumption: p₂
  • We want to know: p₁ - p₂

Statistic:

  • Sample proportion with avoidance: 36/263 = 0.137
  • Sample proportion with consumption: 5/267 = 0.019
  • Observed difference: 0.137 - 0.019 = 0.118 (11.8 percentage points)

What About Infinite Populations?

Sometimes populations are conceptually infinite:

  • All possible coin flips
  • All possible times in a day to measure something
  • All future patients who might receive a treatment

In these cases, the parameter represents the theoretical property of the process generating the data.

Break Time! ☕ 5-minute break

Stretch, grab water, chat with neighbors!

We’ll resume with types of variables and data collection.

Types of Variables 📋

Two Main Categories

Categorical Variables

Variables that place individuals into groups or categories

Subtypes:

Nominal: Unordered categories

  • Blood type (A, B, AB, O)
  • Species of animal
  • Brand of detector

Ordinal: Ordered categories

  • Education level (HS, BS, MS, PhD)
  • Disease severity (mild, moderate, severe)
  • Course grade (A, B, C, D, F)

Binary: Only two outcomes

  • Yes/No, Pass/Fail, Alive/Dead
  • Gender (in some contexts)

Numerical Variables

Variables that take on numerical values where arithmetic makes sense

Subtypes:

Discrete: Countable values, often integers

  • Number of children
  • Number of mutations
  • Number of emergency room visits

Continuous: Any value in a range

  • Height, weight, temperature
  • Blood pressure
  • Reaction time

Visual Guide to Categorical Variable Types

Art by Allison Horst

Visual Guide to Numerical Variable Types

Art by Allison Horst

Why Do Variable Types Matter?

Different variable types require different methods!

Categorical variables:

  • Summarize with counts and proportions
  • Visualize with bar charts, pie charts
  • Analyze with chi-square tests, logistic regression

Numerical variables:

  • Summarize with means, medians, standard deviations
  • Visualize with histograms, boxplots, scatterplots
  • Analyze with t-tests, ANOVA, linear regression

Bottom line: Correctly identifying variable types is the first step in any data analysis!
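
As a rough illustration of why the type matters, the sketch below summarizes one categorical and one numerical variable using only Python's standard library. The eight records are invented for this example and do not come from any study mentioned in the lecture.

```python
from collections import Counter
from statistics import mean, median, stdev

# Made-up records for illustration only.
blood_types = ["A", "O", "O", "B", "A", "O", "AB", "O"]  # categorical (nominal)
er_visits = [0, 2, 1, 0, 3, 1, 0, 1]                     # numerical (discrete)

# Categorical: counts and proportions are the natural summaries.
counts = Counter(blood_types)
proportions = {category: n / len(blood_types) for category, n in counts.items()}

# Numerical: arithmetic makes sense, so we can use mean, median, and SD.
summary = {
    "mean": mean(er_visits),
    "median": median(er_visits),
    "sd": stdev(er_visits),  # sample standard deviation
}

print("Counts:", dict(counts))
print("Proportions:", proportions)
print("Numerical summary:", summary)
```

Note that calling mean() on the blood types would raise an error, since there is no sensible average of "A" and "O".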

Activity 3: Variable Type Practice 🏋️

Classify These Variables

For each variable, identify the type (nominal, ordinal, binary, discrete, or continuous):

  1. Number of doctor visits in the past year
    • Discrete (countable)
  2. Patient satisfaction rating (Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied)
    • Ordinal (ordered categories)
  3. Blood glucose level (mg/dL)
    • Continuous (measurable)
  4. Type of cancer (breast, lung, colon, etc.)
    • Nominal (unordered categories)
  5. Did the patient survive? (Yes/No)
    • Binary (two outcomes)
  6. Body temperature (°F)
    • Continuous (measurable)

CO Detector Study Variables

Recall the carbon monoxide detector study. Classify these variables:

  1. Does the alarm go off under hazardous conditions? (Yes/No)
    • Binary categorical
  2. Functionality timing (too late, just in time, too early)
    • Ordinal categorical
  3. Distance from CO source (in meters)
    • Continuous numerical
  4. Brand (Kidde, First Alert, etc.)
    • Nominal categorical
  5. Location in building (basement, first floor, second floor)
    • Nominal categorical (or ordinal if you consider height)
  6. Number of CO detectors in building
    • Discrete numerical
  7. Age of detector (in months)
    • Discrete numerical (or continuous, depending on precision)

Data Collection & Bias ⚠️

Why Data Quality Matters

“The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

— John Tukey, legendary statistician

Key principle:

Your analysis is only as good as your data collection!

  • Bad data → bad conclusions
  • Biased sample → biased results
  • Poorly measured variables → unreliable estimates

Common Sources of Bias

Undercoverage: Some groups are systematically excluded from the sample

  • Phone surveys miss people without phones

Nonresponse bias: Those who respond differ from those who don’t

  • Surveys with low response rates tend to over-represent people with strong opinions

Voluntary response bias: People self-select into the sample

  • Online polls, call-in surveys

Common Sources of Bias (continued)

Measurement error: Measurements are systematically wrong

  • Self-reported weight vs. actual weight
  • Malfunctioning instruments

Loaded/ambiguous questions: Question wording influences response

  • “Should the president have line-item veto to eliminate waste?” (97% yes)
  • “Should the president have line-item veto, or not?” (57% yes)

Question order effects: Earlier questions influence later responses

Example: Survivor Bias

WWII Bomber Study

During WWII, military officials wanted to determine where to add extra armor to bombers.

They recorded damage on planes that returned from missions.

Initial thought: Add armor where we see the most damage (wings, fuselage)

Problem: What about the planes that didn’t return?

Correct answer: Add armor where returning planes show little damage (engines, cockpit) — because planes hit there didn’t make it back!

Survivor bias: Only observing “survivors” gives a misleading picture of the full population
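
The logic can be checked with a small calculation. The numbers below are assumptions chosen for illustration, not historical data: every plane takes one hit, hits are equally likely in each area, and the chance of returning depends on where the plane was hit.

```python
# Hypothetical inputs (assumed for illustration, not historical data).
hit_share = {"wings": 0.25, "fuselage": 0.25, "engines": 0.25, "cockpit": 0.25}
return_prob = {"wings": 0.9, "fuselage": 0.9, "engines": 0.3, "cockpit": 0.3}

# Fraction of all planes that were hit in each area AND made it back.
returned = {area: hit_share[area] * return_prob[area] for area in hit_share}
total_returned = sum(returned.values())

# What analysts see: share of damage by area among returning planes only.
observed_share = {area: returned[area] / total_returned for area in returned}

for area in hit_share:
    print(f"{area:9s} true share of hits: {hit_share[area]:.1%}   "
          f"seen on returning planes: {observed_share[area]:.1%}")
```

Under these assumptions, engine hits make up 25% of all hits but only 12.5% of the damage seen on returning planes, so the areas that look safest in the data are the most dangerous ones.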

Example: Voluntary Response Bias

TikTok Poll

An influencer posted: “Have you ever been bitten by an animal?”

  • TikTok users choose whether to respond
  • 2,361 responses received
  • 65% said “yes”, 35% said “no”

Questions:

  1. Are there issues with this data collection?
  2. Is 65% a good estimate for all people?

Problems:

  • People bitten are more likely to respond
  • TikTok users ≠ general population
  • No random sampling
  • Voluntary participation

Likely truth: Far less than 65% of all people have been bitten!
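
A short calculation shows how self-selection can inflate a poll result. The response rates here are hypothetical: we assume 30% of viewers have truly been bitten and that bitten viewers are four times as likely to answer the poll. The function name observed_yes_share is an illustrative choice.

```python
def observed_yes_share(true_yes, respond_if_yes, respond_if_no):
    """Expected share of 'yes' answers among people who choose to respond."""
    yes_responders = true_yes * respond_if_yes
    no_responders = (1 - true_yes) * respond_if_no
    return yes_responders / (yes_responders + no_responders)


# Assumed values: 30% truly bitten; 8% of bitten viewers respond,
# but only 2% of never-bitten viewers do.
biased = observed_yes_share(true_yes=0.30, respond_if_yes=0.08, respond_if_no=0.02)

# If both groups were equally likely to respond, the poll would be unbiased.
unbiased = observed_yes_share(true_yes=0.30, respond_if_yes=0.05, respond_if_no=0.05)

print(f"Poll result with self-selection: {biased:.1%}")
print(f"Poll result with equal response: {unbiased:.1%}")
```

With these made-up rates, a true value of 30% shows up as roughly 63% in the poll. Collecting more responses would not fix this, because the bias comes from who responds, not how many.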

Activity 4: Identifying Bias 🔎

Think-Pair-Share: Bias in Studies

📝 Your Task (10 minutes total)

Consider this scenario:

A pharmaceutical company wants to test a new weight-loss drug. They post ads on social media asking for volunteers. Participants who complete the 6-month study will receive $500. The company measures weight loss by asking participants to self-report their weight at the beginning and end of the study.

Questions to discuss:

  1. Think (2 min): What sources of bias can you identify?

  2. Pair (5 min): Share with neighbor, identify at least 3 different bias types

  3. Share (3 min): Groups share their answers via Poll Everywhere

Bias in the Weight-Loss Study

Biases Present:

  1. Volunteer bias: People who respond to ads may be more motivated than the general population

  2. Nonresponse bias: Only those who complete 6 months are measured (people who quit or had bad reactions are excluded)

  3. Measurement error: Self-reported weight

    • People tend to underreport their weight
    • May be especially true if they want the $500!
  4. Financial incentive: $500 may encourage favorable reporting

Better Design:

  • Random sampling from target population
  • Measured weight rather than self-reported
  • Track all participants, including dropouts
  • Blind measurement (staff measuring don’t know if participant got drug or placebo)
  • Include control group for comparison

What to Watch Out For

Reported vs. Measured

Better to measure yourself when possible!

  • People round weight (167 → 160 lb)
  • People overestimate height
  • Memory is unreliable for quantities

Misleading Percentages

“Medication reduces migraines by 150%”

Problem: Can’t reduce by > 100%!

Example: 12% of 500 = (12/100) × 500 = 60
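
Both percentage ideas on this slide can be written as one-line formulas. The sketch below uses two helper functions whose names are illustrative, and the migraine counts are invented to show why a reduction tops out at 100%.

```python
def percent_of(percent, whole):
    """Return `percent`% of `whole`, e.g. 12% of 500."""
    return percent / 100 * whole


def percent_change(old, new):
    """Percent change from old to new (negative means a reduction)."""
    return (new - old) / old * 100


print(percent_of(12, 500))  # the slide's example: 12% of 500

# Hypothetical patient: 10 migraines per month before treatment.
print(percent_change(10, 5))  # half as many migraines
print(percent_change(10, 0))  # no migraines at all: the largest possible reduction
```

Since a count of migraines cannot go below zero, the change can never be lower than -100%, which is why a claimed "150% reduction" cannot be a percent reduction in the usual sense.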

Correlation ≠ Causation

Just because two things are related doesn’t mean one causes the other!

Classic example:

  • Ice cream sales correlate with shark attacks
  • Both increase in summer!
  • Ice cream doesn’t cause shark attacks

Hidden variable: Temperature/season

We’ll study causation more when we cover experimental design next week!

Wrap-Up & Looking Ahead 🎯

What We Covered Today

Statistical thinking and the PPDAC cycle

Key concepts: Population, sample, variable, parameter, statistic

Variable types: Categorical (nominal, ordinal, binary) and numerical (discrete, continuous)

Bias in data collection: Survivor bias, volunteer bias, nonresponse bias, measurement error

Real-world applications through the LEAP study and CO detector example

Key Takeaway: Good statistics starts with good data. Always ask: How was this data collected?

Quick Knowledge Check ✅

Rate your confidence (1-25 ⭐s) on Ed Discussion:

Can you now:

If your stars add up to more than 16, you’re ready to move forward! 🎉

If not, review Chapter 1 from the textbook and come to office hours.

Exit Ticket & Next Steps

📝 Before You Leave

  1. Exit ticket: Complete today’s lecture summary

  2. Check attendance: Did you complete Poll Everywhere activities?

  3. Complete assignments:

    • Survey 1 and syllabus acknowledgment → due tomorrow (Friday)

Coming Next Class:

Data Visualization & Summary Statistics

  • How to create effective graphs
  • Summarizing categorical data
  • Introduction to distributions

Read: Textbook Sections 1.6, 1.7, 2.1

Great work today!

See you next class! 📊✨

Questions? Catch me after class or on Ed Discussion