Lecture 3: Study Design & Data Summarization

STAT 7 - Statistical Methods for the Biological, Environmental & Health Sciences

10 Mar 2026

Learning Objectives

By the end of today’s lecture, you will be able to:

  • Distinguish between observational studies and experiments
  • Explain why experiments allow us to determine cause-and-effect relationships
  • Recognize key features of well-designed studies (randomization, control, replication)
  • Identify different sampling methods and when to use them
  • Calculate and interpret basic summary statistics
  • Understand when to use different measures of center and spread

Recap from Last Time

  • Statistics - why statistics?
  • Belief bias
  • Descriptive vs Inferential statistics
  • Data (variables, observational units/cases)
  • Unit (subject), Population, Sample
  • Parameter and Statistic
  • Sample Bias & Anecdotal evidence
  • Types of variables

Today’s Roadmap

  1. Case Study: Preventing peanut allergies
  2. Theory: Experiments vs observational studies
  3. Activity: Design an observational vs experimental study
  4. Theory: Sampling methods
  5. Activity: Air pollution sampling design
  6. Break
  7. Theory: Numerical data summarization
  8. Activity: Water quality analysis
  9. Theory: Categorical data summarization
  10. Activity: Marine research data tables

Part 1: Study Design

Case Study: Preventing Peanut Allergies

The Problem:

  • Peanut allergies in children were increasing dramatically
  • Traditional advice: avoid peanuts in early childhood. But was this advice actually helping?

The Study: LEAP (Learning Early About Peanut Allergy)

  • 640 UK infants with eczema or egg allergy
  • Randomly assigned to two groups:
    • Peanut consumption group
    • Peanut avoidance group
  • 2006-2009

Key Question: Does early peanut consumption prevent allergies?

The LEAP Study (cont’d)

At 5 years of age:

  • Each child was tested for peanut allergy using an oral food challenge (OFC)
  • 5 grams of peanut protein in a single dose
  • Pass: No allergic reaction detected
  • Fail: Allergic reaction occurred

Main analysis: 530 children with an earlier negative skin test

Poll Question

Is this an experiment or an observational study?

Go to: PollEv.com/slugstats

Study Design: Key Concepts

Experimental Study

Based on three principles:

  1. Control
  2. Randomization
  3. Replication

Compares responses across treatment levels

✅ Can establish causation

Observational Study

Researchers simply observe:

  • Potential explanatory variables
  • Response variables

Participants may differ in ways that influence response

⚠️ Can only establish association (correlation), not causation

Three Principles of Experiments

1. Control

When selecting participants, researchers work to control for extraneous variables and choose a representative sample

2. Randomization

Randomly assigning participants to treatment groups tends to balance the groups, on average, with respect to both controlled and uncontrolled variables

3. Replication

Larger studies are more reliable: the more cases observed, the more precisely an effect can be estimated. A larger sample is more likely to be representative of the population only if it is collected without bias
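
As a minimal sketch of the randomization principle, the snippet below shuffles a list of participant IDs and deals them into two groups. The IDs, the seed, and the function name `randomly_assign` are illustrative assumptions; only the total of 640 infants and the two group names come from the LEAP description. The trial's actual allocation procedure is not reproduced here.

```python
import random


def randomly_assign(participant_ids, group_names, seed=None):
    """Shuffle participants, then deal them into groups round-robin."""
    rng = random.Random(seed)  # local generator, so a seed makes the result reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    assignment = {name: [] for name in group_names}
    for position, pid in enumerate(shuffled):
        assignment[group_names[position % len(group_names)]].append(pid)
    return assignment


# Hypothetical participant IDs 1..640 (the real study IDs are not known here)
leap_groups = randomly_assign(range(1, 641), ["consumption", "avoidance"], seed=7)
print({name: len(members) for name, members in leap_groups.items()})
```

Because chance alone decides who lands in which group, any characteristic of the infants, measured or not, has no systematic way to line up with the treatment.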

Activity: Redesigning the LEAP Study

Your Task (3 minutes)

Imagine you have to design a similar study (same objectives) but using an observational design instead of experimental.

Would it work? Why or why not?

Discussion: What would be different? What problems might arise?

Poll: PollEv.com/slugstats

Summary

Part 2: Sampling Methods

Why Does Sampling Matter?

  • Almost all statistical methods are based on the notion of implied randomness
  • If observational data are not collected in a random framework, these statistical methods are not reliable
  • The estimates and errors associated with the estimates cannot be trusted

Sampling Methods Overview

Most commonly used random sampling techniques:

  • Simple Random Sample
  • Stratified Sample
  • Cluster Sample
  • Multistage Sample

Note that there are many other sampling methods (systematic, convenience, etc.) but these are less commonly used in formal studies.

Simple Random Sample

Randomly select cases from the population

  • No implied connection between selected points
  • Every member has equal probability of selection
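
A simple random sample can be sketched in a few lines of Python. The site names, the sample size of 20, and the helper name `simple_random_sample` are made up for illustration.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct cases; every case has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(list(population), k)  # sampling without replacement


# Hypothetical sampling frame of 200 monitoring sites
sites = [f"site_{i:03d}" for i in range(1, 201)]
srs = simple_random_sample(sites, 20, seed=1)
print(srs[:5])
```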

Stratified Sample

Strata are made up of similar observations

  • Take a simple random sample from each stratum
  • Ensures representation from each subgroup
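
One way to sketch stratified sampling is to draw a simple random sample of the same fraction from every stratum. The strata names and sizes, the 10% fraction, and the helper name `stratified_sample` are all hypothetical.

```python
import random


def stratified_sample(strata, fraction, seed=None):
    """Take a simple random sample of the given fraction from each stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        k = max(1, round(fraction * len(units)))  # at least one case per stratum
        sample[name] = rng.sample(units, k)
    return sample


# Hypothetical strata of different sizes
strata = {
    "urban": [f"urban_{i}" for i in range(40)],
    "industrial": [f"industrial_{i}" for i in range(20)],
    "suburban": [f"suburban_{i}" for i in range(30)],
    "rural": [f"rural_{i}" for i in range(10)],
}
picked = stratified_sample(strata, fraction=0.10, seed=2)
print({name: len(units) for name, units in picked.items()})
```

Using the same fraction in every stratum gives a proportional allocation; a fixed count per stratum would be another reasonable design choice.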

Cluster Sample

Clusters are usually NOT homogeneous

  • Take a simple random sample of clusters
  • Sample ALL observations in selected clusters
  • Usually preferred for economic reasons

Multistage Sample

Combination approach

  • Take a simple random sample of clusters
  • Then take a simple random sample of observations from selected clusters
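
The difference between cluster and multistage sampling can be sketched side by side: both start by sampling clusters, but only the multistage design samples again within the chosen clusters. The neighborhood names, cluster sizes, and function names below are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose clusters, then keep ALL observations in each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


def multistage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Randomly choose clusters, then randomly sample within each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }


# Hypothetical: 6 neighborhoods (clusters) with 8 measurement locations each
neighborhoods = {
    f"nbhd_{c}": [f"nbhd_{c}_loc_{i}" for i in range(8)] for c in "ABCDEF"
}
one_stage = cluster_sample(neighborhoods, n_clusters=2, seed=3)
two_stage = multistage_sample(neighborhoods, n_clusters=2, n_per_cluster=3, seed=3)
print({k: len(v) for k, v in one_stage.items()})
print({k: len(v) for k, v in two_stage.items()})
```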

Activity: Air Pollution Study Design

Scenario

You are an environmental scientist assessing air pollution (PM2.5 levels) in a metropolitan area with diverse zones:

  • Urban areas (high traffic, high population)
  • Industrial areas (factories, emissions)
  • Suburban areas (moderate population, residential)
  • Rural areas (low population, agricultural)

Activity: Air Pollution Study Design (cont’d)

Your Task

Which sampling method would you choose?

A. Simple Random Sampling across the entire region

B. Stratified Sampling (proportional to each zone)

C. Cluster Sampling (select neighborhoods, measure all locations)

Poll: PollEv.com/slugstats

Break Time! ☕ 5-minute break

Stretch, grab water, chat with neighbors!

We’ll resume with numerical data summarization.

Part 3: Numerical Data Summarization

Water Quality Case Study

You are an environmental scientist studying water quality in a local river. You’ve collected data on nitrate concentrations (mg/L) from 10 sampling points:

3.2, 4.8, 4.8, 6.5, 7.0, 8.2, 8.2, 9.1, 9.1, 10.3

Question: How do we summarize this data?

Measures of Central Tendency

Mean: Average of all values \[\text{Mean} = \bar{x} = \frac{\sum x_i}{n}\]

Median: Middle value when data is ordered

Mode: Most frequently occurring value(s)
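
These three measures can be checked with Python's standard `statistics` module. The sketch below uses the ten nitrate values from the water quality case study; the variable names are illustrative.

```python
from statistics import mean, median, multimode

# Nitrate concentrations (mg/L) from the 10 sampling points
nitrate = [3.2, 4.8, 4.8, 6.5, 7.0, 8.2, 8.2, 9.1, 9.1, 10.3]

center = {
    "mean": mean(nitrate),        # sum of the values divided by n
    "median": median(nitrate),    # average of the two middle values when n is even
    "modes": multimode(nitrate),  # every value tied for the highest count
}
print(center)
```

`multimode` returns a list because a data set can have several modes, or every value can tie if nothing repeats.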

Measures of Variability

Standard Deviation (SD): Roughly the typical distance of observations from the mean \[s = \sqrt{\frac{\sum(x_i - \bar{x})^2}{n-1}}\]

Range: Maximum - Minimum

Interquartile Range (IQR): Q3 - Q1 (middle 50% of data)
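
A short sketch of the three spread measures, again using the nitrate values from the case study. Note that quartile conventions differ between textbooks and software; this example assumes the default method of `statistics.quantiles`, so other tools may report slightly different Q1 and Q3 values.

```python
from statistics import quantiles, stdev

# Nitrate concentrations (mg/L) from the 10 sampling points
nitrate = [3.2, 4.8, 4.8, 6.5, 7.0, 8.2, 8.2, 9.1, 9.1, 10.3]

sd = stdev(nitrate)                        # sample SD, divides by n - 1
value_range = max(nitrate) - min(nitrate)  # maximum minus minimum
q1, _, q3 = quantiles(nitrate, n=4)        # three cut points: Q1, median, Q3
iqr = q3 - q1                              # width of the middle 50% of the data

print(f"SD = {sd:.2f}, range = {value_range:.1f}, IQR = {iqr:.1f}")
```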

Calculating Summary Statistics

Data: 3.2, 4.8, 4.8, 6.5, 7.0, 8.2, 8.2, 9.1, 9.1, 10.3

\[\text{Mean} = \frac{3.2 + 4.8 + 4.8 + 6.5 + 7.0 + 8.2 + 8.2 + 9.1 + 9.1 + 10.3}{10} = \frac{71.2}{10} = 7.12\]

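\[\sum(x_i - \bar{x})^2 = (3.2 - 7.12)^2 + (4.8 - 7.12)^2 + \dots + (10.3 - 7.12)^2 \approx 46.82\]

\[\text{Variance} = s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1} = \frac{46.82}{9} = 5.20\]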

\[\text{SD} = \sqrt{5.20} = 2.28 \text{ mg/L}\]

Activity: Second Water Sample

Your Turn! (10 minutes)

Calculate the summary statistics for this second water sample:

0.5, 1.0, 3.0, 5.0, 7.0, 7.2, 9.0, 10.0, 12.0, 18.0

Calculate:

  • Mean
  • Median
  • Mode
  • Standard Deviation
  • Range

Comparing Two Samples

Sample 1:

  • Mean: 7.1
  • Median: 7.6
  • Mode: 4.8, 8.2, 9.1
  • SD: 2.28
  • Range: 7.1

Sample 2:

  • Mean: 7.3
  • Median: 7.1
  • Mode: None (no value repeats)
  • SD: 5.34
  • Range: 17.5

Discussion Questions

  • Are the center and spread measures the same?
  • Is the mean different from the median in both?
  • Is the range comparable?
  • Do we care about the mode(s)?

Part 4: Categorical Data Summarization

Marine Science Research Example

Researcher_ID | Ecosystem_Studied | Research_Focus       | Conservation_Challenge
1             | Coral Reefs       | Coral Bleaching      | Overfishing
2             | Open Ocean        | Marine Biodiversity  | Climate Change
3             | Estuaries         | Carbon Sequestration | Coastal Development
4             | Estuaries         | Nutrient Cycling     | Pollution
5             | Estuaries         | Hydrothermal Vents   | Pollution
6             | Kelp Forests      | Species Interactions | Ocean Acidification
7             | Seagrass Meadows  | Habitat Restoration  | Agricultural Runoff
8             | Rocky Intertidal  | Invasive Species     | Climate Change
9             | Salt Marshes      | Erosion Control      | Sea Level Rise
10            | Open Ocean        | Fisheries Management | Overfishing

Categorical Data: Frequency Tables

How do we summarize categorical data?

  1. Frequency Tables (counts)
  2. Relative Frequency Tables (proportions/percentages)
  3. Cross-tabulation (relationships between variables)
  4. Plots (we’ll see these next week!)

Example: Conservation Challenges

Conservation Challenge | Absolute Frequency | Relative Frequency
Pollution              | 2                  | 0.20
Overfishing            | 2                  | 0.20
Climate Change         | 2                  | 0.20
Coastal Development    | 1                  | 0.10
Ocean Acidification    | 1                  | 0.10
Agricultural Runoff    | 1                  | 0.10
Sea Level Rise         | 1                  | 0.10
TOTAL                  | 10                 | 1.00
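
The same frequency table can be built in code. This is a minimal sketch using `collections.Counter`, with the Conservation_Challenge column typed in by hand in researcher order.

```python
from collections import Counter

# Conservation_Challenge for researchers 1-10, copied from the data table
challenges = [
    "Overfishing", "Climate Change", "Coastal Development", "Pollution",
    "Pollution", "Ocean Acidification", "Agricultural Runoff",
    "Climate Change", "Sea Level Rise", "Overfishing",
]

counts = Counter(challenges)  # absolute frequencies
total = sum(counts.values())  # number of researchers

# Relative frequency = count divided by the total number of cases
for challenge, count in counts.most_common():
    print(f"{challenge:<20} {count:>2}  {count / total:.2f}")
print(f"{'TOTAL':<20} {total:>2}  {1:.2f}")
```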

Activity: Your Turn - Ecosystem Table

Your Task (5 minutes)

Create a frequency table for Ecosystem_Studied using the data provided.

Calculate:

  1. Absolute frequencies (counts)
  2. Relative frequencies (proportions)

Can you also create a cross-table between Ecosystem Studied and Research Focus?

Cross-Tabulation Example

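Ecosystem Studied × Conservation Challenge (rows: challenge; columns: ecosystem)

Conservation Challenge | Estuaries | Open Ocean | Other | Total
Pollution              | 2         | 0          | 0     | 2
Climate Change         | 0         | 1          | 1     | 2
Overfishing            | 0         | 1          | 1     | 2
Other                  | 1         | 0          | 3     | 4
Total                  | 3         | 2          | 5     | 10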
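
A cross-tabulation is a count of (row, column) pairs. The sketch below reproduces the table from the ten researcher records, collapsing less frequent categories into "Other" as the table does; the helper name `collapse` and the print layout are illustrative choices.

```python
from collections import Counter

# (Ecosystem_Studied, Conservation_Challenge) for researchers 1-10
records = [
    ("Coral Reefs", "Overfishing"),
    ("Open Ocean", "Climate Change"),
    ("Estuaries", "Coastal Development"),
    ("Estuaries", "Pollution"),
    ("Estuaries", "Pollution"),
    ("Kelp Forests", "Ocean Acidification"),
    ("Seagrass Meadows", "Agricultural Runoff"),
    ("Rocky Intertidal", "Climate Change"),
    ("Salt Marshes", "Sea Level Rise"),
    ("Open Ocean", "Overfishing"),
]

ecosystems = ["Estuaries", "Open Ocean", "Other"]                    # columns
challenges = ["Pollution", "Climate Change", "Overfishing", "Other"]  # rows


def collapse(value, kept):
    """Keep the listed categories and group everything else as 'Other'."""
    return value if value in kept else "Other"


# Count each (challenge, ecosystem) combination
cells = Counter(
    (collapse(challenge, challenges), collapse(ecosystem, ecosystems))
    for ecosystem, challenge in records
)

print(f"{'':<16}" + "".join(f"{e:>12}" for e in ecosystems) + f"{'Total':>8}")
for c in challenges:
    row = [cells[(c, e)] for e in ecosystems]
    print(f"{c:<16}" + "".join(f"{n:>12}" for n in row) + f"{sum(row):>8}")

column_totals = [sum(cells[(c, e)] for c in challenges) for e in ecosystems]
print(f"{'Total':<16}" + "".join(f"{n:>12}" for n in column_totals) + f"{sum(column_totals):>8}")
```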

Should We Calculate by Hand?

Yes, for learning:

  • Understand the process
  • Check computer output
  • Debug code
  • Build intuition

No, for research:

  • Avoid calculation errors
  • Improve efficiency
  • Enable reproducibility
  • Allow others to verify

Correlation (Preview)

We’ll explore this more next time, but here’s a preview:

Interactive tool: https://istats.shinyapps.io/guesscorr/

Key Takeaways

  1. Experiments allow causal conclusions; observational studies show correlation
  2. Good sampling requires randomization - different methods for different situations
  3. Numerical data: summarize with measures of center and spread
  4. Categorical data: summarize with frequency tables and cross-tabulations
  5. Always consider what question you’re trying to answer when choosing summaries

Quick Knowledge Check ✅

Rate your confidence (1-25 ⭐s) on Ed Discussion:

Can you now:

If your stars add up to more than 16, you’re ready to move forward! 🎉

If not, review Chapter 1 from the textbook and come to office hours.

Exit Ticket

  1. Summary: Post your self-assessment in Ed Discussion. Please reply to the poll only; there is no need to leave comments.

  2. Attendance: Did you complete at least one attendance activity? If not, see me now!

  3. Complete:

    • HW 1 (due Friday)
    • DSA 1 (due after your Discussion Section)

Looking Ahead

Next class:

  • Data visualization
  • Relationship between variables
  • Introduction to probability

Readings:

  • Sections 1.6, 1.7, 2.1

Great work today!

See you next class! 📊✨

Questions? Catch me after class or on Ed Discussion