STAT 7 Midterm Review

Statistical Methods for Biological, Environmental, and Health Sciences

Midterm Logistics

When: Thursday 5:20 pm, usual classroom

Format:

15 Multiple Choice Questions (30 points): These test your conceptual understanding and ability to apply concepts to biological, environmental, and health science contexts.
2 Free Response Questions (20 points): These require you to show your work, explain your reasoning, and interpret results in context.
Bring a calculator, a pen or pencil, and a photo ID

What’s covered: Everything we have discussed so far.

Midterm Tips

Review homework and class examples
Make sure you have a conceptual understanding of the material
Complete the practice problems (HW4): https://malfaro2.github.io/STAT7/2026/assignments/MidtermPractice.html. These are similar to exam questions and will help you practice, though they won’t be identical to what’s on the exam.
Attend office hours if you have questions

What We’ll REVIEW Today

Foundations: Statistics basics, variables, study design
Descriptive Statistics: Summaries and distributions
Visualization: Graphs and interpretation
Probability Basics: Rules and calculations
Conditional Probability: Sensitivity, specificity, Bayes’ Theorem
Random Variables & Distributions: Binomial and Normal

Questions welcome throughout!

Part 1: Foundations

What is Statistics?

Statistics is the science of collecting, analyzing, and interpreting data to answer questions and make decisions

Why it matters in biological & health sciences:

Design experiments and clinical trials
Analyze patient outcomes
Understand disease patterns
Evaluate treatment effectiveness
Make evidence-based medical decisions

Types of Variables

Numerical (Quantitative):

Continuous: Can take any value in a range
- Examples: height, weight, blood pressure
Discrete: Countable values
- Examples: number of cells, hospital visits, offspring

Categorical (Qualitative):

Nominal: Categories with no natural order
- Examples: blood type, species, treatment group
Ordinal: Categories with natural order
- Examples: disease stage (I, II, III), pain level (mild, moderate, severe)

Study Design: Observational vs Experimental

Observational Study:

Observe and measure without intervention
Cannot establish causation
Example: Survey smokers vs non-smokers on lung health

Experiment:

Researchers impose treatments
Can establish cause-and-effect
Example: Randomly assign patients to drug vs placebo

Important

Only well-designed experiments allow causal conclusions!

Key Features of Well-Designed Studies

Random Assignment: Participants randomly assigned to treatment groups
- Controls for confounding variables
Control Group: Comparison group (often receives placebo)
Blinding:
- Single-blind: Participants don’t know their group
- Double-blind: Neither participants nor researchers know
Replication: Large sample size for reliable results

Common Biases

Convenience Sampling:

Sampling whoever is easiest to reach
May not represent population

Voluntary Response Bias:

People choose to participate
Often those with strong opinions

Confounding:

Third variable affects both treatment and outcome
Makes it unclear what caused the result

Part 2: Descriptive Statistics

Measures of Center

Mean (Average):

Sum of values divided by count
Sensitive to outliers

Median:

Middle value when data ordered
Resistant to outliers

Mode:

Most frequent value
Can have multiple modes or no mode

Tip

Use median for skewed data, mean for symmetric data
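To see why the median is preferred for skewed data, here is a minimal Python sketch. The hospital-stay values are made up for illustration; they are not from the course materials.

```python
from statistics import mean, median

# Hypothetical hospital stays in days; the last value is an outlier.
stays = [2, 3, 3, 4, 5, 6, 40]

mean_all = mean(stays)      # pulled upward by the 40-day stay
median_all = median(stays)  # middle value of the sorted data

# Dropping the outlier moves the mean a lot and the median only slightly.
trimmed = stays[:-1]
mean_trimmed = mean(trimmed)
median_trimmed = median(trimmed)

print(mean_all, median_all)
print(mean_trimmed, median_trimmed)
```

The mean drops from 9 to about 3.8 when the outlier is removed, while the median only moves from 4 to 3.5.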

Measures of Spread

Range:

Maximum - Minimum
Very sensitive to outliers

Interquartile Range (IQR):

IQR = Q3 - Q1
Middle 50% of data
Resistant to outliers

Standard Deviation (SD):

Roughly the typical distance of values from the mean
Sensitive to outliers
Used with mean
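A small Python sketch of range and SD, using made-up data. It assumes the sample SD (dividing by n - 1), which is what `statistics.stdev` computes; check which version your class notes use.

```python
from statistics import stdev

# Hypothetical measurements (illustrative only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)  # Maximum - Minimum
sd = stdev(data)                    # sample SD, divides by n - 1

# Adding a single extreme value inflates both measures.
with_outlier = data + [30]
range_with_outlier = max(with_outlier) - min(with_outlier)
sd_with_outlier = stdev(with_outlier)

print(data_range, round(sd, 2))
print(range_with_outlier, round(sd_with_outlier, 2))
```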

Distribution Shape

Symmetric:

Mirror image around center
Mean ≈ Median

Right-Skewed (Positive skew):

Long tail to the right
Mean > Median

Left-Skewed (Negative skew):

Long tail to the left
Mean < Median

Describing a Distribution

Always include three components:

Shape: Symmetric, skewed, bimodal?
Center: What’s the typical value? (mean or median)
Spread: How much variability? (SD or IQR)

Also mention:

Outliers (unusual values)
Context (what the data represent)

Part 3: Data Visualization

Choosing the Right Graph

Variable Type(s)	Graph Type
One categorical	Bar chart
One numerical	Histogram or box plot
Two categorical	Segmented bar chart or mosaic plot
Numerical + Categorical	Side-by-side box plots
Two numerical	Scatterplot

Histograms

Purpose: Show distribution of numerical variable

Key features:

Bars touch (continuous data)
X-axis: variable values
Y-axis: frequency or count
Can see shape, center, spread, outliers

Box Plots

Shows five-number summary:

Minimum
Q1 (25th percentile)
Median (Q2, 50th percentile)
Q3 (75th percentile)
Maximum

Box: IQR (middle 50%)

Whiskers: Extend to min/max (excluding outliers)

Outliers: Shown as individual points

Outlier Detection: IQR Method

Step 1: Calculate IQR = Q3 - Q1

Step 2: Calculate fences:

Lower fence = Q1 - 1.5 × IQR
Upper fence = Q3 + 1.5 × IQR

Step 3: Values outside fences are outliers

Impact: Outliers can greatly affect the mean and SD, but have little effect on the median and IQR
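The three steps above can be sketched in Python. The data are hypothetical, and `iqr_fences` is a helper name chosen for this example. Software and textbooks use slightly different quartile conventions, so Q1 and Q3 here may differ a little from a hand calculation.

```python
from statistics import quantiles

def iqr_fences(data):
    """Return (lower_fence, upper_fence) using the 1.5 x IQR rule."""
    # quantiles(data, n=4) returns the three cut points Q1, median, Q3.
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data with one suspiciously large value.
values = [4, 5, 5, 6, 6, 7, 7, 8, 20]
low, high = iqr_fences(values)

# Step 3: anything outside the fences is flagged as an outlier.
outliers = [v for v in values if v < low or v > high]
print(low, high, outliers)
```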

Scatterplots

Purpose: Show relationship between two numerical variables

Look for:

Direction: Positive, negative, or no association
Form: Linear, curved, clusters
Strength: How closely points follow pattern
Outliers: Points far from pattern

Part 4: Probability Basics

Probability Vocabulary

Sample Space (S): All possible outcomes

Event (E): Collection of outcomes

Probability: Likelihood an event occurs. When all outcomes are equally likely:

\[P(\text{Event}) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}}\]

Properties:

\(0 \leq P(E) \leq 1\)
\(P(S) = 1\) (something must happen)
\(P(\text{impossible}) = 0\)

Basic Probability Rules

Addition Rule (OR):

For mutually exclusive events: \[P(A \text{ or } B) = P(A) + P(B)\]

For non-mutually exclusive events: \[P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)\]

Multiplication Rule (AND):

For independent events: \[P(A \text{ and } B) = P(A) \times P(B)\]
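A quick check of both rules in Python, using exact fractions. The card and coin examples are illustrative choices, not taken from the course materials.

```python
from fractions import Fraction

# Standard 52-card deck: 13 hearts, 4 kings, 1 king of hearts.
p_heart = Fraction(13, 52)
p_king = Fraction(4, 52)
p_king_of_hearts = Fraction(1, 52)

# Addition rule for events that can overlap: subtract the overlap once.
p_heart_or_king = p_heart + p_king - p_king_of_hearts
print(p_heart_or_king)  # 4/13

# Multiplication rule for independent events: two fair coin flips.
p_heads = Fraction(1, 2)
p_two_heads = p_heads * p_heads
print(p_two_heads)  # 1/4
```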

Independent vs Dependent Events

Independent:

One event doesn’t affect the other
Example: Flipping a coin twice
\(P(B|A) = P(B)\)

Dependent:

One event affects the probability of the other
Example: Drawing cards without replacement
\(P(B|A) \neq P(B)\)
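The card example can be checked numerically. This sketch compares \(P(B|A)\) with \(P(B)\) for drawing aces from a standard deck. The last line multiplies \(P(A)\) by \(P(B|A)\), which is the conditional probability formula rearranged.

```python
from fractions import Fraction

# A = first card is an ace, B = second card is an ace.
p_a = Fraction(4, 52)
p_b = Fraction(4, 52)  # unconditional chance the second card is an ace

# With replacement the deck is restored, so B is unaffected by A.
p_b_given_a_with_replacement = Fraction(4, 52)

# Without replacement one ace is gone and only 51 cards remain.
p_b_given_a_without_replacement = Fraction(3, 51)

print(p_b_given_a_with_replacement == p_b)     # True: independent
print(p_b_given_a_without_replacement == p_b)  # False: dependent

# For dependent events, multiply by the conditional probability instead.
p_both_aces = p_a * p_b_given_a_without_replacement
print(p_both_aces)  # 1/221
```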

Conditional Probability

Probability of A given B has occurred:

\[P(A|B) = \frac{P(A \text{ and } B)}{P(B)}\]

Example: What’s the probability a patient has disease given they tested positive?

\[P(\text{Disease}|\text{Positive test}) = \frac{P(\text{Disease and Positive})}{P(\text{Positive})}\]

Tree Diagrams

Useful for:

Multi-step probability problems
Visualizing all possible outcomes
Organizing conditional probabilities

How to use:

Draw branches for each step
Label probabilities on branches
Multiply along paths for final probabilities
Add paths that lead to same outcome

Two-Way Tables

Also called contingency tables

Organize data by two categorical variables

	Disease	No Disease	Total
Test+	A	B	A + B
Test-	C	D	C + D
Total	A + C	B + D	N

Useful for calculating conditional probabilities
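As an illustration, here is how conditional probabilities come out of the table cells. The counts below are hypothetical; the variable names `a`, `b`, `c`, `d` mirror the cells A, B, C, D in the table above.

```python
# Hypothetical counts laid out like the table above (not real data).
a, b = 90, 50    # Test+ row: disease, no disease
c, d = 10, 850   # Test- row: disease, no disease
n = a + b + c + d

# Condition on the Test+ row: of those who test positive, who has disease?
p_disease_given_pos = a / (a + b)

# Condition on the Disease column: of those with disease, who tests positive?
p_pos_given_disease = a / (a + c)

# Marginal probability of disease uses the grand total.
p_disease = (a + c) / n

print(round(p_disease_given_pos, 3), p_pos_given_disease, p_disease)
```

Note that the two conditional probabilities use the same numerator but different denominators, which is why \(P(A|B)\) and \(P(B|A)\) generally differ.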

Part 5: Medical Testing & Bayes’ Theorem

Diagnostic Testing Terminology

Sensitivity: Probability test is positive given person has disease

\[\text{Sensitivity} = P(\text{Test+}|\text{Disease})\]

Specificity: Probability test is negative given person doesn’t have disease

\[\text{Specificity} = P(\text{Test-}|\text{No Disease})\]

Positive Predictive Value (PPV): Probability of disease given positive test

\[\text{PPV} = P(\text{Disease}|\text{Test+})\]

Negative Predictive Value (NPV): Probability of no disease given negative test

\[\text{NPV} = P(\text{No Disease}|\text{Test-})\]

Bayes’ Theorem

General formula:

\[P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}\]

For medical testing:

\[P(\text{Disease}|\text{Test+}) = \frac{P(\text{Test+}|\text{Disease}) \times P(\text{Disease})}{P(\text{Test+})}\]

Note

PPV depends on disease prevalence! Same test has different PPV in different populations.

Bayes’ Theorem: Expanded Form

When you need to find \(P(\text{Test+})\):

\[P(\text{Disease}|\text{Test+}) = \frac{P(\text{Test+}|\text{Disease}) \times P(\text{Disease})}{P(\text{Test+}|\text{Disease}) \times P(\text{Disease}) + P(\text{Test+}|\text{No Disease}) \times P(\text{No Disease})}\]

This equals:

\[\frac{\text{Sensitivity} \times \text{Prevalence}}{\text{Sensitivity} \times \text{Prevalence} + (1-\text{Specificity}) \times (1-\text{Prevalence})}\]
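The expanded formula translates directly into a short function. This is a sketch with made-up test characteristics (sensitivity 0.90, specificity 0.95); `ppv` is a name chosen for this example.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from the expanded form of Bayes' Theorem."""
    true_pos = sensitivity * prevalence               # P(Test+ and Disease)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(Test+ and No Disease)
    return true_pos / (true_pos + false_pos)

# Same hypothetical test, three different populations.
for prevalence in (0.01, 0.10, 0.30):
    print(prevalence, round(ppv(0.90, 0.95, prevalence), 3))
# 0.01 0.154
# 0.1 0.667
# 0.3 0.885
```

The test itself never changes, yet PPV rises from about 15% to about 89% as the disease becomes more common.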

Part 6: Random Variables & Distributions

Random Variables

Random Variable: Numerical outcome of a random process

Discrete Random Variable:

Countable outcomes
Examples: number of mutations, diseased cells, patients admitted

Continuous Random Variable:

Uncountable outcomes in a range
Examples: blood pressure, height, reaction time

Discrete Random Variable Properties

Probability Distribution:

Lists all possible values and their probabilities
\(\sum P(X = x) = 1\)

Expected Value (Mean):

\[E(X) = \mu = \sum [x \times P(X = x)]\]

Standard Deviation:

\[SD(X) = \sigma = \sqrt{\sum [(x - \mu)^2 \times P(X = x)]}\]
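Both formulas can be applied to a small probability table. The distribution below (number of offspring per nest) is invented for illustration.

```python
from math import isclose, sqrt

# Hypothetical distribution: value -> probability.
dist = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

# A valid probability distribution must sum to 1.
total = sum(dist.values())

# Expected value: weight each value by its probability.
mu = sum(x * p for x, p in dist.items())

# SD: square root of the probability-weighted squared deviations.
sigma = sqrt(sum((x - mu) ** 2 * p for x, p in dist.items()))

print(isclose(total, 1.0), round(mu, 2), round(sigma, 2))  # True 1.7 0.9
```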

Binomial Distribution

When to use:

Fixed number of trials (n)
Two possible outcomes (success/failure)
Constant probability of success (p)
Independent trials

Examples:

Number of patients who recover out of 20
Number of mutations in 100 DNA sequences
Number of positive tests in 50 samples

Binomial Distribution Formulas

Probability of exactly k successes:

\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]

where \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\)

Mean and Standard Deviation:

\[\mu = np\]

\[\sigma = \sqrt{np(1-p)}\]
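A minimal Python version of these formulas, assuming a hypothetical setting of 20 patients who each recover independently with probability 0.3. The function name `binomial_pmf` is chosen for this example.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.3  # hypothetical values

p_exactly_6 = binomial_pmf(6, n, p)

# "At most 2" means adding P(X = 0), P(X = 1), and P(X = 2).
p_at_most_2 = sum(binomial_pmf(k, n, p) for k in range(3))

mu = n * p
sigma = sqrt(n * p * (1 - p))

print(round(p_exactly_6, 4), round(p_at_most_2, 4))
print(round(mu, 2), round(sigma, 2))
```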

Normal Distribution

Characteristics:

Bell-shaped, symmetric curve
Mean = Median = Mode (at center)
Defined by mean (\(\mu\)) and standard deviation (\(\sigma\))

Notation: \(X \sim N(\mu, \sigma)\)

Key feature: Total area under curve = 1

The 68-95-99.7 Rule (Empirical Rule)

For a normal distribution:

About 68% of data within 1 SD of mean (\(\mu \pm \sigma\))
About 95% of data within 2 SD of mean (\(\mu \pm 2\sigma\))
About 99.7% of data within 3 SD of mean (\(\mu \pm 3\sigma\))

Tip

Use this for quick estimates of percentages and unusual values
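The rule's percentages are rounded areas under the normal curve. They can be checked with Python's built-in `statistics.NormalDist`:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
areas = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(k, round(area, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```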

Z-Scores (Standardization)

Z-score: Number of standard deviations from the mean

\[z = \frac{x - \mu}{\sigma}\]

Interpretation:

\(z = 0\): at the mean
\(z = 1\): one SD above mean
\(z = -2\): two SD below mean

Use: Compare values from different distributions
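For example, z-scores let us compare two measurements on different scales. The means and SDs below are invented for illustration and are not clinical reference values.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

# Hypothetical patient results with hypothetical population parameters.
z_glucose = z_score(110, mu=90, sigma=10)
z_cholesterol = z_score(230, mu=200, sigma=30)

print(z_glucose, z_cholesterol)  # 2.0 1.0

# The glucose value is more unusual relative to its own distribution.
more_unusual = "glucose" if abs(z_glucose) > abs(z_cholesterol) else "cholesterol"
print(more_unusual)  # glucose
```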

Standard Normal Distribution

Standard Normal: \(N(0, 1)\)

Mean = 0
Standard deviation = 1

Any normal can be standardized:

If \(X \sim N(\mu, \sigma)\), then \(Z = \frac{X - \mu}{\sigma} \sim N(0,1)\)

Use z-table or technology to find probabilities

Finding Normal Probabilities

Common questions:

What percentage of values fall below x?
What percentage between a and b?
What value corresponds to the kth percentile?

Process:

Sketch the normal curve
Shade area of interest
Standardize (find z-scores)
Use z-table or technology
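If "technology" means Python, the three common questions can be answered as sketched below. The blood pressure distribution (mean 120, SD 15) is a hypothetical example.

```python
from statistics import NormalDist

# Hypothetical: systolic blood pressure ~ N(120, 15).
bp = NormalDist(mu=120, sigma=15)

# Percentage below x: area to the left of 135.
below_135 = bp.cdf(135)

# Percentage between a and b: subtract the two left-tail areas.
between_105_135 = bp.cdf(135) - bp.cdf(105)

# Value at the kth percentile: the inverse of the cdf.
p90 = bp.inv_cdf(0.90)

print(round(below_135, 4), round(between_105_135, 4), round(p90, 1))
```

Here 135 sits one SD above the mean, so the first two answers should agree with the z-table values for \(z = 1\).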

Assessing Normality

How to check if data is approximately normal:

Histogram: Should be roughly bell-shaped and symmetric
Normal probability plot (Q-Q plot): Points should fall roughly on straight line
68-95-99.7 rule: Check if actual percentages match expected

Warning

Many statistical methods assume normality, so always check!

Normal Approximation to Binomial

When appropriate:

\(np \geq 10\) AND
\(n(1-p) \geq 10\)

How to use:

Approximate \(X \sim \text{Binomial}(n, p)\) with \(X \sim N(\mu, \sigma)\) where:

\(\mu = np\)
\(\sigma = \sqrt{np(1-p)}\)

Note

Apply a continuity correction for a better approximation: for example, approximate \(P(X \leq k)\) by the normal probability below \(k + 0.5\)
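A sketch comparing the exact binomial probability with the normal approximation, with and without a continuity correction. The values n = 100, p = 0.3, and the cutoff 25 are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.3  # hypothetical values

# Check the conditions before approximating.
conditions_ok = n * p >= 10 and n * (1 - p) >= 10

# Exact P(X <= 25): add the binomial probabilities for k = 0, ..., 25.
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(26))

approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
without_correction = approx.cdf(25)
with_correction = approx.cdf(25.5)  # extend the cutoff by 0.5

print(conditions_ok)
print(round(exact, 4), round(without_correction, 4), round(with_correction, 4))
```

In this example the corrected value lands much closer to the exact probability than the uncorrected one.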

Practice Problems

Practice Problem 1: Study Design

A researcher wants to know if a new drug reduces blood pressure. She recruits 100 volunteers and lets them choose whether to take the drug or placebo.

Questions:

Is this an observational study or experiment?
Can we conclude the drug causes changes in blood pressure? Why or why not?
What confounding variables might be present?

Practice Problem 2: Probability

In a population, 2% have a disease. A test has:

Sensitivity = 95%
Specificity = 90%

Questions:

If someone tests positive, what’s the probability they have the disease?
Draw a tree diagram for this scenario
Why is the PPV lower than you might expect?

Practice Problem 3: Normal Distribution

Adult female heights are normally distributed with mean 64 inches and SD 2.5 inches.

Questions:

What percentage of women are between 61.5 and 66.5 inches?
How tall must a woman be to be in the tallest 2.5%?
If a woman is 70 inches tall, what is her z-score?

Common Mistakes to Avoid

Confusing correlation with causation in observational studies
Mixing up conditional probabilities: \(P(A|B) \neq P(B|A)\)
Forgetting assumptions for binomial or normal approximation
Using mean/SD for skewed data (use median/IQR instead)
Misinterpreting p-values in probability (wait, we haven’t covered this yet!)
Not checking if events are independent before multiplying probabilities
Forgetting that sensitivity/specificity ≠ PPV/NPV

Exam Preparation Tips

Before the exam:

Review all homework problems
Practice problems from textbook
Redo class examples
Make sure you understand why, not just how
Create formula sheet (if allowed)

During the exam:

Read questions carefully
Show your work
Check units and context
Sketch distributions when helpful
Leave time to review answers

Questions?

Let’s work through any topics you’d like to review!

Good luck on the midterm! 🍀

Remember: You’ve been working with these concepts all quarter. Trust your preparation!