STAT 7 Midterm Review

Statistical Methods for Biological, Environmental, and Health Sciences

Midterm Logistics

When: Thursday 5:20 pm, usual classroom

Format:

  • 15 Multiple Choice Questions (30 points): These test your conceptual understanding and ability to apply concepts to biological, environmental, and health science contexts.
  • 2 Free Response Questions (20 points): These require you to show your work, explain your reasoning, and interpret results in context.
  • Bring calculator, pen (or pencil) and a photo ID

What’s covered: Everything we have discussed so far.

Midterm Tips

  • Review homework and class examples
  • Make sure you have a conceptual understanding of the material
  • Complete the practice problems (HW4): https://malfaro2.github.io/STAT7/2026/assignments/MidtermPractice.html. They are similar to exam questions and will help you practice, though they won’t be identical to what’s on the exam.
  • Attend office hours if you have questions

What We’ll REVIEW Today

  1. Foundations: Statistics basics, variables, study design
  2. Descriptive Statistics: Summaries and distributions
  3. Visualization: Graphs and interpretation
  4. Probability Basics: Rules and calculations
  5. Conditional Probability: Sensitivity, specificity, Bayes’ Theorem
  6. Random Variables & Distributions: Binomial and Normal

Questions welcome throughout!

Part 1: Foundations

What is Statistics?

Statistics is the science of collecting, analyzing, and interpreting data to answer questions and make decisions.

Why it matters in biological & health sciences:

  • Design experiments and clinical trials
  • Analyze patient outcomes
  • Understand disease patterns
  • Evaluate treatment effectiveness
  • Make evidence-based medical decisions

Types of Variables

Numerical (Quantitative):

  • Continuous: Can take any value in a range
    • Examples: height, weight, blood pressure
  • Discrete: Countable values
    • Examples: number of cells, hospital visits, offspring

Categorical (Qualitative):

  • Nominal: Categories with no natural order
    • Examples: blood type, species, treatment group
  • Ordinal: Categories with natural order
    • Examples: disease stage (I, II, III), pain level (mild, moderate, severe)

Study Design: Observational vs Experimental

Observational Study:

  • Observe and measure without intervention
  • Cannot establish causation
  • Example: Survey smokers vs non-smokers on lung health

Experiment:

  • Researchers impose treatments
  • Can establish cause-and-effect
  • Example: Randomly assign patients to drug vs placebo

Important

Only well-designed experiments allow causal conclusions!

Key Features of Well-Designed Studies

  1. Random Assignment: Participants randomly assigned to treatment groups
    • Controls for confounding variables
  2. Control Group: Comparison group (often receives placebo)
  3. Blinding:
    • Single-blind: Participants don’t know their group
    • Double-blind: Neither participants nor researchers know
  4. Replication: Large sample size for reliable results

Common Biases

Convenience Sampling:

  • Sampling whoever is easiest to reach
  • May not represent population

Voluntary Response Bias:

  • People choose to participate
  • Often those with strong opinions

Confounding:

  • A third variable is associated with both the explanatory variable (e.g., treatment) and the outcome
  • Makes it unclear what caused the result

Part 2: Descriptive Statistics

Measures of Center

Mean (Average):

  • Sum of values divided by count
  • Sensitive to outliers

Median:

  • Middle value when data ordered
  • Resistant to outliers

Mode:

  • Most frequent value
  • Can have multiple modes or no mode

Tip

Use median for skewed data, mean for symmetric data

Measures of Spread

Range:

  • Maximum - Minimum
  • Very sensitive to outliers

Interquartile Range (IQR):

  • IQR = Q3 - Q1
  • Middle 50% of data
  • Resistant to outliers

Standard Deviation (SD):

  • Roughly the typical distance of values from the mean
  • Sensitive to outliers
  • Used with mean
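A minimal sketch of how one extreme value moves each summary. The hospital-stay data are made up for illustration, and `statistics.quantiles` uses one of several quartile conventions, so hand-calculated quartiles may differ slightly.

```python
import statistics

# Hypothetical data: hospital stays in days; the second list adds one extreme value.
stays = [2, 3, 3, 4, 5, 6, 7]
stays_with_outlier = stays + [40]

def summarize(values):
    """Return mean, median, sample SD, and IQR for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }

before = summarize(stays)
after = summarize(stays_with_outlier)
print("without outlier:", before)
print("with outlier:   ", after)
```

The mean and SD shift substantially once the extreme value is added, while the median and IQR barely change.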

Distribution Shape

Symmetric:

  • Mirror image around center
  • Mean ≈ Median

Right-Skewed (Positive skew):

  • Long tail to the right
  • Mean > Median

Left-Skewed (Negative skew):

  • Long tail to the left
  • Mean < Median

Describing a Distribution

Always include three components:

  1. Shape: Symmetric, skewed, bimodal?
  2. Center: What’s the typical value? (mean or median)
  3. Spread: How much variability? (SD or IQR)

Also mention:

  • Outliers (unusual values)
  • Context (what the data represent)

Part 3: Data Visualization

Choosing the Right Graph

Variable Type(s)        | Graph Type
One categorical         | Bar chart
One numerical           | Histogram or box plot
Two categorical         | Segmented bar chart or mosaic plot
Numerical + Categorical | Side-by-side box plots
Two numerical           | Scatterplot

Histograms

Purpose: Show distribution of numerical variable

Key features:

  • Bars touch (continuous data)
  • X-axis: variable values
  • Y-axis: frequency or count
  • Can see shape, center, spread, outliers

Box Plots

Shows five-number summary:

  • Minimum
  • Q1 (25th percentile)
  • Median (Q2, 50th percentile)
  • Q3 (75th percentile)
  • Maximum

Box: IQR (middle 50%)

Whiskers: Extend to the smallest and largest values that are not outliers

Outliers: Shown as individual points

Outlier Detection: IQR Method

Step 1: Calculate IQR = Q3 - Q1

Step 2: Calculate fences:

  • Lower fence = Q1 - 1.5 × IQR
  • Upper fence = Q3 + 1.5 × IQR

Step 3: Values outside fences are outliers

Impact: Outliers can greatly affect the mean and SD, but have little effect on the median and IQR
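The three steps above can be sketched as follows. The cholesterol values are hypothetical, and the quartiles come from `statistics.quantiles`, whose convention may differ slightly from hand calculation.

```python
import statistics

def iqr_fences(values):
    """Steps 1 and 2: compute the IQR, then the lower and upper fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values):
    """Step 3: values outside the fences are flagged as outliers."""
    lower, upper = iqr_fences(values)
    return [v for v in values if v < lower or v > upper]

# Hypothetical cholesterol readings (mg/dL)
cholesterol = [150, 160, 165, 170, 175, 180, 185, 190, 310]

lower, upper = iqr_fences(cholesterol)
print("fences:", lower, upper)
print("outliers:", find_outliers(cholesterol))
```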

Scatterplots

Purpose: Show relationship between two numerical variables

Look for:

  • Direction: Positive, negative, or no association
  • Form: Linear, curved, clusters
  • Strength: How closely points follow pattern
  • Outliers: Points far from pattern

Part 4: Probability Basics

Probability Vocabulary

Sample Space (S): All possible outcomes

Event (E): Collection of outcomes

Probability: Likelihood an event occurs

When all outcomes are equally likely:

\[P(\text{Event}) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}}\]

Properties:

  • \(0 \leq P(E) \leq 1\)
  • \(P(S) = 1\) (something must happen)
  • \(P(\text{impossible}) = 0\)

Basic Probability Rules

Addition Rule (OR):

For mutually exclusive events: \[P(A \text{ or } B) = P(A) + P(B)\]

For non-mutually exclusive events: \[P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)\]

Multiplication Rule (AND):

For independent events: \[P(A \text{ and } B) = P(A) \times P(B)\]

For any events (general rule): \[P(A \text{ and } B) = P(A) \times P(B|A)\]
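A small check of the addition and multiplication rules on one roll of a fair die. The events A and B are chosen here only for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}  # roll is even
B = {5, 6}     # roll is greater than 4

# Addition rule for events that can overlap
p_or_rule = prob(A) + prob(B) - prob(A & B)
# Same probability found by counting outcomes in "A or B" directly
p_or_direct = prob(A | B)

# Independence check: does P(A and B) equal P(A) x P(B)?
independent = prob(A & B) == prob(A) * prob(B)

print(p_or_rule, p_or_direct, independent)
```

Here A and B overlap (both contain 6), so they are not mutually exclusive, yet they happen to be independent. The two ideas are different.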

Independent vs Dependent Events

Independent:

  • One event doesn’t affect the other
  • Example: Flipping a coin twice
  • \(P(B|A) = P(B)\)

Dependent:

  • One event affects the probability of the other
  • Example: Drawing cards without replacement
  • \(P(B|A) \neq P(B)\)

Conditional Probability

Probability of A given B has occurred:

\[P(A|B) = \frac{P(A \text{ and } B)}{P(B)}\]

Example: What’s the probability a patient has disease given they tested positive?

\[P(\text{Disease}|\text{Positive test}) = \frac{P(\text{Disease and Positive})}{P(\text{Positive})}\]

Tree Diagrams

Useful for:

  • Multi-step probability problems
  • Visualizing all possible outcomes
  • Organizing conditional probabilities

How to use:

  1. Draw branches for each step
  2. Label probabilities on branches
  3. Multiply along paths for final probabilities
  4. Add paths that lead to same outcome
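One way to sketch the tree steps, assuming two cards drawn without replacement from a standard 52-card deck (an example chosen here for illustration): multiply along each path, then add the paths with exactly one ace.

```python
from fractions import Fraction

# Step 1 branches: first card is an ace or not (4 aces in 52 cards).
p_ace_first = Fraction(4, 52)
p_other_first = Fraction(48, 52)

# Step 2 branches are conditional on step 1 because the card is not replaced.
p_ace_given_ace = Fraction(3, 51)
p_other_given_ace = Fraction(48, 51)
p_ace_given_other = Fraction(4, 51)
p_other_given_other = Fraction(47, 51)

# Multiply along each path of the tree.
paths = {
    ("ace", "ace"): p_ace_first * p_ace_given_ace,
    ("ace", "other"): p_ace_first * p_other_given_ace,
    ("other", "ace"): p_other_first * p_ace_given_other,
    ("other", "other"): p_other_first * p_other_given_other,
}

# Add the paths that lead to the same outcome: exactly one ace.
p_exactly_one_ace = paths[("ace", "other")] + paths[("other", "ace")]

print(p_exactly_one_ace)
print(sum(paths.values()))  # all paths together cover the sample space
```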

Two-Way Tables

Also called contingency tables

Organize data by two categorical variables

      | Disease | No Disease | Total
Test+ | A       | B          | A + B
Test- | C       | D          | C + D
Total | A + C   | B + D      | N

Useful for calculating conditional probabilities

Part 5: Medical Testing & Bayes’ Theorem

Diagnostic Testing Terminology

Sensitivity: Probability test is positive given person has disease

\[\text{Sensitivity} = P(\text{Test+}|\text{Disease})\]

Specificity: Probability test is negative given person doesn’t have disease

\[\text{Specificity} = P(\text{Test-}|\text{No Disease})\]

Positive Predictive Value (PPV): Probability of disease given positive test

\[\text{PPV} = P(\text{Disease}|\text{Test+})\]

Negative Predictive Value (NPV): Probability of no disease given negative test

\[\text{NPV} = P(\text{No Disease}|\text{Test-})\]
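A sketch of the four measures computed from two-way table counts, using the A, B, C, D layout from the Two-Way Tables slide. The counts are hypothetical.

```python
from fractions import Fraction

# Hypothetical counts:
# A = Test+ and Disease, B = Test+ and No Disease,
# C = Test- and Disease, D = Test- and No Disease
A, B, C, D = 45, 30, 5, 920

sensitivity = Fraction(A, A + C)  # P(Test+ | Disease): condition on the Disease column
specificity = Fraction(D, B + D)  # P(Test- | No Disease): condition on the No Disease column
ppv = Fraction(A, A + B)          # P(Disease | Test+): condition on the Test+ row
npv = Fraction(D, C + D)          # P(No Disease | Test-): condition on the Test- row

print(sensitivity, specificity, ppv, npv)
```

Sensitivity and specificity divide by column totals; PPV and NPV divide by row totals. That is why they generally give different numbers.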

Bayes’ Theorem

General formula:

\[P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}\]

For medical testing:

\[P(\text{Disease}|\text{Test+}) = \frac{P(\text{Test+}|\text{Disease}) \times P(\text{Disease})}{P(\text{Test+})}\]

Note

PPV depends on disease prevalence! Same test has different PPV in different populations.

Bayes’ Theorem: Expanded Form

When you need to find \(P(\text{Test+})\):

\[P(\text{Disease}|\text{Test+}) = \frac{P(\text{Test+}|\text{Disease}) \times P(\text{Disease})}{P(\text{Test+}|\text{Disease}) \times P(\text{Disease}) + P(\text{Test+}|\text{No Disease}) \times P(\text{No Disease})}\]

This equals:

\[\frac{\text{Sensitivity} \times \text{Prevalence}}{\text{Sensitivity} \times \text{Prevalence} + (1-\text{Specificity}) \times (1-\text{Prevalence})}\]
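A minimal sketch of the expanded formula as a function. The sensitivity, specificity, and prevalence values are hypothetical and differ from the practice problem, so you can still work that one by hand.

```python
def ppv(sensitivity, specificity, prevalence):
    """P(Disease | Test+) from the expanded form of Bayes' Theorem."""
    true_positive = sensitivity * prevalence               # P(Test+ and Disease)
    false_positive = (1 - specificity) * (1 - prevalence)  # P(Test+ and No Disease)
    return true_positive / (true_positive + false_positive)

# Same hypothetical test (90% sensitive, 95% specific) in three populations
for prevalence in (0.01, 0.10, 0.30):
    print(prevalence, round(ppv(0.90, 0.95, prevalence), 3))
```

Running the loop shows the PPV rising as prevalence rises, even though the test itself is unchanged.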

Part 6: Random Variables & Distributions

Random Variables

Random Variable: Numerical outcome of a random process

Discrete Random Variable:

  • Countable outcomes
  • Examples: number of mutations, diseased cells, patients admitted

Continuous Random Variable:

  • Uncountable outcomes in a range
  • Examples: blood pressure, height, reaction time

Discrete Random Variable Properties

Probability Distribution:

  • Lists all possible values and their probabilities
  • \(\sum P(X = x) = 1\)

Expected Value (Mean):

\[E(X) = \mu = \sum [x \times P(X = x)]\]

Standard Deviation:

\[SD(X) = \sigma = \sqrt{\sum [(x - \mu)^2 \times P(X = x)]}\]
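The two formulas above can be sketched directly. The distribution (number of offspring per nest) is made up for illustration.

```python
import math

# Hypothetical probability distribution: value -> probability
distribution = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}

def expected_value(dist):
    """E(X): sum of each value times its probability."""
    return sum(x * p for x, p in dist.items())

def standard_deviation(dist):
    """SD(X): square root of the probability-weighted squared deviations."""
    mu = expected_value(dist)
    return math.sqrt(sum((x - mu) ** 2 * p for x, p in dist.items()))

total_probability = sum(distribution.values())  # should be 1 for a valid distribution
mu = expected_value(distribution)
sigma = standard_deviation(distribution)
print(total_probability, mu, sigma)
```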

Binomial Distribution

When to use:

  1. Fixed number of trials (n)
  2. Two possible outcomes (success/failure)
  3. Constant probability of success (p)
  4. Independent trials

Examples:

  • Number of patients who recover out of 20
  • Number of sequences with a mutation out of 100 DNA sequences
  • Number of positive tests in 50 samples

Binomial Distribution Formulas

Probability of exactly k successes:

\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]

where \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\)

Mean and Standard Deviation:

\[\mu = np\]

\[\sigma = \sqrt{np(1-p)}\]
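A short sketch of the binomial formulas, assuming a hypothetical setting of n = 20 patients who each recover independently with probability p = 0.3.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.3  # hypothetical values

mean = n * p
sd = math.sqrt(n * p * (1 - p))

p_exactly_6 = binomial_pmf(6, n, p)
# "At most 2" adds the probabilities for k = 0, 1, 2
p_at_most_2 = sum(binomial_pmf(k, n, p) for k in range(3))

print(mean, sd, p_exactly_6, p_at_most_2)
```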

Normal Distribution

Characteristics:

  • Bell-shaped, symmetric curve
  • Mean = Median = Mode (at center)
  • Defined by mean (\(\mu\)) and standard deviation (\(\sigma\))

Notation: \(X \sim N(\mu, \sigma)\)

Key feature: Total area under curve = 1

The 68-95-99.7 Rule (Empirical Rule)

For a normal distribution:

  • About 68% of data within 1 SD of mean (\(\mu \pm \sigma\))
  • About 95% of data within 2 SD of mean (\(\mu \pm 2\sigma\))
  • About 99.7% of data within 3 SD of mean (\(\mu \pm 3\sigma\))

Tip

Use this for quick estimates of percentages and unusual values
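If you want to see where the 68, 95, and 99.7 come from, here is a quick check using Python's `statistics.NormalDist` (any z-table gives the same areas).

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```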

Z-Scores (Standardization)

Z-score: Number of standard deviations from the mean

\[z = \frac{x - \mu}{\sigma}\]

Interpretation:

  • \(z = 0\): at the mean
  • \(z = 1\): one SD above mean
  • \(z = -2\): two SD below mean

Use: Compare values from different distributions
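A sketch of comparing two measurements on different scales. The patient values and the population means and SDs are hypothetical.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Hypothetical patient: which measurement is more unusual?
z_blood_pressure = z_score(135, mu=120, sigma=10)  # systolic BP in mmHg
z_heart_rate = z_score(85, mu=70, sigma=12)        # resting heart rate in bpm

print(z_blood_pressure, z_heart_rate)
```

The larger z-score marks the measurement that is farther above its own mean, even though the raw units differ.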

Standard Normal Distribution

Standard Normal: \(N(0, 1)\)

  • Mean = 0
  • Standard deviation = 1

Any normal can be standardized:

If \(X \sim N(\mu, \sigma)\), then \(Z = \frac{X - \mu}{\sigma} \sim N(0,1)\)

Use z-table or technology to find probabilities

Finding Normal Probabilities

Common questions:

  1. What percentage of values fall below x?
  2. What percentage fall between a and b?
  3. What value corresponds to the kth percentile?

Process:

  1. Sketch the normal curve
  2. Shade area of interest
  3. Standardize (find z-scores)
  4. Use z-table or technology
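The "technology" step could look like this sketch, which answers the three common question types. The birth-weight model (mean 3400 g, SD 500 g) is an assumption made up for illustration.

```python
from statistics import NormalDist

# Hypothetical model: birth weights ~ N(3400 g, 500 g)
weights = NormalDist(mu=3400, sigma=500)

# 1. Proportion of values below x
p_below_3000 = weights.cdf(3000)

# 2. Proportion between a and b: area below b minus area below a
p_between = weights.cdf(4000) - weights.cdf(3000)

# 3. Value at the kth percentile (here, the 90th)
percentile_90 = weights.inv_cdf(0.90)

print(round(p_below_3000, 4), round(p_between, 4), round(percentile_90, 1))
```

`cdf` gives the area to the left of a value, which matches how most z-tables are laid out; `inv_cdf` works backward from an area to a value.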

Assessing Normality

How to check if data is approximately normal:

  1. Histogram: Should be roughly bell-shaped and symmetric
  2. Normal probability plot (Q-Q plot): Points should fall roughly on straight line
  3. 68-95-99.7 rule: Check if actual percentages match expected

Warning

Many statistical methods assume normality, so always check!

Normal Approximation to Binomial

When appropriate:

  • \(np \geq 10\) AND
  • \(n(1-p) \geq 10\)

How to use:

Approximate \(X \sim \text{Binomial}(n, p)\) with a normal distribution \(N(\mu, \sigma)\) where:

  • \(\mu = np\)
  • \(\sigma = \sqrt{np(1-p)}\)

Note

Apply a continuity correction for a better approximation: for example, approximate \(P(X \leq k)\) by the normal probability below \(k + 0.5\)
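A sketch comparing the exact binomial probability with the normal approximation, with and without the continuity correction. The values n = 50, p = 0.4, and k = 22 are hypothetical.

```python
import math
from statistics import NormalDist

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

n, p, k = 50, 0.4, 22  # hypothetical values

# Check the conditions before using the approximation
conditions_met = n * p >= 10 and n * (1 - p) >= 10

approx = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))

exact = binomial_cdf(k, n, p)
without_correction = approx.cdf(k)
with_correction = approx.cdf(k + 0.5)  # continuity correction for P(X <= k)

print(conditions_met, exact, without_correction, with_correction)
```

In this example the corrected value lands much closer to the exact probability than the uncorrected one.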

Practice Problems

Practice Problem 1: Study Design

A researcher wants to know if a new drug reduces blood pressure. She recruits 100 volunteers and lets them choose whether to take the drug or placebo.

Questions:

  1. Is this an observational study or experiment?
  2. Can we conclude the drug causes changes in blood pressure? Why or why not?
  3. What confounding variables might be present?

Practice Problem 2: Probability

In a population, 2% have a disease. A test has:

  • Sensitivity = 95%
  • Specificity = 90%

Questions:

  1. If someone tests positive, what’s the probability they have the disease?
  2. Draw a tree diagram for this scenario
  3. Why is the PPV lower than you might expect?

Practice Problem 3: Normal Distribution

Adult female heights are normally distributed with mean 64 inches and SD 2.5 inches.

Questions:

  1. What percentage of women are between 61.5 and 66.5 inches?
  2. How tall must a woman be to be in the tallest 2.5%?
  3. If a woman is 70 inches tall, what is her z-score?

Common Mistakes to Avoid

  1. Confusing correlation with causation in observational studies

  2. Mixing up conditional probabilities: \(P(A|B) \neq P(B|A)\)

  3. Forgetting assumptions for binomial or normal approximation

  4. Using mean/SD for skewed data (use median/IQR instead)

  5. Misinterpreting p-values in probability (wait, we haven’t covered this yet!)

  6. Not checking if events are independent before multiplying probabilities

  7. Forgetting that sensitivity/specificity ≠ PPV/NPV

Exam Preparation Tips

Before the exam:

  • Review all homework problems
  • Practice problems from textbook
  • Redo class examples
  • Make sure you understand why, not just how
  • Create formula sheet (if allowed)

During the exam:

  • Read questions carefully
  • Show your work
  • Check units and context
  • Sketch distributions when helpful
  • Leave time to review answers

Questions?

Let’s work through any topics you’d like to review!

Good luck on the midterm! 🍀

Remember: You’ve been working with these concepts all quarter. Trust your preparation!