Hypothesis Testing

Making Decisions with Data

STAT 7

Today’s Goals

  • Understand the components of hypothesis testing
  • Calculate and interpret p-values
  • Distinguish between Type I and Type II errors
  • Recognize the difference between statistical significance and practical significance

Mercury in Fish

The EPA safe limit for mercury in fish is 0.3 ppm (parts per million).

An environmental scientist randomly samples 40 fish from a local lake and finds:

  • Sample mean: \(\bar{x}\) = 0.38 ppm
  • Sample standard deviation: s = 0.12 ppm

Key Question: Is there sufficient evidence that the average mercury level in this lake exceeds the safe limit?

Think-Pair-Share #1

Scenario: Returning to our mercury example (\(\bar{x}\) = 0.38 ppm, safe limit = 0.3 ppm).

Questions to discuss (2 minutes):

  1. The sample mean is higher than 0.3. Does this automatically mean the population mean is above the safe limit?
  2. Could we get \(\bar{x}\) = 0.38 just by chance, even if the true average were 0.3 or below?
  3. How would you decide if this difference is “real” or just due to sampling variability?

Quick Report: What makes this a challenging decision? ✋

The Logic of Hypothesis Testing

We need a systematic way to decide: Is our sample result evidence of a real effect, or could it have happened by chance?

Hypothesis testing provides a framework:

  1. Start with a skeptical assumption (null hypothesis)
  2. Calculate how likely our data would be if that assumption were true
  3. If very unlikely, reject the assumption

Step 1: State the Hypotheses

Null Hypothesis (H₀)

  • The “nothing special” claim
  • Assumes no effect/difference
  • What we test against

Alternative Hypothesis (Hₐ)

  • The research claim
  • What we’re trying to find evidence for
  • Can be one- or two-sided

Mercury example:

  • H₀: μ = 0.3 ppm (at the safe limit)
  • Hₐ: μ > 0.3 ppm (above the safe limit)

Think-Pair-Share #2

Scenario: A drug company claims their new medication reduces the average recovery time, which is currently 10 days. You want to test whether this claim is true.

You give the drug to 50 patients and find an average recovery time of 8.5 days.

Questions to discuss (2 minutes):

  1. What should H₀ be? What should Hₐ be?
  2. Why do we start by assuming H₀ is true (μ = 10 days)?
  3. Is this a one-sided or two-sided alternative? Why?

Quick Report: What are your hypotheses, and what makes Hₐ one-sided vs two-sided? ✋

Step 2: Calculate the Test Statistic

The test statistic measures how far our sample result is from what H₀ predicts, in standard error units.

For a mean:

\[t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}\]

Mercury example:

\[t = \frac{0.38 - 0.30}{0.12/\sqrt{40}} = \frac{0.08}{0.019} = 4.21\]

Interpretation: Our sample mean is 4.21 standard errors above the hypothesized value.
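The calculation above can be sketched in Python (a sketch of ours, not part of the slides; the helper name is made up):

```python
from math import sqrt

def t_statistic(xbar, mu0, s, n):
    """One-sample t statistic: how many standard errors xbar is from mu0."""
    se = s / sqrt(n)               # standard error of the mean
    return (xbar - mu0) / se

# Mercury example: xbar = 0.38, mu0 = 0.30, s = 0.12, n = 40
t = t_statistic(0.38, 0.30, 0.12, 40)
print(round(t, 2))  # → 4.22 (the slide's 4.21 comes from rounding the SE to 0.019 first)
```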

Step 3: Find the P-value

The p-value is the probability of getting a test statistic as extreme as ours (or more extreme) if H₀ were true.

Mercury example (t = 4.21):

  • P-value ≈ 0.0001

Interpretation: If the true mean were 0.3 ppm, there’s only a 0.01% chance we’d observe a sample mean as high as 0.38 ppm (or higher).
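One way to see what a p-value means is to simulate a world where H₀ is exactly true and count how often a result as extreme as ours shows up. A sketch of ours (it assumes the population is normal with σ = 0.12; this is not how the slide's p-value was computed, just an illustration):

```python
import random
from math import sqrt
from statistics import mean, stdev

# Simulate many samples of n = 40 fish from a lake where H0 holds
# (mu = 0.3; sigma = 0.12 is an assumption), and count how often the
# t statistic is at least as extreme as the one we observed (4.21).
random.seed(1)
observed_t = 4.21
n, mu0, sigma = 40, 0.30, 0.12

sims = 50_000
count = 0
for _ in range(sims):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    if t >= observed_t:
        count += 1

p_est = count / sims
print(p_est)  # a very small fraction, in line with p ≈ 0.0001
```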

Understanding P-values

  • Small p-value (typically < 0.05): Strong evidence against H₀
    • Our data would be very unlikely if H₀ were true
    • We reject H₀
  • Large p-value (≥ 0.05): Weak evidence against H₀
    • Our data could easily happen by chance if H₀ were true
    • We fail to reject H₀

Important: We never “accept” or “prove” H₀! We only fail to find evidence against it.

Think-Pair-Share #3

Scenario A: You test if a coin is fair. After 100 flips, you get 58 heads. The p-value is 0.08.

Scenario B: You test if a new fertilizer increases plant growth. The p-value is 0.03.

Questions to discuss (3 minutes):

  1. In Scenario A, would you reject H₀ (coin is fair)? Why or why not?
  2. In Scenario B, would you reject H₀ (no effect)? Why or why not?
  3. What does p = 0.08 mean in plain language for the coin?
  4. What does p = 0.03 mean in plain language for the fertilizer?

Quick Report: How do you interpret a p-value? ✋

Step 4: Make a Decision

Using a significance level α (commonly α = 0.05):

  • If p-value < α: Reject H₀ (statistically significant)
  • If p-value ≥ α: Fail to reject H₀ (not statistically significant)

Mercury example:

  • p-value (0.0001) < α (0.05)
  • Decision: Reject H₀
  • Conclusion: There is strong evidence that the average mercury level exceeds the safe limit of 0.3 ppm
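The decision rule is mechanical enough to write down directly (a sketch; the function name is ours):

```python
def decide(p_value, alpha=0.05):
    """Compare the p-value to the significance level alpha."""
    if p_value < alpha:
        return "Reject H0 (statistically significant)"
    return "Fail to reject H0 (not statistically significant)"

print(decide(0.0001))  # mercury example → reject
print(decide(0.08))    # coin example from Think-Pair-Share #3 → fail to reject
```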

Type I and Type II Errors

No decision procedure is perfect. Two types of errors can occur:

                        H₀ is actually TRUE    H₀ is actually FALSE
  Reject H₀             Type I Error (α)       ✓ Correct
  Fail to reject H₀     ✓ Correct              Type II Error (β)

Understanding the Errors

Type I Error (False Positive)

  • Reject H₀ when it’s actually true
  • Probability = α (significance level)
  • “Crying wolf”

Type II Error (False Negative)

  • Fail to reject H₀ when it’s actually false
  • Probability = β
  • “Missing the wolf”

Mercury example:

  • Type I: Conclude mercury exceeds limit when it doesn’t (close the lake unnecessarily)
  • Type II: Conclude mercury is safe when it actually exceeds limit (dangerous!)
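A claim worth checking: if H₀ is true and we test at α = 0.05, we should commit a Type I error about 5% of the time. A simulation sketch of ours (normal population with σ = 0.12 assumed; 1.685 is the one-sided t critical value for df = 39):

```python
import random
from math import sqrt
from statistics import mean, stdev

# Draw samples from a lake where mu = 0.3 exactly (H0 true) and count
# how often the test wrongly rejects H0 at the 5% level.
random.seed(2)
n, mu0, sigma = 40, 0.30, 0.12
critical_t = 1.685   # one-sided, alpha = 0.05, df = 39

sims = 20_000
rejections = 0
for _ in range(sims):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    if t > critical_t:
        rejections += 1

rate = rejections / sims
print(rate)  # close to alpha = 0.05, as the table predicts
```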

Think-Pair-Share #4

Scenario: A pharmaceutical company tests a new drug for side effects.

  • H₀: The drug has no side effects
  • Hₐ: The drug has side effects

Questions to discuss (3 minutes):

  1. What would a Type I error mean in this context? What are the consequences?
  2. What would a Type II error mean? What are the consequences?
  3. Which error seems more serious here, and why?
  4. If you could only reduce one type of error, which would you choose?

Quick Report: In this medical context, which error is more concerning? ✋

Statistical Significance vs. Practical Significance

Statistical significance (p < 0.05) tells us an effect is unlikely to be due to chance.

Practical significance tells us if an effect is large enough to matter in the real world.

These are different questions!

Example: Statistical but Not Practical

Study: New diet reduces average weight by 0.5 kg in 1000 participants.

  • Statistical significance: p = 0.001 (highly significant!)
  • Practical significance: Is 0.5 kg meaningful for weight loss?
    • Probably not for most people
    • Could be measurement error
    • Not worth the cost/effort of the diet

Example: Practical but Not Statistical

Study: New teaching method increases test scores by 15 points (out of 100) in a small class of 15 students.

  • Statistical significance: p = 0.08 (not significant)
  • Practical significance: 15 points is a huge improvement!
    • Small sample size led to large p-value
    • Effect size suggests it’s worth investigating further
    • Need larger study to confirm

Always Consider Both!

  1. Statistical significance: Does the data provide evidence of an effect?
  2. Practical significance: Is the effect large enough to matter?
  3. Context matters: A small effect might be important in medicine but trivial in other fields
  4. Sample size matters: Large studies find statistical significance easily; small studies might miss important effects
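The sample-size point can be made numerically. A sketch with made-up numbers (a fixed 0.5 kg weight-loss effect, an assumed s = 5 kg, and a normal approximation to the t test): the identical effect is far from significant at n = 25 but highly significant at n = 2500.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p(xbar, mu0, s, n):
    """One-sided p-value for a mean, using the normal approximation."""
    t = (xbar - mu0) / (s / sqrt(n))
    return 1 - NormalDist().cdf(t)

# Same 0.5 kg effect, growing sample size: the p-value shrinks
for n in (25, 250, 2500):
    print(n, round(one_sided_p(0.5, 0.0, 5.0, n), 4))
```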

Complete Example: Mercury Testing

Setup: EPA limit is 0.3 ppm. Sample of 40 fish: \(\bar{x}\) = 0.38, s = 0.12

Hypotheses:

  • H₀: μ = 0.3 ppm
  • Hₐ: μ > 0.3 ppm

Test statistic: t = 4.21

P-value: 0.0001

Decision: Reject H₀ (p < 0.05)

Conclusion: Strong evidence that mercury levels exceed the safe limit

Practical significance: 0.38 ppm vs. 0.3 ppm is a 27% increase, likely important for public health!

The Hypothesis Testing Process

  1. State hypotheses: H₀ and Hₐ
  2. Choose significance level: Usually α = 0.05
  3. Check conditions: Random sample, appropriate sample size
  4. Calculate test statistic: How many standard errors from H₀?
  5. Find p-value: Probability of result if H₀ true
  6. Make decision: Compare p-value to α
  7. Draw conclusion: In context of the problem
  8. Consider practical significance: Does it matter in real world?

Common Hypothesis Tests

For a population mean (one sample):

  • H₀: μ = μ₀
  • Test statistic: \(t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}\)

For a population proportion:

  • H₀: p = p₀
  • Test statistic: \(z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}\)

Comparing two groups: We’ll cover paired and independent t-tests in DSA6!
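The proportion formula in action, on a hypothetical coin (60 heads in 100 flips is our made-up example; two-sided, normal approximation):

```python
from math import sqrt
from statistics import NormalDist

def z_test_proportion(x, n, p0):
    """One-proportion z test, two-sided alternative."""
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)    # standardize under H0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails
    return z, p_value

z, p = z_test_proportion(60, 100, 0.5)
print(round(z, 2), round(p, 3))  # → 2.0 0.046
```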

Common Misinterpretations

❌ “P-value is the probability that H₀ is true”

✓ P-value is the probability of our data (or more extreme) if H₀ is true

❌ “Failing to reject H₀ proves H₀ is true”

✓ We simply don’t have enough evidence to reject it

❌ “Statistical significance means the result is important”

✓ Must also consider practical significance and context

Key Takeaways

  1. Hypothesis testing helps us decide if sample results reflect real effects
  2. P-values measure how surprising our data would be if H₀ were true
  3. Type I error: Rejecting true H₀ (false positive)
  4. Type II error: Failing to reject false H₀ (false negative)
  5. Statistical ≠ Practical: Always consider both types of significance
  6. Context matters: Interpret results in light of the real-world problem

Questions?

Remember: Statistical inference is about making informed decisions under uncertainty. The tools we’ve learned help us quantify that uncertainty!