STAT 17: Error Types, Effect Sizes & Comparing Two Groups

Prof. Marcela Alfaro Cordoba

Statistics - UCSC

20 Nov 2025

What We’ll Accomplish Today

Quick recap: Last class we covered hypothesis testing — the four-step process, p-values, and confidence interval connections.

Today:

  • Type I and Type II errors (the cost of being wrong)
  • Statistical vs. practical significance
  • Common hypothesis testing mistakes
  • Comparing two independent means (two-sample t-test)
  • Effect sizes (Cohen’s d)
  • Comparing two independent proportions

Type I and Type II Errors

Four possible outcomes in hypothesis testing:


Decision | H₀ True | H₀ False
Reject H₀ | Type I Error (α) | Correct Decision (Power)
Fail to Reject H₀ | Correct Decision | Type II Error (β)


Type I Error (False Positive):

  • Reject H₀ when H₀ is actually true
  • Probability = α (significance level)
  • Example: Approve an ineffective drug

Type II Error (False Negative):

  • Fail to reject H₀ when H₀ is actually false
  • Probability = β
  • Example: Reject an effective drug

Power = 1 − β: Probability of correctly rejecting a false H₀. Higher power is better!

Understanding Errors in Context

Dr. Chen’s drug trial:

  • H₀: Drug reduces BP by ≥ 10 mmHg (effective)
  • H₁: Drug reduces BP by < 10 mmHg (not effective enough)

Type I Error (α):

  • Reject H₀ when true
  • Conclude drug is ineffective when it actually works
  • Consequence: Deny patients an effective treatment
  • Controlled directly by choosing α

Type II Error (β):

  • Fail to reject H₀ when false
  • Conclude drug works when it doesn’t meet the standard
  • Consequence: Approve an ineffective drug
  • Affected by n, effect size, α

The trade-off:

  • Decrease α → increase β (more conservative test)
  • Increase α → decrease β (more liberal test)
  • Increase n → β falls at a fixed α, so a larger sample is the only way to lower both error rates at once
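The trade-off above can be checked numerically. The following is a minimal sketch, assuming a right-tailed one-sample z-test with known σ (a simpler setting than Dr. Chen's trial); the function name `power_one_sided_z` and the values for the effect, σ, and n are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

def power_one_sided_z(effect, sigma, n, alpha):
    """Power of a right-tailed one-sample z-test when the true mean exceeds mu0 by `effect`."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)          # rejection cutoff chosen under H0
    shift = effect / (sigma / n ** 0.5)    # true effect measured in standard-error units
    return 1 - z.cdf(z_crit - shift)

# Hypothetical values: true effect of 2 units, sigma = 10
for alpha in (0.10, 0.05, 0.01):
    for n in (50, 200):
        beta = 1 - power_one_sided_z(2, 10, n, alpha)
        print(f"alpha={alpha:.2f}  n={n:3d}  beta={beta:.3f}  power={1 - beta:.3f}")
```

Reading down the output, β should grow as α shrinks for a given n, and shrink when n goes from 50 to 200 at any given α.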

Choosing α: Context Matters

Type I error rate = α — we choose it directly, before seeing the data.

Life-or-death decision:

  • Use α = 0.01 (very conservative)
  • Only a 1% chance of a Type I error when H₀ is true (for H₀: no effect, that means approving an ineffective treatment)

Preliminary research:

  • Use α = 0.10 (more liberal)
  • 10% false-positive rate acceptable at this stage

Standard research:

  • Use α = 0.05 (balanced)
  • 5% chance of Type I error
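To see that the Type I error rate really is the α we choose, one option is a small simulation in which H₀ is true by construction. This is only a sketch: it assumes a two-sided one-sample z-test with known σ = 1, and the function name, sample size, trial count, and seed are all hypothetical choices.

```python
import random
from statistics import NormalDist

def type_i_error_rate(alpha, n=30, trials=5000, seed=17):
    """Fraction of simulated samples that reject H0: mu = 0 when H0 is actually true."""
    rng = random.Random(seed)
    z = NormalDist()
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(n)]   # data generated under H0
        z_stat = (sum(sample) / n) / (1 / n ** 0.5)    # sigma = 1 is treated as known
        p_value = 2 * (1 - z.cdf(abs(z_stat)))         # two-sided p-value
        if p_value < alpha:
            rejections += 1
    return rejections / trials

rates = {alpha: type_i_error_rate(alpha) for alpha in (0.01, 0.05, 0.10)}
for alpha, rate in rates.items():
    print(f"alpha={alpha:.2f}  simulated Type I error rate={rate:.3f}")
```

Each simulated rate should land close to its α, up to simulation noise.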

Controlling β:

  • β depends on α, n, effect size, and σ
  • Larger n → smaller β → higher power
  • Researchers typically aim for power ≥ 0.80 (β ≤ 0.20)
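The power target above translates into a sample size. A rough sketch, assuming a right-tailed one-sample z-test with known σ, uses the standard planning formula n = ((z_α + z_β)·σ / effect)². The formula is not from the slides, and `required_n` and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def required_n(effect, sigma, alpha=0.05, power=0.80):
    """Smallest n giving at least `power` for a right-tailed one-sample z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)   # cutoff for the chosen significance level
    z_beta = z.inv_cdf(power)        # quantile matching the desired power (1 - beta)
    return math.ceil(((z_alpha + z_beta) * sigma / effect) ** 2)

# Hypothetical planning values: sigma = 10, smallest effect worth detecting = 2
print(required_n(effect=2, sigma=10))
print(required_n(effect=1, sigma=10))   # halving the effect needs roughly four times the sample
```

Smaller effects, larger σ, smaller α, or higher power targets all push the required n upward.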

Statistical vs. Practical Significance

Statistical significance: p-value < α

  • The effect is unlikely due to chance alone
  • Heavily influenced by sample size — large n can flag tiny effects

Practical significance: Effect size matters in the real world

  • The effect is large enough to actually care about
  • Independent of sample size

Example — Large study (n = 10,000):

  • Drug lowers BP by 1 mmHg → p-value < 0.001 (statistically significant!)
  • But 1 mmHg is clinically meaningless (not practically significant)

Example — Small study (n = 20):

  • Drug lowers BP by 15 mmHg → p-value = 0.08 (not statistically significant)
  • But 15 mmHg is very important clinically (practically significant!)

Always report: p-value and effect size together.
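The dependence of the p-value on n can be made concrete. The sketch below assumes a one-sample z-test and a population SD of 15 mmHg; that SD is a made-up value, since the slides do not give one, and `two_sided_p` is a hypothetical helper.

```python
from statistics import NormalDist

def two_sided_p(effect, sigma, n):
    """Two-sided z-test p-value for an observed mean difference `effect` from mu0."""
    z_stat = effect / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))

# Same 1 mmHg effect, assumed sigma = 15 mmHg, increasing sample sizes
for n in (20, 200, 10_000):
    print(f"n={n:6d}  p-value={two_sided_p(1, 15, n):.4g}")
```

The effect stays at 1 mmHg in every row; only the sample size, and therefore the p-value, changes.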

Common Mistakes in Hypothesis Testing

Mistake 1: “Accepting” H₀

  • ✗ “We accept H₀”
  • ✓ “We fail to reject H₀” / “Insufficient evidence against H₀”

Mistake 2: Wrong interpretation of p-value

  • ✗ “p = 0.03 means there’s a 3% chance H₀ is true”
  • ✓ “p = 0.03 means data at least this extreme occur 3% of the time if H₀ is true”

Mistake 3: Changing α after seeing results (“p-hacking”)

Mistake 4: Confusing significance with importance

Mistake 5: Conclusion not in context — always state what the decision means for the problem, not just “reject H₀”

THINK-PAIR-SHARE 1 (7 minutes)

Errors in Context

A factory produces bolts with a specified length of 5 cm. Quality control samples n = 40 bolts and finds x̄ = 5.15 cm, s = 0.4 cm.

  1. Test at α = 0.05 whether mean length differs from 5 cm (all 4 steps).
  2. What type of error could you have made? Describe it in context.
  3. If you had failed to reject H₀, what would the Type II error mean in context?
  4. How would you reduce the probability of a Type II error?
  5. Construct a 95% CI for μ. Does it support your test conclusion?
  6. Would your conclusion change at α = 0.01?

Use Google Sheets! Post on Ed Discussion with your partner’s name.

Share your answers in Poll Everywhere!

What is the test statistic for the bolt length test?

🧘‍♀️ STRETCH BREAK

Time to move! (5 minutes)

  • Stand up and stretch 🤸‍♀️
  • Chat with neighbors about errors 💬
  • Grab some water 💧

Comparing Two Groups

Case Study: Retail A/B Testing

Scenario: A major online retailer is testing two checkout designs:

  • Design A (Control): Traditional multi-page checkout
  • Design B (Treatment): New single-page checkout

Questions we’ll answer today:

  1. Does Design B increase the average purchase amount?
  2. How large is the effect — is it practically meaningful?
  3. Does Design B improve conversion rates?

When Do We Compare Two Means?

  • Testing whether two groups differ on a continuous outcome
  • A/B testing in business contexts
  • Clinical trials comparing treatments
  • Product testing and quality control

Key assumption: Two independent samples from two populations.

This is different from the one-sample tests we’ve been doing — now we have no known μ₀ to test against. We let the data from both groups speak.

The Two-Sample t-Test Framework

Hypotheses:

  • H₀: μ₁ = μ₂   (or equivalently, μ₁ − μ₂ = 0)
  • Hₐ: μ₁ ≠ μ₂   (two-sided)
  • Hₐ: μ₁ > μ₂ or μ₁ < μ₂   (one-sided)

Test statistic (under H₀):

\[t = \frac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)}\]

Two cases for SE — depending on whether we assume equal variances:

Case | Assumption | SE formula
Pooled | σ₁² = σ₂² | uses pooled \(s_p\)
Welch’s | σ₁² ≠ σ₂² | uses \(s_1, s_2\) separately

Standard Error: Two Cases

Case 1: Equal Variances (pooled t-test)

\[SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \quad \text{where } s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}\]

  • Degrees of freedom: df = n₁ + n₂ − 2
  • \(s_p^2\) is a weighted average of the two sample variances (weights n₁ − 1 and n₂ − 1); \(s_p\) is its square root

Case 2: Unequal Variances (Welch’s t-test)

\[SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]

  • df is approximated (Welch-Satterthwaite formula — let Google Sheets handle it)
  • More conservative, safer when in doubt

In practice: When in doubt, use Welch’s. It does not require equal variances, and some software (for example, R’s t.test) uses it by default.
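Both standard errors can be computed side by side. The sketch below follows the two SE formulas above and adds the usual Welch-Satterthwaite approximation for df; the function names and the unequal-group numbers are hypothetical.

```python
import math

def pooled_se(s1, n1, s2, n2):
    """SE and df for the equal-variance (pooled) two-sample t-test."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # pooled variance
    return math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2), n1 + n2 - 2

def welch_se(s1, n1, s2, n2):
    """SE and Welch-Satterthwaite df for the unequal-variance two-sample t-test."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return se, df

# Hypothetical groups with very different spreads and sizes
print("pooled:", pooled_se(5, 100, 20, 15))
print("welch: ", welch_se(5, 100, 20, 15))
```

With equal group sizes the two SEs coincide; they diverge when both the spreads and the sample sizes differ, which is the case where the choice of test matters most.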

Example: Purchase Amounts

Data from the checkout design test:

  • Design A: n₁ = 250, x̄₁ = $87.50, s₁ = $22.30
  • Design B: n₂ = 250, x̄₂ = $92.80, s₂ = $24.10

Test: H₀: μ_A = μ_B  vs  Hₐ: μ_B > μ_A at α = 0.05 (one-sided, assuming equal variances)

Step 1: Pooled standard deviation

\[s_p = \sqrt{\frac{(249)(22.30)^2 + (249)(24.10)^2}{498}} = \sqrt{\frac{123{,}825 + 144{,}622}{498}} = \sqrt{539.05} = 23.22\]

Step 2: Standard error

\[SE = 23.22 \times \sqrt{\frac{1}{250} + \frac{1}{250}} = 23.22 \times 0.0894 = 2.08\]

Example: Purchase Amounts (cont.)

Step 3: Test statistic

\[t = \frac{92.80 - 87.50}{2.08} = \frac{5.30}{2.08} = 2.55\]

Step 4: P-value and decision

df = 498, one-sided test at α = 0.05:

=1 - T.DIST(2.55, 498, TRUE) ≈ 0.0055

p-value = 0.0055 < 0.05 → Reject H₀

Conclusion: There is significant evidence that Design B leads to higher average purchase amounts than Design A.

Practical interpretation: The new checkout design increases average purchases by about $5.30.
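The four steps of this example can be reproduced outside Google Sheets. The sketch below uses only the Python standard library, so instead of T.DIST it approximates the upper tail of the t distribution by integrating the t density with Simpson's rule; `t_pdf` and `t_upper_tail` are hypothetical helper names, and the integration is an approximation rather than an exact routine.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Summary statistics from the checkout example
n1, xbar1, s1 = 250, 87.50, 22.30   # Design A
n2, xbar2, s2 = 250, 92.80, 24.10   # Design B

df = n1 + n2 - 2
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)   # Step 1: pooled SD
se = sp * math.sqrt(1 / n1 + 1 / n2)                          # Step 2: standard error
t_stat = (xbar2 - xbar1) / se                                 # Step 3: test statistic
p_value = t_upper_tail(t_stat, df)                            # Step 4: one-sided p-value

print(f"s_p={sp:.2f}  SE={se:.2f}  t={t_stat:.2f}  p={p_value:.4f}")
```

Small differences from hand calculations are expected, because the code carries full precision through every step instead of rounding between steps.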

Google Sheets for Two-Sample t-Test

Function: =T.TEST(array1, array2, tails, type)

Parameter | Meaning
array1 | First sample data range
array2 | Second sample data range
tails | 1 = one-sided, 2 = two-sided
type | 1 = paired, 2 = equal variance, 3 = unequal variance (Welch’s)

Example:

=T.TEST(A2:A251, B2:B251, 1, 2)

Returns the p-value directly for a one-sided, equal-variance test.

For the test statistic and CI manually:

// Pooled SD
=SQRT(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1+n2-2))

// SE
=sp * SQRT(1/n1 + 1/n2)

// t-statistic
=(xbar1 - xbar2) / SE

Effect Size: Cohen’s d

Statistical significance ≠ Practical importance.

Even a tiny difference can be “significant” with a large enough sample.

Cohen’s d measures the standardized difference between two means:

\[d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}\]

It tells us how far apart the means are in standard deviation units.

Cohen’s benchmarks:

Cohen’s d | Interpretation
0.2 | Small effect (difficult to notice)
0.5 | Medium effect (noticeable)
0.8 | Large effect (very noticeable)

Our example:

\[d = \frac{92.80 - 87.50}{23.22} = \frac{5.30}{23.22} = 0.23\]

Small effect (d is just above the 0.2 benchmark): statistically significant, but modest in practical terms.
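Cohen's d is easy to compute from summary statistics. In the sketch below, `cohens_d` and `effect_label` are hypothetical helpers. The labeling rule (each benchmark is treated as the lower bound of its category, and anything under 0.2 is called "below small") is an assumption for illustration, not part of Cohen's benchmarks.

```python
import math

def cohens_d(xbar1, s1, n1, xbar2, s2, n2):
    """Standardized mean difference (group 1 minus group 2) using the pooled SD."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (xbar1 - xbar2) / sp

def effect_label(d):
    """Rough label based on the 0.2 / 0.5 / 0.8 benchmarks (cutoff rule is an assumption)."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"

# Checkout example: Design B minus Design A
d = cohens_d(92.80, 24.10, 250, 87.50, 22.30, 250)
print(f"d={d:.2f} ({effect_label(d)})")
```

The benchmarks are rules of thumb; whether an extra $5.30 per purchase matters is ultimately a business judgment.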

THINK-PAIR-SHARE 2 (7 minutes)

Two-Sample t-Test

A software company tested two training methods for new employees.

  • Method A (n = 40): mean productivity score = 78.5, SD = 12.3
  • Method B (n = 40): mean productivity score = 82.7, SD = 11.8

Calculate:

  1. The pooled standard deviation \(s_p\)
  2. The standard error SE
  3. The t-statistic (assume equal variances, two-sided test, α = 0.05)
  4. The p-value using Google Sheets
  5. Cohen’s d — is this difference practically important?

Post on Ed Discussion with your partner’s name!

Share your answers in Poll Everywhere!

Is this difference practically important?

Comparing Two Proportions

When: Testing whether two groups differ on a binary outcome (success/failure)

Examples:

  • Conversion rates between two checkout designs
  • Default rates between two loan types
  • Customer satisfaction (satisfied / not satisfied) between two services

Hypotheses:

  • H₀: p₁ = p₂   (or p₁ − p₂ = 0)
  • Hₐ: p₁ ≠ p₂   (two-sided) or directional (one-sided)

Key difference from one-sample proportion test: We no longer have a known p₀. Instead, we estimate the common proportion under H₀ by pooling both samples.

Test Statistic for Two Proportions

Step 1: Pooled proportion (our best estimate of p under H₀)

\[\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}\]

Step 2: Standard error under H₀

\[SE = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}\]

Step 3: Test statistic

\[z = \frac{\hat{p}_1 - \hat{p}_2}{SE}\]

Under H₀, z follows the standard normal distribution, so we use z critical values and NORM.S.DIST for p-values.

Example: Conversion Rates

Data from the checkout design test:

  • Design A: 250 visitors, 47 completed purchases → p̂_A = 0.188
  • Design B: 250 visitors, 63 completed purchases → p̂_B = 0.252

Test: H₀: p_A = p_B  vs  Hₐ: p_B > p_A at α = 0.05

Step 1: Pooled proportion

\[\hat{p} = \frac{47 + 63}{250 + 250} = \frac{110}{500} = 0.220\]

Step 2: Standard error

\[SE = \sqrt{0.220 \times 0.780 \times \left(\frac{1}{250} + \frac{1}{250}\right)} = \sqrt{0.001373} = 0.0371\]

Example: Conversion Rates (cont.)

Step 3: Test statistic

\[z = \frac{0.252 - 0.188}{0.0371} = \frac{0.064}{0.0371} = 1.73\]

Step 4: P-value and decision

One-sided test at α = 0.05:

=1 - NORM.S.DIST(1.73, TRUE) ≈ 0.042

p-value = 0.042 < 0.05 → Reject H₀

Conclusion: There is significant evidence that Design B has a higher conversion rate than Design A.

Practical interpretation: Design B increases conversion by about 6.4 percentage points (18.8% → 25.2%).
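The conversion-rate test can also be checked with a short script. This is a sketch using Python's standard library; `two_proportion_z` is a hypothetical helper that follows the pooled-proportion steps above, with Design B passed as the first group so that a positive z supports Hₐ: p_B > p_A.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, SE, pooled proportion)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                               # Step 1
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))    # Step 2
    return (p1 - p2) / se, se, p_pool                            # Step 3

# Design B (63 of 250) compared with Design A (47 of 250)
z, se, p_pool = two_proportion_z(63, 250, 47, 250)
p_right = 1 - NormalDist().cdf(z)               # one-sided, right-tailed
p_two = 2 * (1 - NormalDist().cdf(abs(z)))      # two-sided, for comparison

print(f"p_pool={p_pool:.3f}  SE={se:.4f}  z={z:.2f}  one-sided p={p_right:.3f}  two-sided p={p_two:.3f}")
```

The two-sided p-value is twice the one-sided one, so the same data can be significant for a directional hypothesis and not for a two-sided one.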

Google Sheets for Two-Proportion Test

// Pooled proportion
=(x1 + x2) / (n1 + n2)

// Standard error
=SQRT(p_pool * (1 - p_pool) * (1/n1 + 1/n2))

// z-statistic
=(phat1 - phat2) / SE

// P-value (one-sided, right-tailed)
=1 - NORM.S.DIST(z, TRUE)

// P-value (two-sided)
=2 * (1 - NORM.S.DIST(ABS(z), TRUE))

Conditions to check before running this test:

  • Random, independent samples from both groups
  • \(n_1\hat{p} \geq 10\), \(n_1(1-\hat{p}) \geq 10\), and same for \(n_2\)
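The sample-size condition can be checked mechanically before running the test. A minimal sketch, with `success_failure_ok` as a hypothetical helper name and the threshold of 10 taken from the condition above:

```python
def success_failure_ok(x1, n1, x2, n2, threshold=10):
    """Check n*p_pool and n*(1 - p_pool) against the threshold for both groups."""
    p_pool = (x1 + x2) / (n1 + n2)
    counts = [n1 * p_pool, n1 * (1 - p_pool), n2 * p_pool, n2 * (1 - p_pool)]
    return all(c >= threshold for c in counts), counts

# Checkout example: 47 of 250 (Design A) and 63 of 250 (Design B)
ok, counts = success_failure_ok(47, 250, 63, 250)
print(ok, [round(c, 1) for c in counts])
```

This covers only the expected-count condition; whether the samples are random and independent depends on how the data were collected and cannot be checked from the counts.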

Choosing the Right Test

Scenario | Parameter | Test statistic | Distribution
One mean, σ known | μ | \(z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\) | z
One mean, σ unknown | μ | \(t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}\) | t (df = n−1)
One proportion | p | \(z = \frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}\) | z
Two means (equal var) | μ₁−μ₂ | \(t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1+1/n_2}}\) | t (df = n₁+n₂−2)
Two means (Welch’s) | μ₁−μ₂ | \(t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1+s_2^2/n_2}}\) | t (df approx.)
Two proportions | p₁−p₂ | \(z = \frac{\hat{p}_1 - \hat{p}_2}{SE_\text{pool}}\) | z

THINK-PAIR-SHARE 3 (7 minutes)

Two Proportions

A university tested two formats of an online course:

  • Format A (n = 180): 126 students passed the final exam
  • Format B (n = 180): 144 students passed the final exam

  1. Check conditions for the two-proportion z-test.
  2. Test at α = 0.05 whether pass rates differ between formats (all 4 steps).
  3. Calculate a 95% CI for p_B − p_A.
  4. Does the CI support your test conclusion? Why?
  5. Which format would you recommend, and why?

Use Google Sheets! Post on Ed Discussion with your partner’s name.

Share your answers in Poll Everywhere!

What is the z-statistic for the pass rate comparison?

Summary

Error types:

  • Type I (α): Reject a true H₀ — controlled by your choice of α
  • Type II (β): Fail to reject a false H₀ — reduced by larger n
  • Power = 1 − β: Probability of detecting a true effect

Comparing two groups:

Goal | Test | Key formula
Compare two means | Two-sample t | \(t = \frac{\bar{x}_1 - \bar{x}_2}{SE}\)
Quantify difference | Cohen’s d | \(d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}\)
Compare two proportions | Two-proportion z | \(z = \frac{\hat{p}_1 - \hat{p}_2}{SE_\text{pool}}\)

Always report statistical significance (p-value) AND effect size!

Quick Knowledge Check ✅

Rate your confidence (1–5) on Ed Discussion:

  1. Describing Type I and Type II errors in context ⭐⭐⭐⭐⭐
  2. Understanding the relationship between α, β, and power ⭐⭐⭐⭐⭐
  3. Distinguishing statistical from practical significance ⭐⭐⭐⭐⭐
  4. Conducting a two-sample t-test ⭐⭐⭐⭐⭐
  5. Interpreting Cohen’s d ⭐⭐⭐⭐⭐
  6. Conducting a two-proportion z-test ⭐⭐⭐⭐⭐
  7. Choosing the right test for a given scenario ⭐⭐⭐⭐⭐

If you rated anything 3 or below, come to office hours!

Thank you! 📊✨

Questions? I have office hours right after class!

Next up: One-way ANOVA — comparing more than two groups

Remember:

  • Post Think-Pair-Share responses on Ed Discussion and Poll Everywhere
  • Rate your confidence
  • Statistical significance + effect size = complete picture
  • Always check conditions before running any test