STAT 17: Error Types, Effect Sizes & Comparing Two Groups

Prof. Marcela Alfaro Cordoba

Statistics - UCSC

20 Nov 2025

What We’ll Accomplish Today

Quick recap: Last class we covered hypothesis testing — the four-step process, p-values, and confidence interval connections.

Today:

Type I and Type II errors (the cost of being wrong)
Statistical vs. practical significance
Common hypothesis testing mistakes
Comparing two independent means (two-sample t-test)
Effect sizes (Cohen’s d)
Comparing two independent proportions

Type I and Type II Errors

Four possible outcomes in hypothesis testing:

	H₀ True	H₀ False
Reject H₀	Type I Error (α)	Correct Decision (Power)
Fail to Reject H₀	Correct Decision	Type II Error (β)

Type I Error (False Positive):

Reject H₀ when H₀ is actually true
Probability = α (significance level)
Example: Approve an ineffective drug (when H₀ says the drug has no effect)

Type II Error (False Negative):

Fail to reject H₀ when H₀ is actually false
Probability = β
Example: Reject an effective drug (when H₀ says the drug has no effect)

Power = 1 − β: Probability of correctly rejecting a false H₀. Higher power is better!
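A quick simulation can make α concrete. The sketch below assumes a one-sample, two-sided z-test with known σ; the population values (μ₀ = 100, σ = 15), the sample size, the number of trials, and the random seed are illustrative choices, not part of the lecture. Every sample is drawn with H₀ true, so each rejection is a Type I error and the rejection rate should land near α.

```python
import random
from statistics import NormalDist

def type_i_error_rate(alpha=0.05, n=30, trials=4000, mu0=100.0, sigma=15.0, seed=17):
    """Estimate how often a two-sided z-test rejects H0 when H0 is actually true."""
    rng = random.Random(seed)                      # fixed seed so the run is repeatable
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population where H0 (mu = mu0) holds
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        z = (xbar - mu0) / (sigma / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1                        # a false positive
    return rejections / trials

rate = type_i_error_rate()
print(f"Estimated Type I error rate: {rate:.3f}")
```

Rerunning with `alpha=0.01` should pull the estimated rate down toward 1%, which is the sense in which we "choose" the Type I error rate.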

Understanding Errors in Context

Dr. Chen’s drug trial:

H₀: Drug reduces BP by ≥ 10 mmHg (effective)
H₁: Drug reduces BP by < 10 mmHg (not effective enough)

Type I Error (α):

Reject H₀ when true
Conclude drug is ineffective when it actually works
Consequence: Deny patients an effective treatment
Controlled directly by choosing α

Type II Error (β):

Fail to reject H₀ when false
Conclude drug works when it doesn’t meet the standard
Consequence: Approve an ineffective drug
Affected by n, effect size, α

The trade-off:

Decrease α → increase β (more conservative test)
Increase α → decrease β (more liberal test)
Increase n → β decreases while α stays where we set it, so a larger sample lets us keep both error rates low!

Choosing α: Context Matters

Type I error rate = α — we choose it directly, before seeing the data.

Life-or-death decision:

Use α = 0.01 (very conservative)
If the treatment is truly ineffective, only a 1% chance of approving it

Preliminary research:

Use α = 0.10 (more liberal)
10% false-positive rate acceptable at this stage

Standard research:

Use α = 0.05 (balanced)
5% chance of Type I error

Controlling β:

β depends on α, n, effect size, and σ
Larger n → smaller β → higher power
Researchers typically aim for power ≥ 0.80 (β ≤ 0.20)
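To see how α, n, effect size, and σ drive β, here is a minimal sketch of the power of a right-tailed, one-sample z-test with known σ. The z-test setting and the numbers (a true shift of 5 units, σ = 15) are assumptions chosen for illustration; power for a t-test would differ slightly.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a right-tailed one-sample z-test when the true mean is mu0 + effect."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    shift = effect * n ** 0.5 / sigma      # true shift measured in standard errors
    return 1 - std_normal.cdf(z_crit - shift)

def smallest_n_for_power(effect, sigma, target=0.80, alpha=0.05):
    """Smallest sample size whose power reaches the target (simple linear search)."""
    n = 2
    while z_test_power(effect, sigma, n, alpha) < target:
        n += 1
    return n

# Columns: n, power at alpha = 0.05, power at alpha = 0.01
for n in (20, 50, 100):
    print(n, round(z_test_power(5, 15, n), 3), round(z_test_power(5, 15, n, alpha=0.01), 3))
print("n needed for 80% power:", smallest_n_for_power(5, 15))
```

Reading across a row shows the α and β trade-off (smaller α, lower power); reading down a column shows power rising with n.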

Statistical vs. Practical Significance

Statistical significance: p-value < α

The observed effect would be unlikely if only chance were at work
Heavily influenced by sample size — large n can flag tiny effects

Practical significance: Effect size matters in the real world

The effect is large enough to actually care about
Independent of sample size

Example — Large study (n = 10,000):

Drug lowers BP by 1 mmHg → p-value < 0.001 (statistically significant!)
But 1 mmHg is clinically meaningless (not practically significant)

Example — Small study (n = 20):

Drug lowers BP by 15 mmHg → p-value = 0.08 (not statistically significant)
But 15 mmHg is very important clinically (practically significant!)

Always report: p-value and effect size together.
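The sketch below illustrates why the p-value alone is not enough. It holds the observed difference at 1 mmHg and only changes the sample size. It assumes two equal-sized groups, a right-tailed two-sample z-test, and a known common SD of 10 mmHg; the SD is a hypothetical value, since the lecture does not give one.

```python
from statistics import NormalDist

def one_sided_p_value(diff, sigma, n_per_group):
    """p-value of a right-tailed two-sample z-test with known, equal sigma."""
    se = sigma * (2 / n_per_group) ** 0.5
    return 1 - NormalDist().cdf(diff / se)

DIFF = 1.0    # mmHg, the same observed difference in every row
SIGMA = 10.0  # mmHg, hypothetical population SD

for n in (20, 200, 2000, 10000):
    p = one_sided_p_value(DIFF, SIGMA, n)
    print(f"n per group = {n:>6}: difference = {DIFF} mmHg, p = {p:.4f}")
```

The effect never changes, yet the p-value falls as n grows: significance tracks the sample size, while practical importance tracks the 1 mmHg.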

Common Mistakes in Hypothesis Testing

Mistake 1: “Accepting” H₀

✗ “We accept H₀”
✓ “We fail to reject H₀” / “Insufficient evidence against H₀”

Mistake 2: Wrong interpretation of p-value

✗ “p = 0.03 means there’s a 3% chance H₀ is true”
✓ “p = 0.03 means that, if H₀ is true, data at least this extreme would occur 3% of the time”

Mistake 3: Changing α after seeing results (“p-hacking”)

Mistake 4: Confusing significance with importance

Mistake 5: Conclusion not in context — always state what the decision means for the problem, not just “reject H₀”

🧘‍♀️ STRETCH BREAK

Time to move! (5 minutes)

Stand up and stretch 🤸‍♀️
Chat with neighbors about errors 💬
Grab some water 💧

Comparing Two Groups

Case Study: Retail A/B Testing

Scenario: A major online retailer is testing two checkout designs:

Design A (Control): Traditional multi-page checkout
Design B (Treatment): New single-page checkout

Questions we’ll answer today:

Does Design B increase the average purchase amount?
How large is the effect — is it practically meaningful?
Does Design B improve conversion rates?

When Do We Compare Two Means?

Testing whether two groups differ on a continuous outcome
A/B testing in business contexts
Clinical trials comparing treatments
Product testing and quality control

Key assumption: Two independent samples from two populations.

This is different from the one-sample tests we’ve been doing — now we have no known μ₀ to test against. We let the data from both groups speak.

The Two-Sample t-Test Framework

Hypotheses:

H₀: μ₁ = μ₂ (or equivalently, μ₁ − μ₂ = 0)
Hₐ: μ₁ ≠ μ₂ (two-sided)
Hₐ: μ₁ > μ₂ or μ₁ < μ₂ (one-sided)

Test statistic (under H₀):

\[t = \frac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)}\]

Two cases for SE — depending on whether we assume equal variances:

Case	Assumption	SE formula
Pooled	σ₁² = σ₂²	uses pooled $s_p$
Welch’s	σ₁² ≠ σ₂²	uses $s_1, s_2$ separately

Standard Error: Two Cases

Case 1: Equal Variances (pooled t-test)

\[SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \quad \text{where } s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}\]

Degrees of freedom: df = n₁ + n₂ − 2
$s_p^2$ is a weighted average of the two sample variances (weights = degrees of freedom); $s_p$ is its square root

Case 2: Unequal Variances (Welch’s t-test)

\[SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]

df is approximated (Welch-Satterthwaite formula — let Google Sheets handle it)
More conservative, safer when in doubt

In practice: When in doubt, use Welch’s; some software, such as R’s t.test, uses it by default.
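The two cases can be compared side by side from summary statistics. This is a minimal sketch: the function names and the sample numbers are hypothetical, and the Welch degrees of freedom use the standard Welch-Satterthwaite approximation, df = (v₁ + v₂)² / (v₁²/(n₁−1) + v₂²/(n₂−1)) with vᵢ = sᵢ²/nᵢ.

```python
import math

def pooled_se(n1, s1, n2, s2):
    """SE of (xbar1 - xbar2) assuming equal population variances; returns (se, df)."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # pooled variance
    return math.sqrt(sp2 * (1 / n1 + 1 / n2)), n1 + n2 - 2

def welch_se(n1, s1, n2, s2):
    """SE without the equal-variance assumption; df from Welch-Satterthwaite."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return math.sqrt(v1 + v2), df

# Hypothetical summary statistics: unequal group sizes and unequal spreads
print("pooled:", pooled_se(40, 12.0, 15, 25.0))
print("welch: ", welch_se(40, 12.0, 15, 25.0))
```

When the group sizes and SDs are similar, the two SEs nearly coincide; they diverge when the smaller group has the larger spread, which is where Welch’s version is the safer choice.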

Example: Purchase Amounts

Data from the checkout design test:

Design A: n₁ = 250, x̄₁ = $87.50, s₁ = $22.30
Design B: n₂ = 250, x̄₂ = $92.80, s₂ = $24.10

Test: H₀: μ_A = μ_B vs Hₐ: μ_B > μ_A at α = 0.05 (one-sided, assuming equal variances)

Step 1: Pooled standard deviation

\[s_p = \sqrt{\frac{(249)(22.30)^2 + (249)(24.10)^2}{498}} = \sqrt{\frac{123{,}825 + 144{,}622}{498}} = \sqrt{539.05} = 23.22\]

Step 2: Standard error

\[SE = 23.22 \times \sqrt{\frac{1}{250} + \frac{1}{250}} = 23.22 \times 0.0894 = 2.08\]

Example: Purchase Amounts (cont.)

Step 3: Test statistic

\[t = \frac{92.80 - 87.50}{2.08} = \frac{5.30}{2.08} = 2.55\]

Step 4: P-value and decision

df = 498, one-sided test at α = 0.05:

=1 - T.DIST(2.55, 498, TRUE) ≈ 0.0055

p-value = 0.0055 < 0.05 → Reject H₀

Conclusion: There is significant evidence that Design B leads to higher average purchase amounts than Design A.

Practical interpretation: The new checkout design increases average purchases by about $5.30.
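For anyone who wants to check Steps 1 to 4 outside a spreadsheet, here is a sketch that recomputes them from the summary statistics using only Python’s standard library. The standard library has no t distribution, so the upper-tail probability is obtained by numerically integrating the t density with Simpson’s rule; that helper is an illustrative stand-in for `T.DIST`, not a library function.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3          # half the mass lies above 0 by symmetry

# Summary statistics from the checkout example
n_a, mean_a, sd_a = 250, 87.50, 22.30
n_b, mean_b, sd_b = 250, 92.80, 24.10

df = n_a + n_b - 2
sp = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df)   # Step 1
se = sp * math.sqrt(1 / n_a + 1 / n_b)                             # Step 2
t_stat = (mean_b - mean_a) / se                                    # Step 3
p_value = t_upper_tail(t_stat, df)                                 # Step 4, one-sided
print(f"sp = {sp:.2f}, SE = {se:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```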

Google Sheets for Two-Sample t-Test

Function: =T.TEST(array1, array2, tails, type)

Parameter	Meaning
`array1`	First sample data range
`array2`	Second sample data range
`tails`	1 = one-sided, 2 = two-sided
`type`	1 = paired, 2 = equal variance, 3 = unequal variance (Welch’s)

Example:

=T.TEST(A2:A251, B2:B251, 1, 2)

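Returns the p-value directly for a one-sided, equal-variance test. The function has no argument for the direction of Hₐ, so confirm that the sample means differ in the direction Hₐ claims before using the one-sided result.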

For the test statistic manually (the same SE feeds a confidence interval):

// Pooled SD
=SQRT(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1+n2-2))

// SE
=sp * SQRT(1/n1 + 1/n2)

// t-statistic
=(xbar1 - xbar2) / SE

Effect Size: Cohen’s d

Statistical significance ≠ Practical importance.

Even a tiny difference can be “significant” with a large enough sample.

Cohen’s d measures the standardized difference between two means:

\[d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}\]

It tells us how far apart the means are in standard deviation units.

Cohen’s benchmarks:

Cohen’s d	Interpretation
0.2	Small effect — difficult to notice
0.5	Medium effect — noticeable
0.8	Large effect — very noticeable

Our example:

\[d = \frac{92.80 - 87.50}{s_p} = \frac{5.30}{23.2} \approx 0.23\]

Small effect (just above the 0.2 benchmark): statistically significant, but modest in practical terms.
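A small helper makes the calculation reusable. This is a sketch under two assumptions that are not from the lecture: the difference is taken as group 2 minus group 1, and the benchmark values 0.2, 0.5, and 0.8 are treated as lower cutoffs for each label, which is only one rough way to read them.

```python
import math

def cohens_d(n1, mean1, sd1, n2, mean2, sd2):
    """Standardized mean difference (group 2 minus group 1) using the pooled SD."""
    sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean2 - mean1) / sp

def benchmark_label(d):
    """Rough label from Cohen's benchmarks; the cutoffs are an illustrative assumption."""
    size = abs(d)
    if size < 0.2:
        return "below small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

# Design A vs. Design B purchase amounts
d = cohens_d(250, 87.50, 22.30, 250, 92.80, 24.10)
print(f"d = {d:.2f} ({benchmark_label(d)})")
```

Unlike the t-statistic, d does not grow with n: doubling both sample sizes with the same means and SDs leaves it essentially unchanged.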

Comparing Two Proportions

When: Testing whether two groups differ on a binary outcome (success/failure)

Examples:

Conversion rates between two checkout designs
Default rates between two loan types
Customer satisfaction (satisfied / not satisfied) between two services

Hypotheses:

H₀: p₁ = p₂ (or p₁ − p₂ = 0)
Hₐ: p₁ ≠ p₂ (two-sided) or directional (one-sided)

Key difference from one-sample proportion test: We no longer have a known p₀. Instead, we estimate the common proportion under H₀ by pooling both samples.

Test Statistic for Two Proportions

Step 1: Pooled proportion (our best estimate of p under H₀)

\[\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}\]

Step 2: Standard error under H₀

\[SE = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}\]

Step 3: Test statistic

\[z = \frac{\hat{p}_1 - \hat{p}_2}{SE}\]

Under H₀, z follows the standard normal distribution, so we use z critical values and NORM.S.DIST for p-values.

Example: Conversion Rates

Data from the checkout design test:

Design A: 250 visitors, 47 completed purchases → p̂_A = 0.188
Design B: 250 visitors, 63 completed purchases → p̂_B = 0.252

Test: H₀: p_A = p_B vs Hₐ: p_B > p_A at α = 0.05

Step 1: Pooled proportion

\[\hat{p} = \frac{47 + 63}{250 + 250} = \frac{110}{500} = 0.220\]

Step 2: Standard error

\[SE = \sqrt{0.220 \times 0.780 \times \left(\frac{1}{250} + \frac{1}{250}\right)} = \sqrt{0.001373} = 0.0371\]

Example: Conversion Rates (cont.)

Step 3: Test statistic

\[z = \frac{0.252 - 0.188}{0.0371} = \frac{0.064}{0.0371} = 1.73\]

Step 4: P-value and decision

One-sided test at α = 0.05:

=1 - NORM.S.DIST(1.73, TRUE) ≈ 0.042

p-value = 0.042 < 0.05 → Reject H₀

Conclusion: There is significant evidence that Design B has a higher conversion rate than Design A.

Practical interpretation: Design B increases conversion by about 6.4 percentage points (18.8% → 25.2%).
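The same steps can be wrapped in one function. This is a minimal sketch using Python’s standard-library `NormalDist`; the function name and the convention of testing Hₐ: p₂ > p₁ (second group larger) are choices made here for illustration.

```python
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Right-tailed test of H0: p1 = p2 against Ha: p2 > p1, using the pooled SE."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                              # Step 1
    se = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5     # Step 2
    z = (p2_hat - p1_hat) / se                                  # Step 3
    p_value = 1 - NormalDist().cdf(z)                           # Step 4, one-sided
    return z, p_value

# Design A: 47 of 250 converted; Design B: 63 of 250 converted
z, p_value = two_proportion_z(47, 250, 63, 250)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

Carrying full precision through every step avoids the small drift that comes from rounding the SE before dividing.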

Google Sheets for Two-Proportion Test

// Pooled proportion
=(x1 + x2) / (n1 + n2)

// Standard error
=SQRT(p_pool * (1 - p_pool) * (1/n1 + 1/n2))

// z-statistic
=(phat1 - phat2) / SE

// P-value (one-sided, right-tailed)
=1 - NORM.S.DIST(z, TRUE)

// P-value (two-sided)
=2 * (1 - NORM.S.DIST(ABS(z), TRUE))

Conditions to check before running this test:

Random, independent samples from both groups
$n_1\hat{p} \geq 10$, $n_1(1-\hat{p}) \geq 10$, and same for $n_2$
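The success/failure condition is easy to automate. The sketch below assumes the pooled proportion is used for the check, as in the condition above; the function name and the small-sample counterexample are hypothetical.

```python
def success_failure_ok(x1, n1, x2, n2, threshold=10):
    """Check expected success and failure counts in both groups using the pooled proportion."""
    p_pool = (x1 + x2) / (n1 + n2)
    counts = [n1 * p_pool, n1 * (1 - p_pool), n2 * p_pool, n2 * (1 - p_pool)]
    return all(c >= threshold for c in counts), counts

# Checkout example: all four expected counts are well above 10
ok, counts = success_failure_ok(47, 250, 63, 250)
print(ok, [round(c, 1) for c in counts])

# Hypothetical small study: too few expected successes for the normal approximation
print(success_failure_ok(2, 30, 3, 30)[0])
```

Randomness and independence cannot be checked numerically; they depend on how the data were collected.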

Choosing the Right Test

Scenario	Parameter	Test statistic	Distribution
One mean, σ known	μ	$z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}$	z
One mean, σ unknown	μ	$t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}$	t (df = n−1)
One proportion	p	$z = \frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}$	z
Two means (equal var)	μ₁−μ₂	$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1+1/n_2}}$	t (df = n₁+n₂−2)
Two means (Welch’s)	μ₁−μ₂	$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1+s_2^2/n_2}}$	t (df approx.)
Two proportions	p₁−p₂	$z = \frac{\hat{p}_1 - \hat{p}_2}{SE_\text{pool}}$	z

Summary

Error types:

Type I (α): Reject a true H₀ — controlled by your choice of α
Type II (β): Fail to reject a false H₀ — reduced by larger n
Power = 1 − β: Probability of detecting a true effect

Comparing two groups:

Goal	Test	Key formula
Compare two means	Two-sample t	$t = \frac{\bar{x}_1 - \bar{x}_2}{SE}$
Quantify difference	Cohen’s d	$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$
Compare two proportions	Two-proportion z	$z = \frac{\hat{p}_1 - \hat{p}_2}{SE_\text{pool}}$

Always report statistical significance (p-value) AND effect size!

Quick Knowledge Check ✅

Rate your confidence (1–5) on Ed Discussion:

Describing Type I and Type II errors in context ⭐⭐⭐⭐⭐
Understanding the relationship between α, β, and power ⭐⭐⭐⭐⭐
Distinguishing statistical from practical significance ⭐⭐⭐⭐⭐
Conducting a two-sample t-test ⭐⭐⭐⭐⭐
Interpreting Cohen’s d ⭐⭐⭐⭐⭐
Conducting a two-proportion z-test ⭐⭐⭐⭐⭐
Choosing the right test for a given scenario ⭐⭐⭐⭐⭐

If you rated anything 3 or below, come to office hours!

Thank you! 📊✨

Questions? I have office hours right after class!

Next up: One-way ANOVA — comparing more than two groups

Remember:

Post Think-Pair-Share responses on Ed Discussion and Poll Everywhere
Rate your confidence
Statistical significance + effect size = complete picture
Always check conditions before running any test
