STAT 17: Comparing two means and testing for independence

Prof. Marcela Alfaro Cordoba

Statistics - UCSC

18 Nov 2025

Today’s Learning Objectives

By the end of this lecture, you will be able to:

  • Compare two independent population means using appropriate tests
  • Understand and apply Cohen’s standards for effect sizes
  • Test for differences in means assuming equal population variances
  • Compare two independent population proportions
  • Conduct tests with known population standard deviations
  • Understand properties and applications of chi-square distribution
  • Test for independence between categorical variables

Retail A/B Testing

Scenario: A major online retailer is testing two checkout designs:

  • Design A (Control): Traditional multi-page checkout
  • Design B (Treatment): New single-page checkout

Questions we’ll answer:

  1. Does Design B increase average purchase amount?
  2. How large is the effect of the new design?
  3. Does Design B improve conversion rates?
  4. Is customer satisfaction independent of checkout design?

Part 1: Comparing Two Means

When do we compare two means?

  • Testing if two groups differ on a continuous outcome
  • A/B testing in business contexts
  • Clinical trials comparing treatments
  • Product testing and quality control

Key assumption: Two independent samples from two populations

The Two-Sample t-Test Framework

Hypotheses:

  • \(H_0: \mu_1 = \mu_2\) (or \(\mu_1 - \mu_2 = 0\))
  • \(H_a: \mu_1 \neq \mu_2\) (two-sided)
  • \(H_a: \mu_1 > \mu_2\) or \(H_a: \mu_1 < \mu_2\) (one-sided)

Test Statistic:

\[t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE(\bar{x}_1 - \bar{x}_2)}\]

Under \(H_0\): \((\mu_1 - \mu_2) = 0\), so:

\[t = \frac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)}\]

Standard Error: Two Cases

Case 1: Equal Variances (\(\sigma_1^2 = \sigma_2^2\))

\[SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\]

where pooled standard deviation:

\[s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}\]

Case 2: Unequal Variances (Welch’s t-test)

\[SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]
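
The two standard-error formulas can be compared side by side in a short Python sketch. The function names and the summary statistics used here are illustrative assumptions, not part of the lecture data.

```python
import math

def pooled_se(n1, s1, n2, s2):
    """SE of (x1_bar - x2_bar) assuming equal population variances (Case 1)."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return sp * math.sqrt(1 / n1 + 1 / n2)

def welch_se(n1, s1, n2, s2):
    """SE of (x1_bar - x2_bar) without the equal-variance assumption (Case 2)."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

# Hypothetical summary statistics: unequal sample sizes and unequal spreads
print(round(pooled_se(30, 4.0, 50, 9.0), 3))
print(round(welch_se(30, 4.0, 50, 9.0), 3))
```

When \(n_1 = n_2\), the two formulas give the same standard error; they differ only when the sample sizes differ.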

Example: Purchase Amounts

Data from our checkout design test:

  • Design A: \(n_1 = 250\), \(\bar{x}_1 = \$87.50\), \(s_1 = \$22.30\)
  • Design B: \(n_2 = 250\), \(\bar{x}_2 = \$92.80\), \(s_2 = \$24.10\)

Test: \(H_0: \mu_A = \mu_B\) vs \(H_a: \mu_B > \mu_A\) at \(\alpha = 0.05\)

Let’s assume equal variances for simplicity.

Calculating the Test Statistic

Step 1: Calculate pooled standard deviation

\[s_p = \sqrt{\frac{(250-1)(22.30)^2 + (250-1)(24.10)^2}{250 + 250 - 2}}\]

\[s_p = \sqrt{\frac{123,825.2 + 144,621.7}{498}} = \sqrt{539.05} = 23.22\]

Calculating the Test Statistic (cont.)

Step 2: Calculate standard error

\[SE = 23.22 \sqrt{\frac{1}{250} + \frac{1}{250}} = 23.22 \times 0.0894 = 2.08\]

Step 3: Calculate t-statistic

\[t = \frac{92.80 - 87.50}{2.08} = \frac{5.30}{2.08} = 2.55\]

Making a Decision

Degrees of freedom: \(df = n_1 + n_2 - 2 = 498\)

For one-sided test at \(\alpha = 0.05\): Critical value ≈ 1.645

Our test statistic: \(t = 2.55 > 1.645\)

Conclusion: Reject \(H_0\). There is significant evidence that Design B leads to higher average purchase amounts than Design A.

Practical interpretation: The new checkout design increases average purchases by about $5.30.
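
As a cross-check, the three calculation steps can be reproduced from the summary statistics. This is a minimal sketch using only Python's standard library; the variable names are my own, and the critical value 1.645 is the approximate value quoted above rather than something the code computes.

```python
import math

# Summary statistics from the checkout example
n_a, mean_a, sd_a = 250, 87.50, 22.30   # Design A (control)
n_b, mean_b, sd_b = 250, 92.80, 24.10   # Design B (treatment)

# Step 1: pooled standard deviation
sp = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))

# Step 2: standard error of the difference in means
se = sp * math.sqrt(1 / n_a + 1 / n_b)

# Step 3: t-statistic for H_a: mu_B > mu_A
t_stat = (mean_b - mean_a) / se
df = n_a + n_b - 2

t_crit = 1.645  # approximate one-sided critical value at alpha = 0.05
reject = t_stat > t_crit
print(f"s_p = {sp:.2f}, SE = {se:.2f}, t = {t_stat:.2f}, df = {df}, reject H0: {reject}")
```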

Google Sheets for Two-Sample t-Test

Function: =T.TEST(array1, array2, tails, type)

Parameters:

  • array1: First sample data range
  • array2: Second sample data range
  • tails: 1 for one-sided, 2 for two-sided
  • type: 1 for paired, 2 for equal variance, 3 for unequal variance

Example:

=T.TEST(A2:A251, B2:B251, 1, 2)

Returns p-value for one-sided test with equal variances

Effect Size: Beyond Statistical Significance

Statistical significance ≠ Practical importance

Effect Size measures the magnitude of a difference in standardized units

Cohen’s d:

\[d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}\]

Standardized mean difference (in standard deviations)

Cohen’s Standards for Effect Sizes

Interpretation of Cohen’s d:

Effect Size   Cohen’s d   Interpretation
Small         0.2         Difficult to detect
Medium        0.5         Noticeable difference
Large         0.8         Very noticeable

Our example:

\[d = \frac{92.80 - 87.50}{23.22} = \frac{5.30}{23.22} = 0.23\]

Small effect size (just above Cohen’s 0.2 benchmark): statistically significant but modest practical impact
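
The effect-size calculation can be sketched in Python as follows. The helper names are hypothetical, and treating Cohen's benchmarks (0.2, 0.5, 0.8) as hard cutoffs between labels is a simplifying assumption; in practice they are rough guides.

```python
import math

def cohens_d(mean1, mean2, n1, s1, n2, s2):
    """Standardized mean difference using the pooled standard deviation."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / sp

def label_effect(d):
    """Rough label based on Cohen's benchmarks (assumed cutoffs)."""
    size = abs(d)
    if size < 0.2:
        return "below small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

# Design B minus Design A, using the checkout example's summary statistics
d = cohens_d(92.80, 87.50, 250, 24.10, 250, 22.30)
print(round(d, 2), label_effect(d))
```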

THINK-PAIR-SHARE 1 (7 minutes)

Poll Everywhere Time!

Question: A software company tested two training methods for new employees. Method A (n=40): mean productivity score = 78.5, SD = 12.3. Method B (n=40): mean = 82.7, SD = 11.8.

Calculate:

  1. The pooled standard deviation
  2. The t-statistic (assume equal variances)
  3. Cohen’s d

Is this difference practically important?

Discuss with your neighbor (3 minutes), then submit your answer!

Comparing Two Proportions

When: Testing if two groups differ on a binary outcome (success/failure)

Examples:

  • Conversion rates between two designs
  • Default rates between two loan types
  • Customer satisfaction (satisfied/not satisfied) between services

Hypotheses:

  • \(H_0: p_1 = p_2\) or \(p_1 - p_2 = 0\)
  • \(H_a: p_1 \neq p_2\) (two-sided)
  • \(H_a: p_1 > p_2\) or \(H_a: p_1 < p_2\) (one-sided)

Test Statistic for Two Proportions

Pooled proportion: \(\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}\)

Standard error under \(H_0\):

\[SE = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}\]

Test statistic:

\[z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{SE}\]

Under \(H_0\), \(z \sim N(0,1)\)

Example: Conversion Rates

Data from checkout design test:

  • Design A: 250 visitors, 47 completed purchases → \(\hat{p}_A = 0.188\)
  • Design B: 250 visitors, 63 completed purchases → \(\hat{p}_B = 0.252\)

Test: \(H_0: p_A = p_B\) vs \(H_a: p_B > p_A\) at \(\alpha = 0.05\)

Step 1: Pooled proportion

\[\hat{p} = \frac{47 + 63}{250 + 250} = \frac{110}{500} = 0.220\]

Calculating the Test (cont.)

Step 2: Standard error

\[SE = \sqrt{0.220(1-0.220)\left(\frac{1}{250} + \frac{1}{250}\right)}\]

\[SE = \sqrt{0.1716 \times 0.008} = \sqrt{0.001373} = 0.0371\]

Step 3: Test statistic

\[z = \frac{0.252 - 0.188}{0.0371} = \frac{0.064}{0.0371} = 1.73\]

Making a Decision

For one-sided test at \(\alpha = 0.05\): Critical value = 1.645

Our test statistic: \(z = 1.73 > 1.645\)

Conclusion: Reject \(H_0\). There is significant evidence that Design B has a higher conversion rate than Design A.

Practical interpretation: Design B increases conversion rate by about 6.4 percentage points (from 18.8% to 25.2%).
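
The same test can be sketched in Python with only the standard library. The function name is an assumption for illustration; `statistics.NormalDist` supplies the standard normal CDF. Without rounding intermediate values, z comes out near 1.73.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """z-statistic for H0: p1 = p2, using the pooled proportion for the SE."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Design B first, so that H_a: p_B > p_A corresponds to a large positive z
z = two_proportion_z(63, 250, 47, 250)

# One-sided p-value: P(Z > z) under the standard normal
p_one_sided = 1 - NormalDist().cdf(z)
print(f"z = {z:.2f}, one-sided p-value = {p_one_sided:.4f}")
```

A one-sided p-value below 0.05 matches the decision reached with the critical value 1.645.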

Google Sheets for Proportion Tests

Manual calculation approach:

// Pooled proportion
=(x1 + x2)/(n1 + n2)

// Standard error
=SQRT(pooled*(1-pooled)*(1/n1 + 1/n2))

// Z-statistic
=(p1 - p2)/SE

// P-value (one-sided)
=1 - NORM.S.DIST(z, TRUE)

// P-value (two-sided)
=2*(1 - NORM.S.DIST(ABS(z), TRUE))

🧘‍♀️ STRETCH BREAK

Time to move! (5 minutes)

  • Stand up and stretch 🤸‍♀️
  • Chat with neighbors about differences of proportions 💬
  • Grab some water 💧

Welcome Back!

Quick recap of Part 1:

  • Two-sample t-tests for comparing means
  • Effect sizes (Cohen’s d) for practical significance
  • Two-proportion z-tests for comparing rates

Now: What if we have more than two categories?

Part 2: The Chi-Square Distribution

The chi-square (\(\chi^2\)) distribution:

  • Only takes positive values
  • Skewed right (especially for small df)
  • Defined by degrees of freedom (df)
  • Used for testing with categorical data

Notation: \(\chi^2_{df}\) or \(\chi^2(df)\)

Properties of Chi-Square Distribution

Key properties:

  1. Mean: \(E(\chi^2_{df}) = df\)
  2. Variance: \(Var(\chi^2_{df}) = 2 \times df\)
  3. As df increases, distribution becomes more symmetric
  4. Sum of squared standard normals: If \(Z_1, \dots, Z_{df}\) are independent \(N(0,1)\), then \(\sum_{i=1}^{df} Z_i^2 \sim \chi^2_{df}\)

Shape depends on df:

  • Small df (1-3): Very right-skewed
  • Medium df (10-20): Moderately skewed
  • Large df (>30): Approaches normal
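
The mean, variance, and sum-of-squares properties can be checked empirically with a small simulation. This is a sketch under assumptions of my own (function name, number of draws, and seed); it builds chi-square draws directly as sums of squared standard normals.

```python
import random
from statistics import mean, variance

def simulate_chi_square(df, draws=20000, seed=17):
    """Draw chi-square values as sums of df squared independent N(0,1) variables."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(draws)]

for df in (2, 10, 30):
    sample = simulate_chi_square(df)
    # Sample mean should be close to df, sample variance close to 2 * df
    print(df, round(mean(sample), 2), round(variance(sample), 2))
```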

Chi-Square Test for Independence

Question: Are two categorical variables related?

Example: Is customer satisfaction level independent of checkout design?

Contingency Table:

            Very Satisfied   Satisfied   Neutral   Dissatisfied
Design A    45               102         68        35
Design B    72               115         48        15

Hypotheses for Independence Test

Hypotheses:

  • \(H_0\): The two variables are independent
  • \(H_a\): The two variables are associated (dependent)

Test Statistic:

\[\chi^2 = \sum_{all\ cells} \frac{(O - E)^2}{E}\]

where:

  • O = Observed frequency
  • E = Expected frequency under independence

Calculating Expected Frequencies

Formula:

\[E_{ij} = \frac{(\text{Row}_i\ \text{Total}) \times (\text{Column}_j\ \text{Total})}{\text{Grand Total}}\]

Our example:

            Very Satisfied   Satisfied   Neutral   Dissatisfied   Total
Design A    45               102         68        35             250
Design B    72               115         48        15             250
Total       117              217         116       50             500

Expected Frequencies Calculation

For Design A, Very Satisfied:

\[E_{11} = \frac{250 \times 117}{500} = \frac{29,250}{500} = 58.5\]

Complete expected frequency table:

            Very Satisfied   Satisfied   Neutral   Dissatisfied
Design A    58.5             108.5       58.0      25.0
Design B    58.5             108.5       58.0      25.0

Note: Row totals match observed (250 each)
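
The expected-frequency formula can be applied to every cell at once. The sketch below is plain Python; the function name and the list-of-rows layout are assumptions made for illustration.

```python
# Observed counts: rows are Design A and Design B; columns are
# Very Satisfied, Satisfied, Neutral, Dissatisfied
observed = [
    [45, 102, 68, 35],
    [72, 115, 48, 15],
]

def expected_counts(table):
    """E_ij = (row i total) * (column j total) / (grand total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

expected = expected_counts(observed)
for row in expected:
    print(row)
```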

Calculating Chi-Square Statistic

\[\chi^2 = \sum_{all\ cells} \frac{(O - E)^2}{E}\]

Design A, Very Satisfied: \(\frac{(45-58.5)^2}{58.5} = \frac{182.25}{58.5} = 3.12\)

Design A, Satisfied: \(\frac{(102-108.5)^2}{108.5} = \frac{42.25}{108.5} = 0.39\)

Design A, Neutral: \(\frac{(68-58.0)^2}{58.0} = \frac{100}{58.0} = 1.72\)

Design A, Dissatisfied: \(\frac{(35-25.0)^2}{25.0} = \frac{100}{25.0} = 4.00\)

Calculating Chi-Square Statistic (cont.)

Design B, Very Satisfied: \(\frac{(72-58.5)^2}{58.5} = \frac{182.25}{58.5} = 3.12\)

Design B, Satisfied: \(\frac{(115-108.5)^2}{108.5} = \frac{42.25}{108.5} = 0.39\)

Design B, Neutral: \(\frac{(48-58.0)^2}{58.0} = \frac{100}{58.0} = 1.72\)

Design B, Dissatisfied: \(\frac{(15-25.0)^2}{25.0} = \frac{100}{25.0} = 4.00\)

\[\chi^2 = (3.12 + 0.39 + 1.72 + 4.00) + (3.12 + 0.39 + 1.72 + 4.00) = 18.46\]

Degrees of Freedom and Decision

Degrees of freedom:

\[df = (r-1)(c-1)\]

where r = number of rows, c = number of columns

Our example: \(df = (2-1)(4-1) = 3\)

For \(\alpha = 0.05\) and \(df = 3\): Critical value = 7.815

Our test statistic: \(\chi^2 = 18.46 > 7.815\)

Conclusion: Reject \(H_0\). There is significant evidence that customer satisfaction and checkout design are associated.

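Interpretation: Under Design B, more customers than expected are Very Satisfied (72 vs. 58.5) and fewer are Dissatisfied (15 vs. 25.0), so Design B is associated with higher satisfaction levels.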
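
The full test can be sketched in a few lines of Python. The function name is hypothetical, and the critical value 7.815 is taken from the slide above rather than computed by the code.

```python
observed = [
    [45, 102, 68, 35],   # Design A
    [72, 115, 48, 15],   # Design B
]

def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected count
            stat += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

stat, df = chi_square_independence(observed)
critical = 7.815  # alpha = 0.05, df = 3 (value quoted in the lecture)
print(f"chi-square = {stat:.2f}, df = {df}, reject H0: {stat > critical}")
```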

Google Sheets for Chi-Square Test

Function: =CHISQ.TEST(actual_range, expected_range)

Returns p-value for the test

For critical value: =CHISQ.INV.RT(alpha, df)

For p-value from statistic: =CHISQ.DIST.RT(chi_square, df)

Example:

// P-value
=CHISQ.DIST.RT(18.46, 3)

// Critical value
=CHISQ.INV.RT(0.05, 3)

THINK-PAIR-SHARE 2 (7 minutes)

Poll Everywhere Time!

Question: A company surveys 400 employees about work preference (office/hybrid/remote) across 3 departments. Here’s the data:

         Office   Hybrid   Remote
Sales    30       45       25
Tech     15       50       85
Admin    40       55       55

Calculate the expected frequency for Sales-Office cell and the contribution to chi-square for that cell.

Work with your neighbor (4 minutes), then submit!

Assumptions for Chi-Square Test

Requirements for valid test:

  1. Random sample: Data collected randomly
  2. Independence: Observations are independent
  3. Expected frequencies: All expected counts ≥ 5
  4. Categorical data: Variables are categorical (not continuous)

What if the expected-count condition isn’t met?

  • Fisher’s exact test for small samples
  • Combine categories if some cells have low counts
  • Use simulation-based methods

Interpreting Chi-Square Results

What does rejection of \(H_0\) tell us?

  • Variables are associated (not independent)
  • Does NOT tell us direction of association
  • Does NOT tell us strength of association
  • Does NOT imply causation

To understand the relationship:

  • Examine residuals: \((O-E)/\sqrt{E}\)
  • Look at cell contributions to \(\chi^2\)
  • Calculate measures of association (Cramér’s V, odds ratios)
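
For the checkout example, the residuals and one measure of association can be sketched as follows. The function names are my own, and the code uses the standard definition of Cramér's V, \(V = \sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}\), which this lecture mentions but does not define.

```python
import math

observed = [
    [45, 102, 68, 35],   # Design A
    [72, 115, 48, 15],   # Design B
]

def pearson_residuals(table):
    """(O - E) / sqrt(E) for every cell; the sign shows the direction of the deviation."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [
        [(o - rt * ct / n) / math.sqrt(rt * ct / n) for o, ct in zip(row, col_totals)]
        for row, rt in zip(table, row_totals)
    ]

def cramers_v(table):
    """Strength of association on a 0-to-1 scale."""
    residuals = pearson_residuals(table)
    chi2 = sum(r ** 2 for row in residuals for r in row)  # squared residuals sum to chi-square
    n = sum(sum(row) for row in table)
    k = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(chi2 / (n * k))

for row in pearson_residuals(observed):
    print([round(r, 2) for r in row])
print(round(cramers_v(observed), 2))
```

Positive residuals mark cells with more observations than independence predicts; negative residuals mark cells with fewer.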

Looking Ahead

Next lecture:

  • One-way Analysis of Variance (ANOVA)
  • F distribution and F-ratio
  • Comparing more than two means
  • Post-hoc tests

This builds on:

  • Today’s two-sample tests
  • Understanding of hypothesis testing
  • Comparing multiple groups simultaneously

Quick Knowledge Check ✅

Rate your confidence (1-5) on Ed Discussion:

  1. Conducting two-sample t-tests ⭐⭐⭐⭐⭐
  2. Interpreting effect sizes ⭐⭐⭐⭐⭐
  3. Testing two proportions ⭐⭐⭐⭐⭐
  4. Understanding chi-square distribution ⭐⭐⭐⭐⭐
  5. Testing for independence ⭐⭐⭐⭐⭐

Thank you! 📊✨

Questions? I have office hours right after class today!

Next up: ANOVA and Linear Regression

Remember:

  • Post Think-Pair-Share on Ed Discussion and Poll Everywhere
  • Rate your confidence
  • Statistical significance + Effect size = Complete picture