Statistics - UCSC
04 Nov 2025
Meet Chloe, a quality control manager at a smartphone manufacturer.
Her challenge: She can’t test every phone (population of millions), so she must:
The problem: How can conclusions from 100 phones apply to millions?
The answer: The Central Limit Theorem (CLT) - one of the most powerful ideas in statistics! Understanding CLT helps Chloe (and you!) make reliable inferences from samples.
What we learned last time:
Key formulas:
=NORM.DIST(x, μ, σ, TRUE)
=NORM.INV(probability, μ, σ)
Today: We extend this to samples and introduce statistical inference!
By the end of this lecture, you will be able to:
Fundamental Challenge in Statistics:
Example: Chloe’s battery life problem
This is statistical inference!
| Concept | Population | Sample |
|---|---|---|
| Size | N (usually unknown) | n (we choose this) |
| Mean | μ (unknown parameter) | x̄ (known statistic) |
| Std Dev | σ (usually unknown) | s (calculated from data) |
| Proportion | p (unknown parameter) | p̂ (known statistic) |
Key Terms:
Let’s experience the CLT with YOUR data!
PART 1: Individual Data (2 min)
Go to: [bit.ly/stat17-heights]
PART 2: Sample Means (3 min)
In groups of 5:
DISCUSSION (2 min): What’s different about the two distributions?
Compare the distribution of individual heights to group averages:
The Central Limit Theorem (CLT) states:
For a random sample of size n from ANY population with mean μ and standard deviation σ:
As n gets large, the sampling distribution of x̄ is approximately normal:
\[\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\]
Here the second argument is the variance. Equivalently, x̄ is approximately normal with mean μ and standard deviation \(\frac{\sigma}{\sqrt{n}}\).
This is AMAZING because:
What the CLT tells us:
The Standard Error (SE):
\[SE = \frac{\sigma}{\sqrt{n}}\]
Critical insight: The variability of x̄ shrinks in proportion to 1/√n, not 1/n! To cut the SE in half, you need four times the sample size.
Two scenarios where CLT works:
Scenario 1: Population is already normal
Then x̄ is exactly normal for ANY sample size n
No minimum n required!
Scenario 2: Population is not normal
Need “large enough” sample size
Rule of thumb: n ≥ 30 usually sufficient
More skewed population → need larger n
Additional requirements:
In practice: Check these conditions before using CLT!
Example: Population is strongly right-skewed
The magic: Regardless of population shape, x̄ becomes normal!
Standard Error gets smaller:
n=5: SE large, x̄ values spread out
n=30: SE smaller, x̄ values cluster around μ
n=100: SE even smaller, x̄ very close to μ
See Applet: https://istats.shinyapps.io/sampdist_cont/
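To see the shrinking standard error outside the applet, here is a minimal simulation sketch. It assumes an Exponential(1) population (mean 1, standard deviation 1) as a stand-in for "strongly right-skewed"; that population, the seed, and the function name `simulate_sample_means` are illustrative choices, not part of the lecture data.

```python
import random
import statistics

def simulate_sample_means(n, reps=5000, seed=17):
    """Draw `reps` samples of size n from a right-skewed population; return their means."""
    rng = random.Random(seed)
    # Assumed population: Exponential(1), which has mean 1 and standard deviation 1
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

for n in (5, 30, 100):
    means = simulate_sample_means(n)
    observed_se = statistics.stdev(means)   # spread of the simulated x-bar values
    theoretical_se = 1.0 / n ** 0.5         # sigma / sqrt(n) with sigma = 1
    print(f"n={n:3d}  mean of x-bar={statistics.fmean(means):.3f}  "
          f"SD of x-bar={observed_se:.3f}  sigma/sqrt(n)={theoretical_se:.3f}")
```

The simulated spread of x̄ should track σ/√n closely at each sample size, even though the individual values are far from normal.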
Chloe’s scenario: Battery life for all phones
Question: What is the probability that the sample mean is between 23.5 and 24.5 hours?
By CLT: \(\bar{x} \sim N\left(24, \frac{4^2}{100}\right) = N(24, 0.16)\)
So: \(\bar{x} \sim N(24, 0.4)\) where 0.4 is the standard error
Google Sheets:
=NORM.DIST(24.5, 24, 0.4, TRUE) - NORM.DIST(23.5, 24, 0.4, TRUE)
Answer: ≈ 0.7887 or about 79%
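The same probability can be checked by standardizing by hand. This is a sketch using only the numbers above; Φ denotes the standard normal cumulative distribution function, a symbol these notes do not otherwise use.

\[z_{\text{low}} = \frac{23.5 - 24}{0.4} = -1.25, \qquad z_{\text{high}} = \frac{24.5 - 24}{0.4} = 1.25\]

\[P(23.5 \le \bar{x} \le 24.5) = P(-1.25 \le Z \le 1.25) = 2\,\Phi(1.25) - 1 \approx 0.789\]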
Understanding the Central Limit Theorem
A coffee shop’s daily revenue has μ = $2400 and σ = $600. You track revenue for n = 36 randomly selected days.
Answer these questions:
Use Google Sheets for probability calculations!
Between what two values will the middle 95% of sample means fall?
General approach for sample means:
Given: Population with μ and σ, sample size n
Step 1: Check CLT conditions
Random sample?
n ≥ 30 or population normal?
Step 2: Find sampling distribution
Mean: μ
Standard error: σ/√n
Step 3: Calculate probability using normal distribution
For P(x̄ ≤ a):
=NORM.DIST(a, μ, σ/SQRT(n), TRUE)
For P(x̄ ≥ a):
=1 - NORM.DIST(a, μ, σ/SQRT(n), TRUE)
For P(a ≤ x̄ ≤ b):
=NORM.DIST(b, μ, σ/SQRT(n), TRUE) - NORM.DIST(a, μ, σ/SQRT(n), TRUE)
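For anyone who prefers a script to a spreadsheet, the three cases can be sketched in Python with the standard library's `statistics.NormalDist`. The helper name `prob_sample_mean` and its `lower`/`upper` parameters are made up for this illustration.

```python
from statistics import NormalDist

def prob_sample_mean(mu, sigma, n, lower=None, upper=None):
    """P(lower <= x-bar <= upper) under the CLT approximation; None means unbounded."""
    xbar_dist = NormalDist(mu, sigma / n ** 0.5)  # mean mu, standard error sigma/sqrt(n)
    p_upper = 1.0 if upper is None else xbar_dist.cdf(upper)
    p_lower = 0.0 if lower is None else xbar_dist.cdf(lower)
    return p_upper - p_lower

# Chloe's battery example: mu = 24, sigma = 4, n = 100, between 23.5 and 24.5
print(round(prob_sample_mean(24, 4, 100, lower=23.5, upper=24.5), 4))
# An upper-tail case, P(x-bar >= 25), mirroring =1 - NORM.DIST(...)
print(round(prob_sample_mean(24, 4, 100, lower=25), 4))
```

Leaving `lower` or `upper` unset gives the one-sided probabilities, so one function covers all three spreadsheet patterns.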
Example 1: Shipping Company Quality Control
The situation: A shipping company processes thousands of packages daily. Individual package weights vary considerably (σ = 10 lbs) around a mean of μ = 50 lbs.
The question: If a delivery truck can safely carry packages with an average weight up to 48 lbs, and we load n = 64 randomly selected packages, what’s the probability the truck is overloaded?
Solution:
Standard Error: SE = 10/√64 = 1.25 lbs
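We need: P(x̄ > 48), the chance the average exceeds the 48 lb limit
=1 - NORM.DIST(48, 50, 1.25, TRUE) ≈ 0.9452
Interpretation: There is only about a 5.5% chance that 64 random packages average less than 48 lbs, so with a population mean of 50 lbs the truck is overloaded about 94.5% of the time. The limit sits 1.6 standard errors below μ, which makes an overload the expected outcome. Notice how much more predictable the average is compared to individual packages.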
Example 2: Standardized Testing Assessment
The situation: A state education department knows that individual student math scores vary widely (σ = 12 points) around a mean of μ = 75.
The question: A school’s performance is evaluated based on the average score of n = 100 randomly selected students. What’s the probability the school’s average falls within the “acceptable” range of 73-77 points?
Solution:
Standard Error: SE = 12/√100 = 1.2 points
We need: P(73 ≤ x̄ ≤ 77)
=NORM.DIST(77,75,1.2,TRUE)-NORM.DIST(73,75,1.2,TRUE) ≈ 0.9044
Interpretation: About 90% of the time, a school’s average will fall in this range. With 100 students, individual score variability (σ = 12) becomes much smaller at the average level (SE = 1.2)!
What do both examples teach us?
| Feature | Package Weights | Test Scores | Key Insight |
|---|---|---|---|
| Individual Variability | σ = 10 lbs | σ = 12 points | High variability in individuals |
| Sample Size | n = 64 | n = 100 | Larger samples better |
| Standard Error | SE = 1.25 lbs | SE = 1.2 points | Much smaller than σ! |
| Predictability | 95% of truck loads: 47.6-52.4 lbs | 95% of schools: 72.6-77.4 points | Averages are stable! |
The big transition: Using samples to learn about populations
Three main types of inference:
Today’s focus: Confidence Intervals!
Definition: A range of values that likely contains the true parameter
Why we need them:
General form:
\[\text{Estimate} \pm \text{Margin of Error}\]
For a mean:
\[\bar{x} \pm \text{(critical value)} \times \text{SE}\]
Interpretation: We are [confidence level]% confident the true parameter falls in this interval
Confidence Level: How confident we are that interval contains true parameter
Margin of Error (ME): Half the width of the CI
The relationship:
Scenario: We know the population standard deviation σ
When to use:
Formula:
\[\bar{x} \pm z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}\]
Where:
Common z-values: 90% → 1.645, 95% → 1.96, 99% → 2.576
For confidence level C (like 0.95 for 95%):
Method 1: Direct calculation
For 95% confidence (α = 0.05):
=NORM.S.INV(1 - 0.05/2)
=NORM.S.INV(0.975) ≈ 1.96
Method 2: For any confidence level C
=NORM.S.INV(1 - (1-C)/2)
Example: 90% confidence
=NORM.S.INV(1 - (1-0.90)/2)
=NORM.S.INV(0.95) ≈ 1.645
Example: 99% confidence
=NORM.S.INV(1 - (1-0.99)/2)
=NORM.S.INV(0.995) ≈ 2.576
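The same critical values can be reproduced in Python. This is a small sketch in which `NormalDist().inv_cdf` plays the role of NORM.S.INV; the function name `z_critical` is just a label chosen here.

```python
from statistics import NormalDist

def z_critical(conf_level):
    """Two-sided critical value z_{alpha/2} for a confidence level such as 0.95."""
    alpha = 1 - conf_level
    return NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile

for c in (0.90, 0.95, 0.99):
    print(f"{c:.0%} confidence: z = {z_critical(c):.3f}")
```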
Chloe’s data:
Step 1: Check conditions
Random sample? ✓
n = 100 ≥ 30 ✓
Step 2: Calculate critical value
=NORM.S.INV(0.975) ≈ 1.96
Step 3: Calculate standard error
SE = 4/SQRT(100) = 0.4
Step 4: Calculate margin of error
ME = 1.96 × 0.4 = 0.784
Step 5: Construct interval
23.8 ± 0.784 = (23.016, 24.584)
Interpretation: We are 95% confident the true average battery life for all phones is between 23.0 and 24.6 hours.
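The five steps collapse into a few lines of code. The sketch below assumes σ is known and reuses the numbers from Chloe's example (x̄ = 23.8, σ = 4, n = 100); `z_interval` is a hypothetical helper name.

```python
from statistics import NormalDist

def z_interval(xbar, sigma, n, conf_level=0.95):
    """Confidence interval for a mean when the population sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)  # critical value
    se = sigma / n ** 0.5                               # standard error
    me = z * se                                         # margin of error
    return xbar - me, xbar + me

low, high = z_interval(23.8, 4, 100)
print(f"({low:.3f}, {high:.3f})")
```

Changing `conf_level` to 0.99 widens the interval, because the critical value grows while the standard error stays the same.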
Confidence Intervals with Known σ
A university measures commute times for students. Historical data shows σ = 15 minutes. A random sample of n = 225 students has x̄ = 42 minutes.
Calculate and interpret:
Use Google Sheets for all calculations!
CORRECT interpretation (95% CI: (23.0, 24.6)):
✓ “We are 95% confident that the true population mean battery life is between 23.0 and 24.6 hours.”
✓ “About 95% of intervals constructed with this method will contain the true mean.”
INCORRECT interpretations:
✗ “There is a 95% probability that μ is in (23.0, 24.6).” - μ is fixed, not random!
✗ “95% of phones have battery life between 23.0 and 24.6.” - That’s about individual phones, not the mean!
✗ “We are 95% confident that x̄ is between 23.0 and 24.6.” - We know x̄ = 23.8 exactly!
Remember: The interval either contains μ or it doesn’t. Our confidence is in the METHOD, not a specific interval!
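One way to see that the confidence belongs to the method is to simulate it. The sketch below pretends we know the truth (μ = 24, σ = 4, a normal population), which we never do in practice; those values, the seed, and the name `coverage` are assumptions made purely for illustration.

```python
import random
from statistics import NormalDist, fmean

def coverage(mu=24.0, sigma=4.0, n=100, conf_level=0.95, reps=2000, seed=17):
    """Fraction of simulated z-intervals that contain the (assumed) true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    me = z * sigma / n ** 0.5               # same margin of error for every sample
    hits = 0
    for _ in range(reps):
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))  # one new sample mean
        if xbar - me <= mu <= xbar + me:    # did this interval capture mu?
            hits += 1
    return hits / reps

print(f"Share of intervals containing mu: {coverage():.3f}")
```

Each individual interval either captures μ or misses it; only the long-run share of captures is close to 95%.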
Key relationship:
\[ME = z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}\]
What affects margin of error:
The trade-off:
Example: ME = 0.784 with n = 100
Question: How large should n be for a desired margin of error?
Formula:
\[n = \left(\frac{z_{\alpha/2} \times \sigma}{ME}\right)^2\]
Example: Chloe wants ME = 0.5 hours with 95% confidence, σ = 4
=ROUNDUP((1.96 * 4 / 0.5)^2, 0)
=ROUNDUP(245.86, 0) = 246
She needs n = 246 phones for ME of 0.5 hours
Google Sheets template:
Cell A1: Desired ME | Cell B1: 0.5
Cell A2: Confidence Level | Cell B2: 0.95
Cell A3: Pop Std Dev (σ) | Cell B3: 4
Cell A4: Critical Value | Cell B4: =NORM.S.INV(1-(1-B2)/2)
Cell A5: Required n | Cell B5: =ROUNDUP((B4*B3/B1)^2, 0)
Always round UP to ensure ME doesn’t exceed target!
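The template translates directly to Python. In this sketch `math.ceil` does the job of ROUNDUP, and `sample_size_for_mean` is an illustrative name; note it uses the unrounded critical value rather than 1.96, which here still leads to the same n.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(me, sigma, conf_level=0.95):
    """Smallest n whose margin of error does not exceed `me` (sigma known)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return math.ceil((z * sigma / me) ** 2)  # always round up

print(sample_size_for_mean(0.5, 4))   # Chloe's target: ME = 0.5 hours
print(sample_size_for_mean(0.25, 4))  # halving the ME roughly quadruples n
```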
Reality check: We usually don’t know σ!
When σ is unknown:
\[\bar{x} \pm t_{\alpha/2, df} \times \frac{s}{\sqrt{n}}\]
Where:
When to use t instead of z:
Properties:
Degrees of freedom (df = n - 1):
Why heavier tails?
See Applet: https://istats.shinyapps.io/tdist/
For confidence level C with df = n - 1:
=T.INV.2T(alpha, df)
Where alpha = 1 - C
Example: 95% CI with n = 20 (df = 19)
=T.INV.2T(0.05, 19) ≈ 2.093
Compare to z:
=NORM.S.INV(0.975) ≈ 1.96
Notice: t-value is larger (wider interval for small sample)!
As sample size increases:
n = 10, df = 9: t = 2.262
n = 30, df = 29: t = 2.045
n = 100, df = 99: t = 1.984
z = 1.96
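Python's standard library has no t-distribution, so the sketch below rebuilds the critical value numerically: it integrates the t density with Simpson's rule and then searches for the cutoff by bisection. This is only an illustration of what T.INV.2T computes, not how Google Sheets implements it, and the function names are invented here.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x): Simpson's rule on [0, |x|], then symmetry around 0."""
    b = abs(x)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(conf_level, df):
    """Two-sided critical value, found by bisection on the CDF."""
    target = 1 - (1 - conf_level) / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for n in (10, 30, 100):
    print(f"n = {n}, df = {n - 1}: t = {t_critical(0.95, n - 1):.3f}")
```

The printed values should fall toward 1.96 as df grows, matching the table above.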
Chloe tests a new production method:
Step 1: Check conditions
Random sample? ✓
n = 25 < 30, but assume battery life roughly normal ✓
Step 2: Calculate critical value
df = 25 - 1 = 24
=T.INV.2T(0.05, 24) ≈ 2.064
Step 3: Calculate standard error
SE = 3.5/SQRT(25) = 0.7
Step 4: Calculate margin of error
ME = 2.064 × 0.7 = 1.445
Step 5: Construct interval
24.5 ± 1.445 = (23.055, 25.945)
Interpretation: We are 95% confident the true mean battery life with the new method is between 23.1 and 25.9 hours.
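Steps 3 to 5 can be sketched as a short function. To keep it dependency-free, the critical value is passed in as an argument (here the 2.064 obtained from T.INV.2T in Step 2) instead of being computed; `t_interval` is a hypothetical name.

```python
def t_interval(xbar, s, n, t_crit):
    """Confidence interval for a mean using the sample standard deviation s."""
    se = s / n ** 0.5      # standard error based on s, not sigma
    me = t_crit * se       # margin of error
    return xbar - me, xbar + me

# Chloe's new production method: x-bar = 24.5, s = 3.5, n = 25, df = 24
low, high = t_interval(xbar=24.5, s=3.5, n=25, t_crit=2.064)
print(f"({low:.3f}, {high:.3f})")
```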
| Scenario | Use | Why |
|---|---|---|
| σ known (normal pop or n ≥ 30) | z | Know population std dev |
| σ unknown, n ≥ 30 | t (or z) | CLT applies, t ≈ z |
| σ unknown, n < 30, normal pop | t | Account for extra uncertainty |
| σ unknown, n < 30, non-normal | Neither | Need non-parametric methods |
In practice:
Example: If σ = 4 is historical but you calculate s = 3.5 from your sample, use t with s = 3.5!
Confidence Intervals with Unknown σ (t-distribution)
A nutritionist samples n = 16 students and measures daily protein intake (grams):
x̄ = 68, s = 12
Calculate and interpret:
Use Google Sheets! Post on Ed Discussion with partner’s name!
New scenario: Estimating a population proportion p
Examples:
Sample proportion:
\[\hat{p} = \frac{x}{n}\]
where x = number of successes, n = sample size
Standard error for proportion:
\[SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
Confidence interval for proportion:
\[\hat{p} \pm z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
Conditions (check these!):
Why use z, not t?
Chloe surveys customers:
Step 1: Check conditions
Random sample? ✓
np̂ = 200(0.78) = 156 ≥ 10 ✓
n(1-p̂) = 200(0.22) = 44 ≥ 10 ✓
Step 2: Calculate
SE = SQRT(0.78*0.22/200) = 0.0293
z = 1.96
ME = 1.96 × 0.0293 = 0.0574
CI: 0.78 ± 0.0574 = (0.723, 0.837)
Interpretation: We are 95% confident that between 72.3% and 83.7% of ALL customers are satisfied.
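Here is a minimal Python sketch of the same calculation, with the success/failure check built in. It takes the 156 satisfied customers out of 200 from Step 1; the function name `proportion_interval` is an assumption for illustration.

```python
from statistics import NormalDist

def proportion_interval(successes, n, conf_level=0.95):
    """z-interval for a population proportion, with the >= 10 condition enforced."""
    p_hat = successes / n
    if min(n * p_hat, n * (1 - p_hat)) < 10:
        raise ValueError("Need at least 10 successes and 10 failures")
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    se = (p_hat * (1 - p_hat) / n) ** 0.5   # standard error of p-hat
    me = z * se
    return p_hat - me, p_hat + me

low, high = proportion_interval(156, 200)
print(f"({low:.3f}, {high:.3f})")
```

Raising an error when the condition fails is one possible design choice; it keeps the function from silently returning an interval the normal approximation does not support.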
Question: How large should n be for desired ME?
Formula:
\[n = \left(\frac{z_{\alpha/2}}{ME}\right)^2 \times \hat{p}(1-\hat{p})\]
Problem: We need p̂ to calculate n, but we need n to get p̂!
Solutions:
Example: Want ME = 0.03 with 95% confidence, use p̂ = 0.5
=ROUNDUP((1.96/0.03)^2 * 0.5 * 0.5, 0)
=ROUNDUP(1067.11, 0) = 1068
Need n = 1068 for ME of 3% with no prior estimate
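The planning formula can be sketched in Python as well. The default `p_guess=0.5` is the conservative choice from the example; the function name and the second call (using 0.78 as a prior estimate, borrowed from the satisfaction survey) are illustrative assumptions.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(me, p_guess=0.5, conf_level=0.95):
    """Smallest n giving a margin of error of at most `me` for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return math.ceil((z / me) ** 2 * p_guess * (1 - p_guess))  # round up

print(sample_size_for_proportion(0.03))                # no prior estimate
print(sample_size_for_proportion(0.03, p_guess=0.78))  # with a prior estimate
```

Because p̂(1-p̂) is largest at 0.5, any other guess asks for a smaller sample, which is why 0.5 is the safe default.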
Mean with σ known (or large n) \[\bar{x} \pm z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}\]
Mean with σ unknown (use s) \[\bar{x} \pm t_{\alpha/2, df} \times \frac{s}{\sqrt{n}}\]
Proportion \[\hat{p} \pm z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
Always check conditions before constructing CIs!
Mistake 1: Wrong interpretation - CI is about the parameter (μ or p), not the statistic or individuals
Mistake 2: Using z when you should use t - If σ is unknown, use t (especially for small samples)
Mistake 3: Forgetting to check conditions - Random sampling, sample size requirements, normality
Mistake 4: Confusing confidence level and probability - “95% confident” ≠ “95% probability μ is in the interval”
Mistake 5: Not rounding sample size correctly - Always round UP to ensure ME doesn’t exceed target
Mistake 6: Using wrong formula for proportions - Don’t use s, use √[p̂(1-p̂)/n]
Chloe can now:
✅ Use CLT to understand sampling variability of x̄
✅ Construct 95% CI: (23.0, 24.6 hours) for battery life
✅ Calculate required sample size for desired precision
✅ Make confidence statements: “95% confident true mean is in interval”
✅ Handle both known and unknown σ scenarios
✅ Build CIs for satisfaction rates (proportions)
This is the foundation of statistical inference!
Next time: Hypothesis testing - testing claims about parameters!
Rate your confidence (1-5) on Ed Discussion:
If you rated anything 3 or below, visit office hours!
Questions? I have office hours right after class today!
Next up: Hypothesis Testing - testing claims about populations
Remember:
STAT 17 – Fall 2025