Statistics - UCSC
02 Dec 2025
The Next Challenge
Last time, we found:
- Strong correlation (r = 0.85) between advertising and revenue
- Regression equation: \(\widehat{\text{Revenue}} = 12.45 + 0.78 \times \text{Advertising}\)
But the investors ask:
Today: Move from description to inference - test hypotheses and quantify uncertainty!
By the end of today’s lecture, you will be able to:
These describe our sample data
These are the true values we want to know
Important
Key Question: Can we use our sample statistics to make inferences about population parameters?
If we took many different samples and calculated \(b_1\) each time:
Key insight: \(b_1\) varies from sample to sample, but centers around the true \(\beta_1\)!
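The idea of repeated sampling can be sketched with a small simulation. This is a minimal illustration, not part of the lecture's data: it assumes a hypothetical population with \(\beta_0 = 12.45\), \(\beta_1 = 0.78\), normal errors with standard deviation 9.8, and advertising values drawn uniformly between 10 and 100 (all in thousands of dollars).

```python
import random
import statistics

def fit_slope(x, y):
    """Least-squares slope: b1 = S_xy / S_xx."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

def simulate_slopes(beta0, beta1, sigma, n, n_samples, seed=0):
    """Draw n_samples samples of size n and return the fitted slope of each."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        # Hypothetical x distribution: uniform between 10 and 100
        x = [rng.uniform(10, 100) for _ in range(n)]
        y = [beta0 + beta1 * xi + rng.gauss(0, sigma) for xi in x]
        slopes.append(fit_slope(x, y))
    return slopes

# Assumed population values (hypothetical, chosen to resemble TechStart)
slopes = simulate_slopes(beta0=12.45, beta1=0.78, sigma=9.8, n=25, n_samples=2000)
print(f"mean of b1 across samples: {statistics.fmean(slopes):.3f}")
print(f"std dev of b1 across samples: {statistics.stdev(slopes):.3f}")
```

The individual slopes differ from sample to sample, while their average should land close to the assumed true slope.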
Before doing inference, check these conditions:
Warning
Mnemonic: LINE (Linearity, Independence, Normality, Equal variance)
What to look for:
Which condition is MOST important to check before regression inference?
A. The residuals are perfectly normal
B. The relationship is linear
C. Every single data point is independent
D. R² is above 0.70
Standard Test Setup
Hypotheses:
\[H_0: \beta_1 = 0\]
(No linear relationship in population)
\[H_a: \beta_1 \neq 0\]
(Linear relationship exists)
Test Statistic: \[t = \frac{b_1 - 0}{SE_{b_1}}\]
where \(SE_{b_1}\) is the standard error of the slope
Degrees of freedom: \(df = n - 2\)
For TechStart: With n = 25 startups, df = 23
Standard Error (\(SE_{b_1}\)): Measures variability of slope estimates across samples
\[SE_{b_1} = \frac{s_e}{s_x \sqrt{n-1}}\]
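where:
- \(s_e\) is the standard error of the regression (typical size of the residuals)
- \(s_x\) is the sample standard deviation of X
- \(n\) is the sample size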
What affects \(SE_{b_1}\)?
Decreases when:
- Residuals are smaller (\(s_e\) ↓)
- X has more spread (\(s_x\) ↑)
- Sample size is larger (\(n\) ↑)
Result: More precise slope estimate!
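A short sketch of how each factor moves \(SE_{b_1}\), using the formula above. The values \(s_e = 9.8\) and \(n = 25\) come from the TechStart example; \(s_x = 16.67\) is an assumption, chosen because it is roughly what the formula implies when \(SE_{b_1} = 0.12\). The lecture does not state \(s_x\).

```python
import math

def slope_se(s_e, s_x, n):
    """Standard error of the slope: s_e / (s_x * sqrt(n - 1))."""
    return s_e / (s_x * math.sqrt(n - 1))

# Baseline: s_e and n from the lecture, s_x assumed (not given in the lecture)
baseline = slope_se(s_e=9.8, s_x=16.67, n=25)

# Change one factor at a time
smaller_residuals = slope_se(s_e=4.9, s_x=16.67, n=25)   # s_e halved
more_spread_in_x = slope_se(s_e=9.8, s_x=33.34, n=25)    # s_x doubled
larger_sample = slope_se(s_e=9.8, s_x=16.67, n=97)       # n - 1 quadrupled

for label, value in [
    ("baseline", baseline),
    ("smaller residuals", smaller_residuals),
    ("more spread in x", more_spread_in_x),
    ("larger sample", larger_sample),
]:
    print(f"{label}: SE_b1 = {value:.3f}")
```

Halving \(s_e\), doubling \(s_x\), or quadrupling \(n - 1\) each cut the standard error in half.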
Given regression output from Google Sheets:
Calculate t-statistic:
\[t = \frac{b_1 - 0}{SE_{b_1}} = \frac{0.78 - 0}{0.12} = 6.50\]
Find p-value: With df = 23, using t-table or Sheets: p-value < 0.001
Decision: Reject \(H_0\) at α = 0.05
Conclusion: There is strong evidence of a significant positive linear relationship between advertising and revenue in the population (t = 6.50, p < 0.001).
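For readers who want to check the numbers outside Sheets, here is a minimal sketch using only the Python standard library. It integrates the t density numerically (Simpson's rule) rather than calling a statistics package; the function names are illustrative, not from the lecture.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p_value(t_stat, df, steps=20000):
    """Two-sided p-value: 1 minus the area between -|t| and |t|.

    The area from 0 to |t| is approximated with Simpson's rule
    (steps must be even) and doubled by symmetry.
    """
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (total * h / 3)
    return max(0.0, 1 - central_area)

# TechStart values from the regression output
b1, se_b1, n = 0.78, 0.12, 25
df = n - 2
t_stat = (b1 - 0) / se_b1
p_value = two_sided_p_value(t_stat, df)
reject_h0 = p_value < 0.05

print(f"t = {t_stat:.2f}, df = {df}, p-value = {p_value:.2e}, reject H0: {reject_h0}")
```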
P-value: Probability of getting a sample slope at least as extreme as ours if the true slope is zero
TechStart: p < 0.001
→ Statistical evidence!
Example: p = 0.23
→ Not significant
Warning
Statistical significance ≠ Practical significance!
Regression output shows: \(b_1 = 0.45\), \(SE_{b_1} = 0.30\), p-value = 0.15
What should we conclude at α = 0.05?
A. The relationship is statistically significant
B. The relationship is not statistically significant
C. We need more data to decide
D. The slope is definitely zero
Formula
\[b_1 \pm t^* \times SE_{b_1}\]
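where:
- \(b_1\) is the sample slope
- \(t^*\) is the critical value from the t-distribution with \(df = n - 2\)
- \(SE_{b_1}\) is the standard error of the slope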
For TechStart (95% CI with df = 23):
\[0.78 \pm 2.069(0.12) = 0.78 \pm 0.25 = (0.53, 1.03)\]
Interpretation: We are 95% confident that each additional $1,000 in advertising is associated with an average revenue increase of between $530 and $1,030 in the population.
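The interval arithmetic can be reproduced in a few lines. A minimal sketch using the TechStart values; the function name is illustrative.

```python
def slope_confidence_interval(b1, se_b1, t_star):
    """Return (lower, upper) for b1 +/- t* x SE_b1."""
    margin = t_star * se_b1
    return b1 - margin, b1 + margin

# t* = 2.069 is the 95% critical value for df = 23
lower, upper = slope_confidence_interval(b1=0.78, se_b1=0.12, t_star=2.069)
excludes_zero = not (lower <= 0 <= upper)

print(f"95% CI for the slope: ({lower:.2f}, {upper:.2f})")
print(f"Interval excludes zero: {excludes_zero}")
```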
Important Connection
For a two-tailed test at significance level α:
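If the \((1-\alpha) \times 100\%\) confidence interval for \(\beta_1\):
- Does not contain 0: reject \(H_0\)
- Contains 0: fail to reject \(H_0\)
TechStart example: The 95% CI (0.53, 1.03) does not contain 0, so we reject \(H_0\) at α = 0.05, which matches the test result (p < 0.001).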
This provides more information than hypothesis test alone!
10 Minute Break
We’ll return to discuss standard error, confidence intervals for predictions, and Google Sheets implementation!
Measures typical size of prediction errors (residuals)
\[s_e = \sqrt{\frac{\sum(y_i - \widehat{y}_i)^2}{n-2}} = \sqrt{\frac{\text{Sum of Squared Residuals}}{df}}\]
For TechStart
\(s_e = 9.8\) thousand dollars
Interpretation: Predictions are typically off by about $9,800 in either direction
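To see the \(s_e\) formula in action, here is a minimal sketch on a tiny made-up data set (five hypothetical points, not the TechStart data). It fits the least-squares line, then takes the square root of the sum of squared residuals divided by \(n - 2\).

```python
import math

def residual_standard_error(x, y):
    """s_e = sqrt(sum of squared residuals / (n - 2)) for a least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(ssr / (n - 2))

# Hypothetical toy data for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
s_e = residual_standard_error(x, y)
print(f"s_e = {s_e:.3f}")
```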
Also called:
Question: What’s the average y for a given x?
Example: What’s the average revenue for ALL startups spending $50k on ads?
Formula: \[\widehat{y} \pm t^* \times SE_{\text{mean}}\]
Narrower interval
(more precise)
Question: What y will we observe for a new case?
Example: What revenue will THIS specific startup with $50k ad spending achieve?
Formula: \[\widehat{y} \pm t^* \times SE_{\text{pred}}\]
Wider interval
(more uncertainty)
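The lecture does not spell out \(SE_{\text{mean}}\) and \(SE_{\text{pred}}\). The sketch below uses the standard textbook forms for simple linear regression, which differ only by an extra "1 +" under the square root for the new observation. The values \(\bar{x} = 50\) and \(s_x = 16.67\) are assumptions for illustration; \(s_e = 9.8\) and \(n = 25\) are from the TechStart example.

```python
import math

def interval_standard_errors(s_e, n, s_x, x_bar, x0):
    """Return (SE_mean, SE_pred) at x0 for simple linear regression.

    SE_mean = s_e * sqrt(1/n + (x0 - x_bar)^2 / S_xx)
    SE_pred = s_e * sqrt(1 + 1/n + (x0 - x_bar)^2 / S_xx)
    where S_xx = (n - 1) * s_x^2.
    """
    s_xx = (n - 1) * s_x ** 2
    shared = 1 / n + (x0 - x_bar) ** 2 / s_xx
    se_mean = s_e * math.sqrt(shared)
    se_pred = s_e * math.sqrt(1 + shared)
    return se_mean, se_pred

# x_bar and s_x are assumed values, not stated in the lecture
se_mean_center, se_pred_center = interval_standard_errors(9.8, 25, 16.67, 50.0, x0=50.0)
se_mean_far, se_pred_far = interval_standard_errors(9.8, 25, 16.67, 50.0, x0=90.0)

print(f"x0 = 50: SE_mean = {se_mean_center:.2f}, SE_pred = {se_pred_center:.2f}")
print(f"x0 = 90: SE_mean = {se_mean_far:.2f}, SE_pred = {se_pred_far:.2f}")
```

Under these assumptions, \(SE_{\text{pred}}\) is always larger than \(SE_{\text{mean}}\), and both grow as \(x_0\) moves away from \(\bar{x}\).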
Predict revenue for a startup spending $50,000 on advertising:
Point estimate: \[\widehat{\text{Revenue}} = 12.45 + 0.78(50) = 51.45 \text{ thousand}\]
95% Prediction Interval: (30.8, 72.1) thousand
Interpretation: We are 95% confident this specific startup will generate between $30,800 and $72,100 in revenue.
Note: Wide interval reflects uncertainty in predicting individual cases!
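A quick consistency check on these numbers: the sketch below recomputes the point estimate and backs out the \(SE_{\text{pred}}\) implied by the reported interval, assuming the interval was built with \(t^* = 2.069\) (df = 23). It uses only values quoted in the lecture.

```python
# Regression equation and reported 95% prediction interval (thousands of dollars)
b0, b1 = 12.45, 0.78
ad_spend = 50
pi_lower, pi_upper = 30.8, 72.1
t_star = 2.069  # assumed critical value, df = 23
s_e = 9.8       # standard error of the regression

y_hat = b0 + b1 * ad_spend
half_width = (pi_upper - pi_lower) / 2
implied_se_pred = half_width / t_star

print(f"point estimate: {y_hat:.2f}")
print(f"interval midpoint: {(pi_lower + pi_upper) / 2:.2f}")
print(f"implied SE_pred: {implied_se_pred:.2f} (compare s_e = {s_e})")
```

The implied \(SE_{\text{pred}}\) comes out slightly above \(s_e\), which is what we expect: most of the uncertainty in an individual prediction comes from individual variation around the line.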
Warning
Key Business Insight: Even with strong correlation (r = 0.85), individual predictions have substantial uncertainty. Don’t over-rely on point estimates!
Why is a prediction interval wider than a confidence interval at the same x value?
A. Because we use a larger t* value
B. Because it accounts for both estimation uncertainty AND individual variation
C. Because the sample size is smaller
D. Because predictions are less accurate
Example output for TechStart:
| Coefficient | Estimate | Std Error | t-stat | p-value |
|---|---|---|---|---|
| Intercept | 12.45 | 3.82 | 3.26 | 0.004 |
| Advertising | 0.78 | 0.12 | 6.50 | < 0.001 |
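Each t-stat in the table is just Estimate divided by Std Error. A minimal sketch that recomputes both rows from the table values:

```python
# (estimate, standard error, reported t-stat) copied from the output table
rows = {
    "Intercept": (12.45, 3.82, 3.26),
    "Advertising": (0.78, 0.12, 6.50),
}

recomputed = {}
for name, (estimate, std_error, reported_t) in rows.items():
    recomputed[name] = estimate / std_error
    print(f"{name}: t = {recomputed[name]:.2f} (reported {reported_t:.2f})")
```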
Additional statistics:
What this tells us:
TechStart Ventures: Full Analysis
Research Question: Is advertising spending significantly related to revenue?
1. Check Conditions:
- ✓ Linearity: Scatterplot shows linear pattern
- ✓ Independence: Random sample of startups
- ✓ Normality: Residuals approximately normal (n = 25)
- ✓ Equal variance: No fan shape in residual plot

2. Hypothesis Test:
- \(H_0: \beta_1 = 0\) vs \(H_a: \beta_1 \neq 0\)
- Test statistic: t = 6.50, df = 23
- p-value < 0.001
- Conclusion: Statistical evidence of significant relationship
3. Confidence Interval:
- 95% CI for \(\beta_1\): (0.53, 1.03)
- Interpretation: Each $1,000 increase in advertising is associated with an average revenue increase of $530 to $1,030
4. Prediction:
- For $50k advertising: \(\widehat{y}\) = 51.45k
- 95% PI: (30.8, 72.1)k
Regression output shows p-value = 0.001 for slope. What does this mean?
A. There’s a 0.1% chance the null hypothesis is true
B. If there’s no relationship in the population, there’s a 0.1% chance of getting our result or more extreme
C. The slope is definitely not zero
D. 99.9% of the variance is explained
Important Distinction
Statistically Significant (p < α)
- Means: Effect likely exists in population (not just chance)
- Determined by: Sample size, effect size, variability
Practically Significant
- Means: Effect is large enough to matter in practice
- Determined by: Context, costs, benefits, business impact
Example scenarios:
| Scenario | Statistical | Practical | What to do? |
|---|---|---|---|
| Huge sample, tiny slope | Significant | Not significant | Don’t implement |
| Small sample, large slope | Not significant | Would be significant | Collect more data |
| Large sample, moderate slope | Significant | Significant | Implement! |
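The first scenario can be sketched numerically: with the same tiny slope, the t-statistic grows with sample size because \(SE_{b_1}\) shrinks like \(1/\sqrt{n-1}\). All numbers here are hypothetical; \(s_e = 9.8\) is borrowed from TechStart, while the slope of 0.01 and \(s_x = 16.67\) are assumed for illustration.

```python
import math

def slope_t_statistic(b1, s_e, s_x, n):
    """t = b1 / SE_b1, with SE_b1 = s_e / (s_x * sqrt(n - 1))."""
    se_b1 = s_e / (s_x * math.sqrt(n - 1))
    return b1 / se_b1

# Hypothetical tiny effect: $10 more revenue per extra $1,000 of advertising
tiny_slope = 0.01
t_small_n = slope_t_statistic(tiny_slope, s_e=9.8, s_x=16.67, n=25)
t_huge_n = slope_t_statistic(tiny_slope, s_e=9.8, s_x=16.67, n=1_000_001)

print(f"n = 25:        t = {t_small_n:.2f}")
print(f"n = 1,000,001: t = {t_huge_n:.2f}")
```

The slope is identical in both cases, so its practical importance has not changed; only the statistical evidence has.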
Conditions (LINE): Must check before inference
Hypothesis test for slope: Tests if \(\beta_1 \neq 0\)
Confidence interval for \(\beta_1\): Range of plausible values
Standard error (\(s_e\)): Typical prediction error
Prediction interval: Accounts for individual variation
Next Time: Multiple Regression
The Power of Multiple Regression
Most real business problems involve multiple factors. Next time we’ll see how to model them simultaneously!
What questions do you have about:
See you next time for:
Multiple Regression
Office hours: I’m available now if you have any questions
STAT 17 – Fall 2025