Statistics - UCSC
04 Dec 2025
Lecture 1: Simple regression with one predictor
Lecture 2: Statistical inference
Can we do better by including multiple predictors?
New Available Data
Beyond advertising spending, we now have store size, number of competitors, and income for each store.
Can these variables improve our sales predictions?
From one predictor to many
The real world is multivariate!
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \epsilon\]
\[\widehat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k\]
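Where \(Y\) is the outcome, \(X_1, \dots, X_k\) are the predictors, \(\beta_0\) is the intercept, \(\beta_1, \dots, \beta_k\) are the slopes, \(\epsilon\) is the error term, and \(b_0, b_1, \dots, b_k\) are the sample estimates of the \(\beta\)'s.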
Critical Interpretation Change
Simple regression: “For each unit increase in X, Y changes by \(b_1\)”
Multiple regression: “For each unit increase in \(X_i\), holding all other variables constant, Y changes by \(b_i\)”
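This is also called the partial effect of \(X_i\): the association between \(X_i\) and Y after controlling for (adjusting for) the other predictors.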
In a regression predicting salary from years of experience and education level, the coefficient for experience is $2,500. This means:
A. Each year of experience adds $2,500 to salary
B. Each year of experience adds $2,500, holding education constant
C. Experience is more important than education
D. Both A and B are correct
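Outcome (Y): Monthly Sales Revenue (thousands of dollars)
Predictors (X’s): \(X_1\) = Advertising (thousands of dollars), \(X_2\) = Store size (thousands of sq ft), \(X_3\) = Number of competitors, \(X_4\) = Income (thousands of dollars)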
\[\widehat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4\]
Let’s do this together!
| Sales | Advertising | Size | Competitors | Income |
|---|---|---|---|---|
| 85.2 | 10.5 | 25 | 3 | 65 |
| 92.1 | 12.3 | 30 | 2 | 72 |
| … | … | … | … | … |
| Variable | Coefficient | Std Error | t-Stat | p-value |
|---|---|---|---|---|
| Intercept | 15.3 | 8.2 | 1.87 | 0.068 |
| Advertising | 3.2 | 0.45 | 7.11 | <0.001 |
| Size | 1.8 | 0.32 | 5.63 | <0.001 |
| Competitors | -4.2 | 1.1 | -3.82 | <0.001 |
| Income | 0.35 | 0.18 | 1.94 | 0.058 |
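Each t-statistic in the table is the coefficient divided by its standard error. The short Python sketch below (Python rather than the Stata used in class, and purely a check on the arithmetic) recomputes that column from the Coefficient and Std Error columns. The dictionary name `estimates` is just a label chosen for this illustration.

```python
# Coefficients and standard errors copied from the regression table above.
estimates = {
    "Intercept":   (15.3, 8.2),
    "Advertising": (3.2, 0.45),
    "Size":        (1.8, 0.32),
    "Competitors": (-4.2, 1.1),
    "Income":      (0.35, 0.18),
}

# t-statistic = coefficient / standard error
t_stats = {name: coef / se for name, (coef, se) in estimates.items()}

for name, t in t_stats.items():
    print(f"{name:<12} t = {t:6.2f}")
```

A coefficient that is large relative to its standard error gives a large |t| and a small p-value; Income, with t just under 2, is the borderline case.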
What does each coefficient mean?
“Holding store size, competitors, and income constant, each additional $1,000 in advertising is associated with $3,200 more in sales.”
“Holding advertising, competitors, and income constant, each additional 1,000 sq ft is associated with $1,800 more in sales.”
“Holding other variables constant, each additional competitor is associated with $4,200 less in sales.”
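To see what "holding the other variables constant" means in practice, here is a small Python sketch (not the Stata used in class). It takes the first store from the data table as a baseline, changes one predictor by one unit while leaving the rest untouched, and compares the two predictions. The function name `predict_sales` and the short keys are made up for this illustration.

```python
# Estimated coefficients from the regression table (sales in thousands of dollars).
INTERCEPT = 15.3
coef = {"Ad": 3.2, "Size": 1.8, "Comp": -4.2, "Income": 0.35}

def predict_sales(store):
    """Predicted monthly sales (thousands of dollars) for one store."""
    return INTERCEPT + sum(coef[k] * store[k] for k in coef)

# Baseline: first row of the data table.
base = {"Ad": 10.5, "Size": 25, "Comp": 3, "Income": 65}

# Change ONE predictor by one unit; every other value stays the same.
more_ads = dict(base, Ad=base["Ad"] + 1)
more_comp = dict(base, Comp=base["Comp"] + 1)

ad_effect = predict_sales(more_ads) - predict_sales(base)
comp_effect = predict_sales(more_comp) - predict_sales(base)

print(f"One more $1,000 of advertising: {ad_effect:+.1f} thousand")
print(f"One more competitor:            {comp_effect:+.1f} thousand")
```

The difference in predictions equals the coefficient of the variable that changed, whatever baseline store is used.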
If all coefficients are significant except Income (p = 0.058), what should we conclude about Income?
A. Remove it immediately
B. It has no statistically significant association with sales
C. It’s marginally significant; consider context
D. The model is invalid
\[\widehat{\text{Sales}} = 15.3 + 3.2(\text{Ad}) + 1.8(\text{Size}) - 4.2(\text{Comp}) + 0.35(\text{Income})\]
Store with: Advertising = $12K, Size = 28K sq ft, Competitors = 3, Income = $68K
\[\widehat{Y} = 15.3 + 3.2(12) + 1.8(28) - 4.2(3) + 0.35(68) = 115.3\]
Predicted Sales: $115,300
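The same prediction can be checked term by term. This is a minimal Python sketch (not Stata) that multiplies each coefficient by the store's value and adds the intercept; the variable names are chosen only for this illustration.

```python
# Coefficients from the fitted equation (sales in thousands of dollars).
intercept = 15.3
coefs = {"Ad": 3.2, "Size": 1.8, "Comp": -4.2, "Income": 0.35}

# The store described above: $12K advertising, 28K sq ft, 3 competitors, $68K income.
store = {"Ad": 12, "Size": 28, "Comp": 3, "Income": 68}

# Contribution of each predictor = coefficient * value.
contributions = {name: coefs[name] * store[name] for name in coefs}

predicted = intercept + sum(contributions.values())   # thousands of dollars
predicted_dollars = predicted * 1000

for name, value in contributions.items():
    print(f"{name:<7}{value:8.1f}")
print(f"Predicted sales: {predicted:.1f} thousand")
```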
When we return: Model assessment, variable selection, complete example
Is our multiple regression model good?
A variable is not significant in multiple regression but was in simple regression. This suggests:
A. Data error
B. Other variables explain its effect
C. Multiple regression is wrong
D. Remove all variables
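One way to build intuition for this question is a toy example. The Python sketch below uses a tiny made-up dataset (not course data) in which `y` depends only on `x2`, while `x1` merely moves together with `x2`. It fits the simple regression of `y` on `x1`, then the two-predictor regression using the standard least-squares formulas for two predictors. It does not compute p-values; it only shows how the coefficient on `x1` changes once `x2` is in the model.

```python
def centered_cross(u, v):
    """Sum of (u - mean(u)) * (v - mean(v))."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

# Hypothetical data: y is driven entirely by x2; x1 just tracks x2.
x2 = [1, 2, 3, 4, 5, 6]
x1 = [2, 1, 2, 5, 6, 5]
y = [2 * v for v in x2]

s11 = centered_cross(x1, x1)
s22 = centered_cross(x2, x2)
s12 = centered_cross(x1, x2)
s1y = centered_cross(x1, y)
s2y = centered_cross(x2, y)

# Simple regression of y on x1 alone.
simple_slope_x1 = s1y / s11

# Multiple regression of y on x1 and x2 (two-predictor least-squares solution).
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

print(f"Simple regression slope on x1:   {simple_slope_x1:.3f}")
print(f"Multiple regression slope on x1: {b1:.3f}")
print(f"Multiple regression slope on x2: {b2:.3f}")
```

On its own, `x1` looks strongly related to `y`. Once `x2` is included, the coefficient on `x1` drops to zero because `x2` accounts for everything `x1` appeared to explain.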
Variables:
price: House price (thousands $)
sqft: Square footage
bedrooms: Number of bedrooms
age: House age (years)
What this does: The Stata command regress price sqft bedrooms age runs a multiple regression with price as the dependent variable and sqft, bedrooms, and age as independent variables.
Source   |        SS     df         MS      Number of obs =     150
---------+--------------------------------  F(3, 146)     =   45.23
Model    |   8234.56      3    2744.85      Prob > F      =  0.0000
Residual |   8856.32    146      60.66      R-squared     =  0.4820
---------+--------------------------------  Adj R-squared =  0.4713
Total    |  17090.88    149     114.70      Root MSE      =  7.7888
-----------------------------------------------------------------------------
price    |    Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
---------+-------------------------------------------------------------------
sqft     |   0.1245     0.0156     7.98    0.000     0.0937      0.1553
bedrooms |   15.234      3.456     4.41    0.000      8.412      22.056
age      |  -0.8234     0.2134    -3.86    0.000     -1.245     -0.4018
_cons    |   45.678      8.234     5.55    0.000     29.412      61.944
-----------------------------------------------------------------------------
sqft (0.1245): Each additional square foot is associated with a $124.50 higher price, holding other variables constant (price is measured in thousands of dollars, so 0.1245 thousand = $124.50).
bedrooms (15.234): Each additional bedroom is associated with a $15,234 higher price, holding other variables constant.
age (-0.8234): Each additional year of age is associated with a $823.40 lower price, holding other variables constant.
_cons (45.678): The predicted price ($45,678) for a house with 0 sqft, 0 bedrooms, and age 0 (not meaningful in practice).
R-squared = 0.4820: Our model explains 48.2% of the variation in house prices.
Adj R-squared = 0.4713: R-squared penalized for the number of predictors; it drops only slightly, to 47.1%.
F(3, 146) = 45.23, Prob > F = 0.0000: The model is statistically significant overall (at least one predictor matters).
Root MSE = 7.7888: The typical prediction error is about $7,789 (price is measured in thousands of dollars).
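All of the fit statistics above can be rebuilt from the sums of squares and degrees of freedom in the top-left block of the output. The Python sketch below (not Stata) does that arithmetic with the standard formulas; the variable names are chosen for this illustration. The recomputed values match the printed ones closely, though the last digits can differ slightly.

```python
import math

# Sums of squares and degrees of freedom from the regression output.
ss_model, df_model = 8234.56, 3
ss_resid, df_resid = 8856.32, 146
ss_total = ss_model + ss_resid
df_total = df_model + df_resid

# Mean squares: SS divided by its degrees of freedom.
ms_model = ss_model / df_model
ms_resid = ss_resid / df_resid

# Share of total variation explained by the model.
r_squared = ss_model / ss_total

# Adjusted R-squared compares residual and total variance per degree of freedom.
adj_r_squared = 1 - (ss_resid / df_resid) / (ss_total / df_total)

# Overall F statistic and the residual standard deviation.
f_stat = ms_model / ms_resid
root_mse = math.sqrt(ms_resid)

print(f"R-squared     = {r_squared:.4f}")
print(f"Adj R-squared = {adj_r_squared:.4f}")
print(f"F({df_model}, {df_resid})     = {f_stat:.2f}")
print(f"Root MSE      = {root_mse:.4f}")
```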
Output (VIF = variance inflation factor):
Variable |    VIF      1/VIF
---------+----------------------
sqft     |   1.45   0.689655
bedrooms |   1.52   0.657895
age      |   1.12   0.892857
---------+----------------------
Mean VIF |   1.36
Interpretation: All VIF values are below 10 (and even below 5), the usual rule-of-thumb cutoffs, so there is no sign of a multicollinearity problem.
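The VIF table can be unpacked with a little arithmetic. Since VIF = 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors, each VIF implies how much of the predictor is explained by the others. The Python sketch below (not Stata) recomputes the 1/VIF column and the mean VIF, then applies the rule-of-thumb cutoff of 5 mentioned above; the names are chosen for this illustration.

```python
# VIF values copied from the output above.
vif = {"sqft": 1.45, "bedrooms": 1.52, "age": 1.12}

# 1/VIF (often called tolerance) and the R-squared each VIF implies.
tolerance = {name: 1 / v for name, v in vif.items()}
implied_r2 = {name: 1 - 1 / v for name, v in vif.items()}

mean_vif = sum(vif.values()) / len(vif)

# Predictors at or above the stricter rule-of-thumb cutoff of 5.
flagged = [name for name, v in vif.items() if v >= 5]

for name in vif:
    print(f"{name:<9} VIF = {vif[name]:.2f}  1/VIF = {tolerance[name]:.6f}  "
          f"implied R2 = {implied_r2[name]:.3f}")
print(f"Mean VIF = {mean_vif:.2f}, flagged: {flagged}")
```

Even the largest VIF here (bedrooms) implies that only about a third of that predictor's variation is shared with the other two.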
Question: What’s the predicted price for a 2,000 sq ft house with 3 bedrooms that is 10 years old?
Calculate manually:
Price = 45.678 + 0.1245(2000) + 15.234(3) - 0.8234(10)
= 45.678 + 249 + 45.702 - 8.234
= 332.146 thousand dollars = $332,146
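The manual calculation can be double-checked in a few lines. This is a minimal Python sketch (not Stata) that plugs the house's values into the fitted equation; the dictionary names are chosen for this illustration.

```python
# Estimated coefficients from the regression output (price in thousands of dollars).
b = {"_cons": 45.678, "sqft": 0.1245, "bedrooms": 15.234, "age": -0.8234}

# The house in the question.
house = {"sqft": 2000, "bedrooms": 3, "age": 10}

# Intercept plus coefficient * value for each predictor.
price_thousands = b["_cons"] + sum(b[k] * house[k] for k in house)
price_dollars = price_thousands * 1000

print(f"Predicted price: {price_thousands:.3f} thousand dollars")
```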
Or use Stata:
✅ Square footage shows the strongest statistical evidence of an association with price (largest t-statistic in absolute value)
✅ Older houses sell for less - consider renovation strategies
✅ Additional bedrooms add value - could guide development decisions
⚠️ Model explains only 48% of variation - other factors matter (location, condition, etc.)
Recommendation: Collect additional variables to improve predictions.
Please don’t forget to complete your SETs
That’s all for this quarter
Office hours: I’m available now if you have any questions
STAT 17 – Fall 2025