Stat 17: Introduction to Statistical Methods for Business and Economics

Recap: Our Journey So Far

Lecture 1: Simple regression with one predictor

$\hat{Y} = b_0 + b_1 X$
Predicted Sales = 45.2 + 3.8(Advertising)

Lecture 2: Statistical inference

Testing significance
Confidence and prediction intervals
R² = 0.67 (67% of the variation in sales explained)

Today’s Question:

Can we do better by including multiple predictors?

📊 The RetailMax Case: Final Chapter

New Available Data

Beyond advertising spending, we now have:

Store size (square feet in thousands)
Number of competitors within 5 miles
Median household income in store’s zip code (thousands)
Years in operation

Can these variables improve our sales predictions?

🎯 Today’s Learning Objectives

Understand multiple regression models
Interpret coefficients with multiple predictors
Build multiple regression models in Google Sheets
Assess model quality and variable importance
Make predictions with multiple variables
Apply multiple regression to real business problems

Part 1: Multiple Regression Concepts

From one predictor to many

Why Multiple Regression?

Simple Regression Limitations

Only one predictor
Ignores other influences
May have confounding variables
Lower predictive power

Multiple Regression Advantages

Multiple predictors simultaneously
Controls for measured confounders
Often better predictions (higher R²)
More realistic models

Real world is multivariate!

The Multiple Regression Model

Population Model

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \epsilon\]

Sample Regression Equation

\[\widehat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k\]

Where:

$k$ = number of predictors
$b_0$ = intercept
$b_i$ = coefficient for $X_i$ (holding other variables constant)

Key Difference: “Holding Other Variables Constant”

Critical Interpretation Change

Simple regression: “For each one-unit increase in X, predicted Y changes by $b_1$”

Multiple regression: “For each one-unit increase in $X_i$, holding all other variables constant, predicted Y changes by $b_i$”

This is also called:

“Controlling for other variables”
“All else equal”
“Ceteris paribus”
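One way to see why: take two predictions from the sample regression equation that differ only by a one-unit increase in $X_i$. Every other term cancels, leaving just $b_i$ (the subscripts "new" and "old" are labels introduced here for illustration):

\[\widehat{Y}_{\text{new}} - \widehat{Y}_{\text{old}} = b_i (X_i + 1) - b_i X_i = b_i\]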

THINK-PAIR-SHARE 1 (5 minutes)

In a regression predicting salary from years of experience and education level, the coefficient for experience is $2,500. This means:

A. Each year of experience adds $2,500 to salary
B. Each year of experience adds $2,500, holding education constant
C. Experience is more important than education
D. Both A and B are correct

RetailMax: Multiple Regression Model

Variables in Our Model

Outcome (Y): Monthly Sales Revenue (thousands of dollars)

Predictors (X’s):

$X_1$ = Advertising Spending (thousands of dollars)
$X_2$ = Store Size (thousands of sq ft)
$X_3$ = Number of Competitors (within 5 miles)
$X_4$ = Median Household Income (thousands of dollars)

Equation

\[\widehat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4\]

Part 2: Building the Model

Let’s do this together!

👥 Example: Sales and Advertising

Data Structure

Sales	Advertising	Size	Competitors	Income
85.2	10.5	25	3	65
92.1	12.3	30	2	72
…	…	…	…	…
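As a rough illustration of what regression software computes from a table laid out like this, here is a minimal Python sketch of ordinary least squares using the normal equations. Python is used only for illustration and is not the tool used in class. The helper name `fit_ols` is hypothetical, and the data are synthetic: the predictor values are made up, and sales are generated without noise from the example coefficients used later in this lecture, so this is not the RetailMax data.

```python
def fit_ols(rows, y):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]      # add the intercept column
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution
    b = [0.0] * p
    for i in reversed(range(p)):
        b[i] = (A[i][p] - sum(A[i][j] * b[j] for j in range(i + 1, p))) / A[i][i]
    return b


# SYNTHETIC data (assumption, not the RetailMax file): sales are generated
# without noise from TRUE_B, so the fit should recover TRUE_B.
TRUE_B = [15.3, 3.2, 1.8, -4.2, 0.35]
rows = [  # (advertising, size, competitors, income)
    (10, 25, 3, 65), (12, 30, 2, 72), (8, 20, 5, 58), (15, 28, 1, 80),
    (11, 35, 4, 61), (9, 22, 0, 70), (14, 32, 3, 55), (7, 27, 2, 68),
]
y = [TRUE_B[0] + sum(b * x for b, x in zip(TRUE_B[1:], r)) for r in rows]

b = fit_ols(rows, y)
print([round(v, 4) for v in b])
```

With real, noisy data the estimates would not match any "true" values exactly, and in practice a statistics package is preferable because it also reports standard errors and p-values.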

Reading the Output: Coefficients

Example Output

Variable	Coefficient	Std Error	t-Stat	p-value
Intercept	15.3	8.2	1.87	0.068
Advertising	3.2	0.45	7.11	<0.001
Size	1.8	0.32	5.63	<0.001
Competitors	-4.2	1.1	-3.82	<0.001
Income	0.35	0.18	1.94	0.058
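The t-Stat column is simply each coefficient divided by its standard error. A short Python sketch (for illustration only) that recomputes it from the Coefficient and Std Error columns above:

```python
# Coefficients and standard errors copied from the example output above.
output = {
    "Intercept":   (15.3, 8.2),
    "Advertising": (3.2, 0.45),
    "Size":        (1.8, 0.32),
    "Competitors": (-4.2, 1.1),
    "Income":      (0.35, 0.18),
}

# t-statistic = coefficient / standard error
t_stats = {name: coef / se for name, (coef, se) in output.items()}

for name, t in t_stats.items():
    print(f"{name:<12} t = {t:.3f}")
```

The results agree with the t-Stat column up to rounding.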

What does each coefficient mean?

Interpreting the Coefficients

$b_1 = 3.2$ (Advertising)

“Holding store size, competitors, and income constant, each additional $1,000 in advertising is associated with $3,200 more in sales.”

$b_2 = 1.8$ (Size)

“Holding advertising, competitors, and income constant, each additional 1,000 sq ft is associated with $1,800 more in sales.”

$b_3 = -4.2$ (Competitors)

“Holding other variables constant, each additional competitor is associated with $4,200 less in sales.”
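The "holding other variables constant" reading can be checked numerically: change one predictor by one unit, leave the rest alone, and compare the two predictions. A minimal Python sketch follows; the store values are hypothetical and chosen only for illustration, and `predict_sales` is a helper name introduced here.

```python
# Coefficients from the example output (sales in thousands of dollars).
COEF = {"intercept": 15.3, "ad": 3.2, "size": 1.8, "comp": -4.2, "income": 0.35}

def predict_sales(ad, size, comp, income):
    """Predicted monthly sales in thousands of dollars."""
    return (COEF["intercept"] + COEF["ad"] * ad + COEF["size"] * size
            + COEF["comp"] * comp + COEF["income"] * income)

# Hypothetical store: only the number of competitors differs between the two calls.
base = predict_sales(ad=10, size=25, comp=3, income=65)
one_more_competitor = predict_sales(ad=10, size=25, comp=4, income=65)

change = one_more_competitor - base
print(f"Change in predicted sales: {change:.1f} thousand dollars")
```

The change equals the Competitors coefficient regardless of which baseline store is chosen, because the model is linear.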

THINK-PAIR-SHARE 2 (5 minutes)

If all predictor coefficients are significant at the 5% level except Income (p = 0.058), what should we conclude about Income?

A. Remove it immediately
B. It has no statistically significant association with sales
C. It’s marginally significant; consider context
D. The model is invalid

Complete Regression Equation

Based on Our Output

\[\widehat{\text{Sales}} = 15.3 + 3.2(\text{Ad}) + 1.8(\text{Size}) - 4.2(\text{Comp}) + 0.35(\text{Income})\]

Prediction Example

Store with: Advertising = $12K, Size = 28K sq ft, Competitors = 3, Income = $68K

\[\widehat{Y} = 15.3 + 3.2(12) + 1.8(28) - 4.2(3) + 0.35(68) = 115.3\]

Predicted Sales: $115,300
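The same arithmetic as a small Python sketch (illustration only). The function name `predict` and the dictionary keys are chosen here for convenience and do not come from any particular software.

```python
def predict(intercept, coefs, values):
    """Intercept plus the sum of coefficient * value over all predictors."""
    return intercept + sum(coefs[name] * values[name] for name in coefs)

# Coefficients from the regression output; units are thousands.
coefs = {"ad": 3.2, "size": 1.8, "comp": -4.2, "income": 0.35}
store = {"ad": 12, "size": 28, "comp": 3, "income": 68}

sales = predict(15.3, coefs, store)
print(f"Predicted sales: {sales:.1f} thousand dollars")
```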

🔟 10-Minute Break! ☕

When we return: Model assessment, variable selection, complete example

Part 3: Model Assessment

Is our multiple regression model good?

R² and Adjusted R²

R² (Regular)

Never decreases when variables are added, even useless ones
Simple: R² = 0.67
Multiple: R² = 0.84

Adjusted R²

Penalizes extra variables
Only increases if the added variable improves the fit enough to offset the penalty
Better for comparing models
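For reference, the usual formula behind the penalty is shown below. Here $n$ (the number of observations) is a symbol not defined earlier in these slides; $k$ is the number of predictors, as before.

\[R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}\]

Adding a predictor raises $k$, which inflates the fraction, so adjusted R² rises only when the gain in R² outweighs that inflation.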

THINK-PAIR-SHARE 3 (5 minutes)

A variable is not significant in multiple regression but was in simple regression. This suggests:

A. Data error
B. Other variables explain its effect
C. Multiple regression is wrong
D. Remove all variables

👥 Complete Class Example: Predicting House Prices

Dataset: Houses in California

Variables:

price: House price (thousands $)
sqft: Square footage
bedrooms: Number of bedrooms
age: House age (years)

Step 1: Run the Regression in Stata

regress price sqft bedrooms age

What this does: Runs a multiple regression with price as the dependent variable and sqft, bedrooms, and age as independent variables.

Step 2: Understanding Stata Output

  Source |         SS     df         MS    Number of obs =    150
---------+-----------------------------    F(3, 146)     =  45.23
   Model |    8234.56      3    2744.85    Prob > F      = 0.0000
Residual |    8856.32    146      60.66    R-squared     = 0.4820
---------+-----------------------------    Adj R-squared = 0.4713
   Total |   17090.88    149     114.70    Root MSE      = 7.7888

----------------------------------------------------------------------
   price |     Coef.  Std. Err.        t    P>|t| [95% Conf. Interval]
---------+------------------------------------------------------------
    sqft |    0.1245     0.0156     7.98    0.000     0.0937    0.1553
bedrooms |    15.234      3.456     4.41    0.000      8.412    22.056
     age |   -0.8234     0.2134    -3.86    0.000     -1.245   -0.4018
   _cons |    45.678      8.234     5.55    0.000     29.412    61.944
----------------------------------------------------------------------

Step 3: Interpreting the Coefficients

sqft (0.1245): Holding bedrooms and age constant, each additional square foot is associated with a $124.50 higher price.

bedrooms (15.234): Holding square footage and age constant, each additional bedroom is associated with a $15,234 higher price.

age (-0.8234): Holding square footage and bedrooms constant, each additional year of age is associated with an $823.40 lower price.

_cons (45.678): The predicted price ($45,678) for a house with 0 sqft, 0 bedrooms, and age 0. No such house exists, so the intercept has no practical interpretation.
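Because price is recorded in thousands of dollars, each coefficient has to be multiplied by 1,000 to read it in dollars. A tiny Python sketch of that conversion (illustration only):

```python
# Coefficients from the Stata output; price is measured in thousands of dollars.
coef_thousands = {"sqft": 0.1245, "bedrooms": 15.234, "age": -0.8234}

# Multiply by 1,000 to express each per-unit effect in dollars.
coef_dollars = {name: b * 1000 for name, b in coef_thousands.items()}

for name, dollars in coef_dollars.items():
    print(f"{name}: {dollars:+,.2f} dollars per one-unit increase")
```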

Step 4: Model Quality Assessment

R-squared = 0.4820: The model explains 48.2% of the variation in house prices.

Adj R-squared = 0.4713: After adjusting for the number of predictors, the figure is 47.1%, so the penalty for using three predictors is small.

F(3, 146) = 45.23, Prob > F = 0.0000: The model is statistically significant overall (p < 0.0001), so at least one coefficient differs from zero.

Root MSE = 7.7888: The typical prediction error is about $7,789, since price is measured in thousands of dollars.
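All four summary statistics can be recomputed from the sums of squares and degrees of freedom in the ANOVA table. The Python sketch below (illustration only) does this. Because the printed table is rounded, the recomputed values can differ slightly from the reported ones in the last digit or two.

```python
import math

# Sums of squares and sample size from the ANOVA table in the Stata output.
ss_model, ss_resid = 8234.56, 8856.32
n, k = 150, 3                      # observations, predictors

ss_total = ss_model + ss_resid
df_resid = n - k - 1               # residual degrees of freedom

r2 = ss_model / ss_total
adj_r2 = 1 - (ss_resid / df_resid) / (ss_total / (n - 1))
f_stat = (ss_model / k) / (ss_resid / df_resid)
root_mse = math.sqrt(ss_resid / df_resid)

print(f"R-squared     = {r2:.4f}")
print(f"Adj R-squared = {adj_r2:.4f}")
print(f"F({k}, {df_resid})     = {f_stat:.2f}")
print(f"Root MSE      = {root_mse:.4f}")
```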

Step 5: Check Multicollinearity

estat vif

Output:

Variable |       VIF       1/VIF  
---------+----------------------
sqft     |      1.45     0.689655
bedrooms |      1.52     0.657895
age      |      1.12     0.892857
---------+----------------------
Mean VIF |      1.36

Interpretation: All VIF values are well below the common rule-of-thumb cutoffs of 5 and 10, so multicollinearity is not a concern here.
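The VIF for a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors, so the 1/VIF column (often called tolerance) is the share of the predictor's variation not explained by the others. A short Python sketch (illustration only) that recovers those quantities from the reported VIF values:

```python
# VIF values copied from the Stata output above.
vif = {"sqft": 1.45, "bedrooms": 1.52, "age": 1.12}

# Tolerance is the 1/VIF column.
tolerance = {name: 1 / v for name, v in vif.items()}

# Implied R-squared from regressing each predictor on the other predictors.
r2_aux = {name: 1 - t for name, t in tolerance.items()}

mean_vif = sum(vif.values()) / len(vif)

for name in vif:
    print(f"{name:<9} 1/VIF = {tolerance[name]:.6f}  implied R-squared = {r2_aux[name]:.3f}")
print(f"Mean VIF = {mean_vif:.2f}")
```

For example, a VIF of 1.52 for bedrooms implies that roughly a third of the variation in bedrooms is explained by square footage and age.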

Step 6: Making a Prediction

Question: What’s the predicted price for a 2000 sqft house, 3 bedrooms, 10 years old?

Calculate manually:

Price = 45.678 + 0.1245(2000) + 15.234(3) - 0.8234(10)
      = 45.678 + 249 + 45.702 - 8.234
      = 332.146 thousand dollars = $332,146

Or use Stata:

display 45.678 + 0.1245*2000 + 15.234*3 - 0.8234*10

Step 7: Business Implications

✅ Square footage shows the strongest statistical evidence of an association with price (largest t-statistic)

✅ Older houses sell for less - consider renovation strategies

✅ Additional bedrooms add value - could guide development decisions

⚠️ Model explains only 48% of variation - other factors matter (location, condition, etc.)

Recommendation: Collect additional variables to improve predictions.

Questions?

Please don’t forget to complete your SETs

Thank You! 🎉

That’s all for this quarter

Office hours: I’m available now if you have any questions