STAT 17: Linear Regression

Prof. Marcela Alfaro Cordoba

Statistics - UCSC

25 Nov 2025

📊 Case Study: TechStart Ventures

The Challenge

TechStart Ventures is a venture capital firm evaluating startup investments. They’ve noticed that advertising spending seems related to revenue, but they need a systematic way to:

  • Quantify the strength of relationships between variables
  • Predict future revenue based on planned advertising budgets
  • Make decisions about which startups show the strongest revenue-to-marketing relationships

Today: Learn how correlation and linear regression can transform scattered data points into actionable business insights!

Learning Objectives 🎯

By the end of today’s lecture, you will be able to:

  • Understand and interpret scatterplots to visualize relationships
  • Calculate and interpret the correlation coefficient (r)
  • Develop and interpret simple linear regression equations
  • Understand what regression coefficients (slope and intercept) mean
  • Make predictions using regression equations
  • Use Google Sheets for correlation and regression analysis

Why Linear Regression is Essential

📈 Descriptive Statistics

Summarize relationships: How strong is the association between advertising and revenue?

Quantify direction: Does the relationship increase or decrease?

Visualize patterns: See trends in your data clearly

🔮 Inferential Statistics

Make predictions: Forecast revenue for new advertising budgets

Test hypotheses: Is this relationship statistically significant?

Quantify uncertainty: How confident are we in our predictions?

Important

Regression is the bridge between description and prediction!

Part 1: Visualizing Relationships 📉

Scatterplots

Guess the correlation: https://www.geogebra.org/m/KE6JfuF9

Scatterplots: The Foundation

What to look for:

  • Direction: Positive or negative?
  • Form: Linear or curved?
  • Strength: Tight cluster or scattered?
  • Outliers: Unusual points?

THINK-PAIR-SHARE 1 (5 minutes)

Looking at the scatterplot on the previous slide:

What best describes the relationship between advertising and revenue?

A. Strong positive linear relationship
B. Weak positive linear relationship
C. No relationship
D. Negative linear relationship

Part 2: Quantifying Relationships 📏

The Correlation Coefficient

Correlation Coefficient (r)

Measures the strength and direction of a linear relationship

Formula (you won’t calculate by hand!)

\[r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}\]

Range: \(-1 \leq r \leq +1\)
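As a rough sketch of what the formula does, the snippet below computes r term by term in Python. The five data points are hypothetical values made up for illustration (they are not the TechStart data set), and the function name `correlation` is our own.

```python
import math

# Hypothetical illustration data (NOT the TechStart data), in thousands of dollars.
advertising = [10, 20, 30, 40, 50]
revenue = [22, 27, 38, 41, 52]


def correlation(x, y):
    """Pearson correlation coefficient r, computed directly from the formula."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    # Numerator: sum of products of deviations from the means.
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the two sums of squared deviations.
    denominator = math.sqrt(
        sum((xi - x_bar) ** 2 for xi in x) * sum((yi - y_bar) ** 2 for yi in y)
    )
    return numerator / denominator


r = correlation(advertising, revenue)
print(round(r, 3))  # 0.987
```

The result always falls between -1 and +1, whatever units x and y are measured in.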

Interpretation Guidelines:

Value of |r| | Strength
0.0 - 0.3    | Weak
0.3 - 0.7    | Moderate
0.7 - 1.0    | Strong

For TechStart data: \(r = 0.85\) → Strong positive linear relationship!
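The guideline table can be written as a small lookup, sketched here in Python. The function name `describe_correlation` is hypothetical, and because the table's ranges share the endpoints 0.3 and 0.7, this sketch assumes a boundary value gets the stronger label.

```python
def describe_correlation(r):
    """Label r using the guideline table above.

    Assumption: values exactly at 0.3 or 0.7 receive the stronger label.
    """
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and +1")
    if r == 0:
        return "No linear relationship"
    size = abs(r)
    if size < 0.3:
        strength = "Weak"
    elif size < 0.7:
        strength = "Moderate"
    else:
        strength = "Strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


print(describe_correlation(0.85))  # Strong positive
```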

Understanding Correlation Values

Warning

⚠️ Correlation ≠ Causation! High correlation doesn’t mean one variable causes the other.

Google Sheets: Calculating Correlation

Function: =CORREL(range1, range2)

Example for TechStart data:

=CORREL(A2:A26, B2:B26)

Where:

  • A2:A26 contains Advertising data
  • B2:B26 contains Revenue data

Try it yourself!

  1. Make your own copy of this sheet
  2. Use =CORREL(A:A, B:B) for entire columns
  3. Interpret the result using the guidelines

THINK-PAIR-SHARE 2 (7 minutes)

A study finds r = -0.68 between hours of study and test anxiety.

What does this mean?

A. More study causes less anxiety
B. Less study causes more anxiety
C. There’s a moderate negative relationship
D. There’s no relationship

🍵 Break Time!

5 Minute Break

We’ll return to build regression equations and make predictions!

Part 3: Simple Linear Regression 📐

Building Prediction Equations

The Regression Line: “Line of Best Fit”

Goal: Find the line that minimizes the sum of squared errors (vertical distances from points to line)

This is called the Least Squares Regression Line
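To make "minimizes the sum of squared errors" concrete, here is a minimal Python sketch. It uses the standard closed-form least squares formulas (not shown on the slides) and hypothetical data made up for illustration. It then checks that nudging the fitted line in either direction makes the sum of squared errors larger.

```python
# Hypothetical illustration data (NOT the TechStart data), in thousands of dollars.
advertising = [10, 20, 30, 40, 50]
revenue = [22, 27, 38, 41, 52]


def least_squares(x, y):
    """Return (intercept, slope) of the least squares line."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_errors(x, y, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = least_squares(advertising, revenue)
best = sum_squared_errors(advertising, revenue, b0, b1)

# Any other line should fit worse: try a steeper line and a higher line.
steeper = sum_squared_errors(advertising, revenue, b0, b1 + 0.1)
higher = sum_squared_errors(advertising, revenue, b0 + 1, b1)
print(best < steeper, best < higher)  # True True
```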

The Regression Equation

General Form

\[\widehat{y} = b_0 + b_1 x\]

Where:

  • \(\widehat{y}\) (y-hat) = predicted value of y
  • \(b_0\) = y-intercept (predicted value of y when x = 0)
  • \(b_1\) = slope (predicted change in y for each 1-unit increase in x)
  • \(x\) = value of the explanatory variable

For TechStart Ventures

\[\widehat{\text{Revenue}} = 12.45 + 0.78 \times \text{Advertising}\]

  • Intercept (12.45): Expected revenue with $0 advertising is $12,450
  • Slope (0.78): For each $1,000 increase in advertising, revenue is expected to increase by $780

Understanding the Slope

The slope (\(b_1\)) is the most important coefficient!

\[b_1 = r \times \frac{s_y}{s_x}\]

Where:

  • \(r\) = correlation coefficient
  • \(s_y\) = standard deviation of y
  • \(s_x\) = standard deviation of x

Key Insight

The slope combines:

  1. Correlation (how strong is the relationship?)
  2. Relative variability (how much does y vary compared to x?)

For TechStart: \(b_1 = 0.85 \times \frac{18.2}{19.7} \approx 0.78\) (using rounded values of \(r\), \(s_y\), and \(s_x\))
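The slope formula can be checked numerically. The sketch below first plugs in the TechStart summary values from the lecture, which gives about 0.785 (reported as 0.78 on the slides). It then confirms on a small hypothetical data set that \(r \times s_y / s_x\) matches the slope computed directly by least squares. The function name `slope_from_summary` and the data are assumptions for illustration.

```python
import math
import statistics


def slope_from_summary(r, s_y, s_x):
    """Slope of the least squares line from r and the two standard deviations."""
    return r * s_y / s_x


# TechStart summary values quoted in the lecture.
print(round(slope_from_summary(0.85, 18.2, 19.7), 3))  # 0.785

# Hypothetical illustration data (NOT the TechStart data).
x = [10, 20, 30, 40, 50]
y = [22, 27, 38, 41, 52]
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)   # correlation coefficient
slope_direct = s_xy / s_xx          # standard least squares slope
slope_summary = slope_from_summary(r, statistics.stdev(y), statistics.stdev(x))
print(math.isclose(slope_direct, slope_summary))  # True
```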

Making Predictions or Estimations

Example: A startup plans to spend $50,000 on advertising. What revenue do we predict?

Step 1: Identify the regression equation

\[\widehat{\text{Revenue}} = 12.45 + 0.78 \times \text{Advertising}\]

Step 2: Substitute x = 50 (advertising and revenue are both measured in thousands of dollars)

\[\widehat{\text{Revenue}} = 12.45 + 0.78 \times 50 = 12.45 + 39 = 51.45\]

Step 3: Interpret in context

We predict this startup will generate approximately $51,450 in revenue.
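The three steps can be sketched as a small Python function. The name `predict_revenue` is hypothetical; the intercept and slope are the TechStart values from the lecture, with inputs and outputs in thousands of dollars.

```python
def predict_revenue(advertising_thousands, intercept=12.45, slope=0.78):
    """Predicted revenue (in $ thousands) for a given advertising spend (in $ thousands)."""
    return intercept + slope * advertising_thousands


prediction = predict_revenue(50)          # $50,000 of advertising
print(f"${prediction * 1000:,.0f}")       # $51,450
```

Note that the function will return a number for any input; judging whether the input lies within the range of the observed data is still up to the analyst.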

Warning

⚠️ Extrapolation warning: Only make predictions within the range of your data!

THINK-PAIR-SHARE 3 (7 minutes)

Using the TechStart regression equation:
\(\widehat{\text{Revenue}} = 12.45 + 0.78 \times \text{Advertising}\)

If advertising spending is $30,000, what revenue do we predict?

A. $24,850
B. $35,850
C. $42,450
D. $51,450

Residuals: Prediction Errors

Residual: The difference between actual and predicted values

\[e = y - \widehat{y}\]

Properties:

  • Positive: Actual > Predicted
  • Negative: Actual < Predicted
  • Sum of residuals = 0 (for the least squares line)
  • Used to assess model fit

Good model: Small residuals
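A short Python sketch of these properties, using hypothetical data made up for illustration (not the TechStart data). The line is fit with the standard least squares formulas, then each residual is computed as actual minus predicted.

```python
# Hypothetical illustration data (NOT the TechStart data), in thousands of dollars.
x = [10, 20, 30, 40, 50]
y = [22, 27, 38, 41, 52]

# Fit the least squares line.
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

# Residual = actual - predicted, one per data point.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print([round(e, 2) for e in residuals])  # [0.8, -1.6, 2.0, -2.4, 1.2]

# Positive and negative residuals cancel out (up to floating-point error).
print(abs(sum(residuals)) < 1e-9)  # True
```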

R-squared: “Goodness of Fit”

How much of the variability in y is explained by x?

\[R^2 = r^2\]

For TechStart Ventures

\[R^2 = (0.85)^2 \approx 0.72 = 72\%\]

Interpretation: 72% of the variability in Revenue is explained by Advertising spending.

Guidelines:

\(R^2\) Value | Interpretation
0.00 - 0.30   | Weak fit
0.30 - 0.70   | Moderate fit
0.70 - 1.00   | Strong fit

The remaining 28% is due to other factors (product quality, market timing, competition, etc.)
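One way to see why \(R^2\) is read as "variability explained" is to compute it as one minus the ratio of unexplained variation (squared residuals) to total variation in y, and compare that with \(r^2\). The sketch below does this on hypothetical data (not the TechStart data); the "1 minus ratio" form is the standard definition and is not shown on the slides.

```python
import math

# TechStart value from the lecture: r = 0.85.
print(round(0.85 ** 2, 4))  # 0.7225, i.e. about 72%

# Hypothetical illustration data (NOT the TechStart data).
x = [10, 20, 30, 40, 50]
y = [22, 27, 38, 41, 52]
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)   # total variation in y

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation

r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = 1 - sse / s_yy
print(math.isclose(r_squared, r ** 2))  # True
```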

Google Sheets: Regression Analysis

Method 1: Using Functions

Slope: =SLOPE(known_y's, known_x's)
Intercept: =INTERCEPT(known_y's, known_x's)
R-squared: =RSQ(known_y's, known_x's)

Method 2: Built-in Regression Tool

  1. Select your data (including headers)
  2. Insert → Chart
  3. Chart type: Scatter chart
  4. Customize → Series → Trendline
  5. Check “Show R²” and “Show equation”


Tip

The trendline method is fastest and visualizes everything at once!

THINK-PAIR-SHARE 4 (7 minutes)

A regression model has \(R^2 = 0.40\).

What does this mean?

A. The correlation coefficient is 0.40
B. 40% of variability in y is explained by x
C. The model makes 40% errors
D. We’re 40% confident in predictions

Putting It All Together: TechStart Example

Complete Analysis

Research Question: Can we predict startup revenue from advertising spending?

  1. Scatterplot: Shows clear positive linear relationship

  2. Correlation: \(r = 0.85\) (strong positive)

  3. Regression Equation: \(\widehat{\text{Revenue}} = 12.45 + 0.78 \times \text{Advertising}\)

  4. Interpretation:

    • Each additional $1,000 in advertising predicts a $780 increase in revenue
    • With $0 advertising, predict $12,450 base revenue
  5. Model Fit: \(R^2 = 0.72\) (72% of revenue variability explained)

  6. Prediction: For $50,000 advertising → predict $51,450 revenue

Key Concepts to Remember

  1. Scatterplots visualize relationships between two quantitative variables

  2. Correlation (r) measures strength and direction of linear relationships

    • Range: -1 to +1
    • Does NOT imply causation
  3. Regression equation \(\hat{y} = b_0 + b_1x\) models the relationship

    • Intercept (\(b_0\)): Predicted y when x = 0
    • Slope (\(b_1\)): Change in y per unit change in x
  4. R-squared measures model fit (proportion of variability explained)

  5. Use regression for prediction within data range (avoid extrapolation)

Looking Ahead: Next Lecture Preview

Next week: Inference with Regression

  • Is the relationship statistically significant?
  • Hypothesis tests for slope (is the true slope different from 0?)
  • Confidence intervals for predictions
  • Standard error and prediction uncertainty
  • Conditions for regression inference

Coming Up

We’ll shift from describing relationships to making inferences about populations based on sample data!

Practice Problems

Try these before next class:

  1. Calculate correlation for: Revenue vs Menu_price data (provided in the same shared Sheet)

  2. Given: \(\widehat{y} = -14.4 + 2.54x\)

    • Interpret the slope and intercept
    • Predict y when x = 16
    • If actual y = 40 when x = 10, what’s the residual?
  3. If r = 0.95, what is \(R^2\)? What percentage of variability is explained?

  4. Use Google Sheets to:

    • Create a scatterplot
    • Calculate correlation
    • Add a trendline with equation

Questions? 💭

What questions do you have about:

  • Scatterplots?
  • Correlation?
  • Regression equations?
  • Making predictions?
  • Using Google Sheets?

Thank You! 🎉

See you next time for:

Inference with Regression

Office hours: I’ll be here after class in case you need to talk.