
Statistics - UCSC
25 Nov 2025
The Challenge
TechStart Ventures is a venture capital firm evaluating startup investments. They’ve noticed that advertising spending seems related to revenue, but they need a systematic way to:
Today: Learn how correlation and linear regression can transform scattered data points into actionable business insights!
By the end of today’s lecture, you will be able to:
Summarize relationships: How strong is the association between advertising and revenue?
Quantify direction: Does the relationship increase or decrease?
Visualize patterns: See trends in your data clearly
Make predictions: Forecast revenue for new advertising budgets
Test hypotheses: Is this relationship statistically significant?
Quantify uncertainty: How confident are we in our predictions?
Important
Regression is the bridge between description and prediction!
Guess the correlation: https://www.geogebra.org/m/KE6JfuF9
What to look for:
Looking at the scatterplot on the previous slide:
What best describes the relationship between advertising and revenue?
A. Strong positive linear relationship
B. Weak positive linear relationship
C. No relationship
D. Negative linear relationship
Measures the strength and direction of a linear relationship
Formula (you won’t calculate by hand!)
\[r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}\]
Range: \(-1 \leq r \leq +1\)
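You won't compute this by hand, but seeing the formula as code can make each piece less mysterious. The sketch below follows the formula term by term; the function name `pearson_r` and the five data points are made up for illustration and are not the TechStart dataset.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r, computed term by term from the formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the two means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration data (NOT the TechStart dataset), in thousands of dollars
advertising = [10, 20, 30, 40, 50]
revenue = [22, 25, 38, 41, 54]

r = pearson_r(advertising, revenue)
print(round(r, 3))
```

Points that fall exactly on an upward-sloping line give r = +1, and points exactly on a downward-sloping line give r = -1.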
Interpretation Guidelines:
| Value of \(\lvert r \rvert\) | Strength |
|---|---|
| 0.0 - 0.3 | Weak |
| 0.3 - 0.7 | Moderate |
| 0.7 - 1.0 | Strong |
For TechStart data: \(r = 0.85\) → Strong positive linear relationship!
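The guideline table can be read as a simple lookup. This is a minimal sketch: the helper name `describe_correlation` is hypothetical, and because the table's ranges share their endpoints (0.3 and 0.7), placing a boundary value in the stronger category is an assumption, not a rule from the lecture.

```python
def describe_correlation(r):
    """Label r using the guideline table (boundary handling is an assumption)."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and +1")
    if r == 0:
        return "no linear relationship"
    size = abs(r)
    if size < 0.3:
        strength = "weak"
    elif size < 0.7:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

print(describe_correlation(0.85))  # strong positive
```

The label describes only the linear pattern in the data; it says nothing about whether one variable causes the other.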
Warning
⚠️ Correlation ≠ Causation! High correlation doesn’t mean one variable causes the other.
Function: =CORREL(range1, range2)
Example for TechStart data:
=CORREL(A2:A26, B2:B26)
Where:
- A2:A26 contains Advertising data
- B2:B26 contains Revenue data
Try it yourself!
=CORREL(A:A, B:B) for entire columns
A study finds r = -0.68 between hours of study and test anxiety.
What does this mean?
A. More study causes less anxiety
B. Less study causes more anxiety
C. There’s a moderate negative relationship
D. There’s no relationship
5 Minute Break
We’ll return to build regression equations and make predictions!
Goal: Find the line that minimizes the sum of squared errors (vertical distances from points to line)
This is called the Least Squares Regression Line
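To see what "minimizes the sum of squared errors" means in practice, the sketch below fits a line using the standard closed-form least squares solution and then checks that a few nudged lines all have a larger sum of squared errors. The function names and the five data points are hypothetical, chosen only for illustration.

```python
def sse(xs, ys, b0, b1):
    """Sum of squared vertical distances from the points to the line b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Intercept and slope of the least squares line (standard closed form)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical illustration data (NOT the TechStart dataset)
advertising = [10, 20, 30, 40, 50]
revenue = [22, 25, 38, 41, 54]

b0, b1 = least_squares_line(advertising, revenue)
best = sse(advertising, revenue, b0, b1)

# Any other line should do no better; try a few nudged alternatives
nudges = [(1, 0), (-1, 0), (0, 0.05), (0, -0.05), (2, -0.1)]
others = [sse(advertising, revenue, b0 + db0, b1 + db1) for db0, db1 in nudges]
for (db0, db1), other in zip(nudges, others):
    print(f"b0={b0 + db0:.2f}, b1={b1 + db1:.2f}: SSE={other:.2f} (best={best:.2f})")
```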
General Form
\[\widehat{y} = b_0 + b_1 x\]
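Where \(\widehat{y}\) is the predicted value of y, \(b_0\) is the intercept (the predicted y when x = 0), \(b_1\) is the slope (the predicted change in y for each one-unit increase in x), and x is the explanatory variable.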
For TechStart Ventures
\[\widehat{\text{Revenue}} = 12.45 + 0.78 \times \text{Advertising}\]
The slope (\(b_1\)) is the most important coefficient!
\[b_1 = r \times \frac{s_y}{s_x}\]
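Where \(r\) is the correlation coefficient, \(s_y\) is the standard deviation of y, and \(s_x\) is the standard deviation of x.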
Key Insight
The slope combines:
1. Correlation (how strong is the relationship?)
2. Relative variability (how much does y vary compared to x?)
For TechStart: \(b_1 = 0.85 \times \frac{18.2}{19.7} \approx 0.79\), which agrees with the fitted slope of 0.78 up to rounding in \(r\), \(s_y\), and \(s_x\).
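The slope formula can be checked numerically. The sketch below first plugs in the summary statistics quoted for TechStart, then confirms on a small made-up dataset that \(r \times s_y / s_x\) gives the same slope as fitting the line directly. The helper name `slope_from_summary` and the five data points are hypothetical.

```python
import math
import statistics

def slope_from_summary(r, s_y, s_x):
    """Slope of the least squares line from r and the two standard deviations."""
    return r * s_y / s_x

# Summary statistics quoted in the lecture for the TechStart data
b1_techstart = slope_from_summary(0.85, 18.2, 19.7)
print(round(b1_techstart, 3))  # 0.785

# Hypothetical illustration data (NOT the TechStart dataset)
xs = [10, 20, 30, 40, 50]
ys = [22, 25, 38, 41, 54]
x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
r = sxy / math.sqrt(sxx * syy)

b1_summary = slope_from_summary(r, statistics.stdev(ys), statistics.stdev(xs))
b1_direct = sxy / sxx  # least squares slope computed directly
print(math.isclose(b1_summary, b1_direct))  # True
```

Because the inputs are rounded, the hand calculation for TechStart lands near, but not exactly on, the reported slope of 0.78.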
Example: A startup plans to spend $50,000 on advertising. What revenue do we predict?
Step 1: Identify the regression equation
\[\widehat{\text{Revenue}} = 12.45 + 0.78 \times \text{Advertising}\]
Step 2: Substitute x = 50 (advertising is measured in thousands of dollars)
\[\widehat{\text{Revenue}} = 12.45 + 0.78 \times 50 = 12.45 + 39 = 51.45\]
Step 3: Interpret in context
We predict this startup will generate approximately $51,450 in revenue.
Warning
⚠️ Extrapolation warning: Only make predictions within the range of your data!
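The three steps, plus the extrapolation check, can be sketched as a small function. The coefficients come from the TechStart equation; the function name `predict_revenue` and the range limits 5 and 100 are placeholders, since the actual minimum and maximum advertising values in the data are not stated here.

```python
def predict_revenue(advertising, x_min=None, x_max=None):
    """Predicted revenue (thousands of $) for advertising (thousands of $).

    x_min and x_max are the smallest and largest advertising values in the
    data; if given, predictions outside that range are refused.
    """
    if x_min is not None and advertising < x_min:
        raise ValueError("below the range of the data: extrapolation")
    if x_max is not None and advertising > x_max:
        raise ValueError("above the range of the data: extrapolation")
    return 12.45 + 0.78 * advertising

print(round(predict_revenue(50), 2))  # 51.45, i.e. about $51,450

# Hypothetical data range of 5 to 100 (thousands of $), for illustration only
try:
    predict_revenue(500, x_min=5, x_max=100)
except ValueError as err:
    print("Refused:", err)
```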
Using the TechStart regression equation:
\(\widehat{\text{Revenue}} = 12.45 + 0.78 \times \text{Advertising}\)
If advertising spending is $30,000, what revenue do we predict?
A. $24,850
B. $35,850
C. $42,450
D. $51,450

Residual: The difference between actual and predicted values
\[e = y - \widehat{y}\]
Properties:
Good model: Small residuals
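As a quick illustration, the sketch below computes \(e = y - \widehat{y}\) for each point in a small made-up dataset, using the least squares line for those five points (intercept 12, slope 0.8). The data and function name are hypothetical. A standard fact worth noticing: residuals from a least squares line with an intercept sum to zero.

```python
def residuals(xs, ys, b0, b1):
    """Actual minus predicted value for each data point."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Hypothetical illustration data (NOT the TechStart dataset)
advertising = [10, 20, 30, 40, 50]
revenue = [22, 25, 38, 41, 54]

# Least squares line for these five points: y-hat = 12 + 0.8 x
e = residuals(advertising, revenue, 12, 0.8)

for x, y, resid in zip(advertising, revenue, e):
    side = "above" if resid > 0 else "below"
    print(f"x={x}: actual={y}, residual={resid:+.1f} (point is {side} the line)")

print("Sum of residuals:", round(sum(e), 6))
```

A positive residual means the model underpredicted that point; a negative residual means it overpredicted.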
How much of the variability in y is explained by x?
\[R^2 = r^2\]
For TechStart Ventures
\[R^2 = (0.85)^2 \approx 0.72 = 72\%\]
Interpretation: 72% of the variability in Revenue is explained by Advertising spending.
Guidelines:
| \(R^2\) Value | Interpretation |
|---|---|
| 0.00 - 0.30 | Weak fit |
| 0.30 - 0.70 | Moderate fit |
| 0.70 - 1.00 | Strong fit |
The remaining 28% is due to other factors (product quality, market timing, competition, etc.)
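To connect "variability explained" with \(R^2 = r^2\), the sketch below computes \(R^2\) a second way, as one minus the ratio of leftover (residual) variability to total variability. For a least squares line with one predictor the two calculations agree. The function name and the five data points are hypothetical; the line 12 + 0.8x is the least squares fit for those points.

```python
import math

def r_squared(xs, ys, b0, b1):
    """Proportion of variability in y explained by the line b0 + b1*x."""
    y_bar = sum(ys) / len(ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - y_bar) ** 2 for y in ys)                      # total
    return 1 - sse / sst

# Hypothetical illustration data (NOT the TechStart dataset)
advertising = [10, 20, 30, 40, 50]
revenue = [22, 25, 38, 41, 54]

r2_from_fit = r_squared(advertising, revenue, 12, 0.8)

# Same quantity via the correlation coefficient
x_bar = sum(advertising) / len(advertising)
y_bar = sum(revenue) / len(revenue)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, revenue))
sxx = sum((x - x_bar) ** 2 for x in advertising)
syy = sum((y - y_bar) ** 2 for y in revenue)
r = sxy / math.sqrt(sxx * syy)

print(round(r2_from_fit, 4), round(r ** 2, 4))
```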
Method 1: Using Functions
Slope: =SLOPE(known_y's, known_x's)
Intercept: =INTERCEPT(known_y's, known_x's)
R-squared: =RSQ(known_y's, known_x's)
Method 2: Built-in Regression Tool
Tip
The trendline method is fastest and visualizes everything at once!
A regression model has \(R^2 = 0.40\).
What does this mean?
A. The correlation coefficient is 0.40
B. 40% of variability in y is explained by x
C. The model makes 40% errors
D. We’re 40% confident in predictions
Complete Analysis
Research Question: Can we predict startup revenue from advertising spending?
Scatterplot: Shows clear positive linear relationship
Correlation: \(r = 0.85\) (strong positive)
Regression Equation: \(\widehat{\text{Revenue}} = 12.45 + 0.78 \times \text{Advertising}\)
Interpretation:
Model Fit: \(R^2 = 0.72\) (72% of revenue variability explained)
Prediction: For $50,000 advertising → predict $51,450 revenue
Scatterplots visualize relationships between two quantitative variables
Correlation (r) measures strength and direction of linear relationships
Regression equation \(\hat{y} = b_0 + b_1x\) models the relationship
R-squared measures model fit (proportion of variability explained)
Use regression for prediction within data range (avoid extrapolation)
Next week: Inference with Regression
Coming Up
We’ll shift from describing relationships to making inferences about populations based on sample data!
Try these before next class:
Calculate correlation for: Revenue vs Menu_price data (provided in the same shared Sheet)
Given: \(\widehat{y} = -14.4 + 2.54x\)
If r = 0.95, what is \(R^2\)? What percentage of variability is explained?
Use Google Sheets to:
What questions do you have about:
See you next time for:
Inference with Regression
Office hours: I’ll be here after class if you need to talk.
STAT 17 – Fall 2025