HW2: From Data Exploration to Probability

🎬 The Case

Welcome back, Statistical Detective!

Your success with Brew Haven has caught the attention of StreamFlix, a growing streaming service competing with Netflix and Disney+. The VP of User Experience, Jordan Chen, has reached out with an urgent problem:

“We’re losing subscribers at an alarming rate, but we don’t understand the patterns. Some users cancel after trying our free trial, others leave after a few months. We track viewing habits, subscription tiers, and device usage, but we need someone who understands probability to help us predict churn risk and identify our most valuable customer segments. Can you help us make sense of these patterns?”

Jordan has provided real data from the industry alongside StreamFlix’s own subscriber records. Your job starts the way any good statistical investigation does: exploring the data before making probability-based claims.

Your mission: First, use exploratory data analysis to describe what the data looks like. Then, use probability concepts to help StreamFlix understand their subscriber patterns and reduce churn!

Part I: Exploratory Data Analysis

Question 1: What Does Streaming Content Look Like?

Before analyzing StreamFlix’s subscribers, Jordan wants you to understand the broader streaming landscape using publicly available data. This question uses the Netflix Titles dataset, collected and shared through the TidyTuesday project. It contains information on Netflix titles added between 2008 and 2021. The variable of interest is duration_min, the runtime in minutes of each movie in the catalog.

The summary statistics and visualizations below have been generated for you. Use them to answer the questions that follow.

a. Summary Statistics

Statistic	Value (minutes unless noted)
n (number of movies)	5,377
Mean	99.3
Median	98.0
Std. Deviation	28.5
IQR	28.0
Minimum	3.0
Maximum	312.0
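
For reference, statistics like those in the table can be computed with Python's standard library. The sketch below is only an illustration: the `runtimes` list is a small hypothetical sample, not the Netflix data, and the quartile method is an assumption (different software may use a slightly different convention and report a slightly different IQR).

```python
import statistics

# Hypothetical runtimes in minutes (illustrative only, not the Netflix data).
runtimes = [80, 90, 95, 100, 105, 110, 190]

mean = statistics.mean(runtimes)
median = statistics.median(runtimes)
std_dev = statistics.stdev(runtimes)  # sample standard deviation

# Quartiles; the "inclusive" method is one of several common conventions.
q1, _, q3 = statistics.quantiles(runtimes, n=4, method="inclusive")
iqr = q3 - q1

print(f"n={len(runtimes)} mean={mean:.1f} median={median:.1f}")
print(f"std dev={std_dev:.1f} IQR={iqr:.1f} min={min(runtimes)} max={max(runtimes)}")
```

In this made-up sample, the single long runtime (190) pulls the mean and the standard deviation upward while the median and IQR barely move, which is the kind of comparison parts (c) and (d) ask about.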

b. Distribution Plots

[Figure: Histogram of Netflix movie runtimes in minutes]

[Figure: Boxplot of Netflix movie runtimes in minutes]

c. Describe the CENTER of the distribution

Using the summary table above:

What are the mean and median runtimes? Are they close to each other or notably different?
Which measure of center better represents a “typical” Netflix movie runtime, and why?

d. Describe the SPREAD of the distribution

Using the summary table above:

What are the standard deviation and IQR? What do these values tell you about how much movie runtimes vary?
Which measure of spread is more appropriate here, and why?

e. Describe the SHAPE of the distribution

Using the histogram and boxplot above, along with the relationship between the mean and median:

Is the distribution symmetric, left-skewed, or right-skewed? Describe what you see.
Are there any unusual features (outliers, gaps, multiple peaks)? What might explain them?
A classmate says: “Most Netflix movies are about 90 minutes long.” Does the output above support or challenge this claim? Use specific values from the summary table to justify your answer.

f. Bridge to Probability

Jordan points out that streaming platforms use viewing patterns and content characteristics to estimate probabilities — for example, the likelihood that a subscriber will watch a documentary, or the chance a long movie gets completed in one sitting.

Reflect briefly (2–3 sentences): How does understanding the distribution of a variable (its center, spread, and shape) help set up probability-based questions about that variable? What would you want to know about the distribution before making probability claims?

Part II: Probability

Question 2: Probability Foundations

StreamFlix has 10,000 current subscribers. Answer the following:

a. Define these terms in the context of StreamFlix:

Experiment:
Sample Space:
Event:
Complement:

b. Let A = “subscriber watches action movies” and B = “subscriber watches on a mobile device”.

Write in words what each represents:

\(A^c\):
\(A \cup B\):
\(A \cap B\):
\(P(A|B)\):

Question 3: Independent vs. Mutually Exclusive Events

Jordan is confused about independence and mutual exclusivity.

a. Definitions

Explain in YOUR OWN WORDS (no textbook definitions):

What makes two events independent?
What makes two events mutually exclusive?
Can two events be both independent AND mutually exclusive? Explain why or why not.

b. Classify these pairs

For each pair, state whether the events are Independent, Mutually Exclusive, Both, or Neither. Justify each answer.

Event Pair	Classification	Justification
“Subscribes to Premium tier” and “Subscribes to Basic tier”
“Watches content on weekends” and “Prefers comedy genre”
“Cancels within 30 days” and “Completes free trial”
“Uses mobile app” and “Uses smart TV app” (same subscriber can use both)

Question 4: Addition and Multiplication Rules

StreamFlix data shows:

35% of subscribers watch action movies
45% of subscribers watch comedies
15% of subscribers watch both action movies AND comedies
60% of Premium subscribers renew for another month
40% of subscribers are on the Premium tier

a. Using the Addition Rule

Calculate \(P(\text{Action OR Comedy})\)

Show your work using the formula: \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)

b. More Addition Rule Practice

What’s the probability a randomly selected subscriber watches NEITHER action nor comedy?
Show your calculation:
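
As a reminder of how the Addition Rule and the complement rule fit together, here is a minimal sketch. The probabilities are hypothetical values chosen for illustration, not the StreamFlix figures, and the variable names are made up for this example. Exact fractions are used so the arithmetic can be checked by hand.

```python
from fractions import Fraction

# Hypothetical probabilities for two events A and B (not the StreamFlix figures).
p_a = Fraction(1, 2)
p_b = Fraction(3, 10)
p_a_and_b = Fraction(1, 10)

# Addition Rule: subtract the overlap so it is not counted twice.
p_a_or_b = p_a + p_b - p_a_and_b

# Complement rule: "neither A nor B" is the complement of "A or B".
p_neither = 1 - p_a_or_b

print(p_a_or_b, float(p_a_or_b))
print(p_neither, float(p_neither))
```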

c. Using the Multiplication Rule

What’s the probability that a randomly selected subscriber is BOTH on the Premium tier AND renews for another month?

Assume renewal and tier choice are independent, so the 60% renewal rate applies to subscribers on every tier. Show your work using: \(P(A \cap B) = P(A) \times P(B)\)
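
The Multiplication Rule also has a general form, \(P(A \cap B) = P(A) \times P(B|A)\), which reduces to the product of the two probabilities only when the events are independent. The sketch below contrasts the two forms using hypothetical values (not the StreamFlix figures); all names are illustrative.

```python
from fractions import Fraction

# Hypothetical values (not the StreamFlix figures).
p_a = Fraction(1, 4)          # P(A)
p_b = Fraction(1, 2)          # P(B)
p_b_given_a = Fraction(4, 5)  # P(B | A)

# If A and B are independent: P(A and B) = P(A) * P(B).
p_joint_if_independent = p_a * p_b

# In general: P(A and B) = P(A) * P(B | A); no independence assumption needed.
p_joint_general = p_a * p_b_given_a

# The two results agree only when P(B | A) equals P(B).
print(p_joint_if_independent, p_joint_general, p_b_given_a == p_b)
```

Here the two answers differ because the made-up \(P(B|A)\) is not equal to \(P(B)\), so assuming independence would misstate the joint probability.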

d. Critical Thinking

In part (c), we assumed independence. List TWO reasons why renewal and tier choice might NOT actually be independent in reality:

Question 5: Contingency Table Analysis

StreamFlix surveyed 1,000 subscribers about their viewing habits and subscription status. Here’s the data:

	Canceled Subscription	Active Subscription	Total
Watches Daily	80	420	500
Watches Weekly	120	230	350
Watches Rarely	100	50	150
Total	300	700	1,000

a. Define probability types

Using this table as context, explain:

Joint Probability:
Marginal Probability:
Conditional Probability:

b. Calculate the following (show work as fractions, then decimals):

\(P(\text{Watches Daily})\) =
\(P(\text{Canceled})\) =
\(P(\text{Watches Daily AND Canceled})\) =
\(P(\text{Watches Daily OR Canceled})\) =
\(P(\text{Canceled | Watches Daily})\) =
\(P(\text{Canceled | Watches Rarely})\) =
\(P(\text{Watches Daily | Active})\) =

c. Interpretation

Compare \(P(\text{Canceled | Watches Daily})\) with \(P(\text{Canceled | Watches Rarely})\). What does this tell Jordan about viewing frequency and churn risk?
Is viewing frequency independent of subscription status? Provide statistical evidence using the definition: Events A and B are independent if \(P(A|B) = P(A)\).
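
The mechanics of reading marginal, joint, and conditional probabilities off a contingency table can be sketched as follows. The counts, the plan names, and the helper functions are all hypothetical and chosen for illustration; they are not the StreamFlix survey data.

```python
from fractions import Fraction

# Hypothetical survey counts (not the StreamFlix table).
counts = {
    ("Plan X", "Canceled"): 30, ("Plan X", "Active"): 70,
    ("Plan Y", "Canceled"): 60, ("Plan Y", "Active"): 40,
}
total = sum(counts.values())

def row_total(row):
    """Sum the counts across one row of the table."""
    return sum(n for (r, _), n in counts.items() if r == row)

def col_total(col):
    """Sum the counts down one column of the table."""
    return sum(n for (_, c), n in counts.items() if c == col)

# Marginal probabilities: a row or column total over the grand total.
p_x = Fraction(row_total("Plan X"), total)
p_canceled = Fraction(col_total("Canceled"), total)

# Joint probability: a single cell over the grand total.
p_x_and_canceled = Fraction(counts[("Plan X", "Canceled")], total)

# Union, via the Addition Rule.
p_x_or_canceled = p_x + p_canceled - p_x_and_canceled

# Conditional probability: a single cell over the total of the given row.
p_canceled_given_x = Fraction(counts[("Plan X", "Canceled")], row_total("Plan X"))

# Independence check: does conditioning on Plan X change P(Canceled)?
independent = p_canceled_given_x == p_canceled

print(p_x, p_canceled, p_x_and_canceled, p_x_or_canceled)
print(p_canceled_given_x, independent)
```

The key difference between the joint and conditional calculations is the denominator: the grand total for a joint probability, the total of the conditioning row or column for a conditional one.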

Question 6: Device Usage Analysis

StreamFlix tracks which devices subscribers use. Here’s data from 2,000 subscribers:

	Mobile Only	TV Only	Both Devices	Total
Premium Tier	180	320	300	800
Standard Tier	240	360	280	880
Basic Tier	150	130	40	320
Total	570	810	620	2,000

a. Calculate:

\(P(\text{Premium Tier})\) =
\(P(\text{Both Devices})\) =
\(P(\text{Premium AND Both Devices})\) =
\(P(\text{Both Devices | Premium})\) =
\(P(\text{Premium | Both Devices})\) =

b. Comparison

Notice that \(P(\text{Both Devices | Premium}) \neq P(\text{Premium | Both Devices})\)

Explain in words what EACH probability means and why they’re different:

c. Business Insight

Which tier has the highest percentage of multi-device users? Show your calculation.
What should Jordan conclude about the relationship between tier choice and device usage?
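
Comparing groups by percentage means computing a conditional distribution within each row, and reversing the condition changes the denominator. A minimal sketch with a hypothetical two-by-two table (the tier and device labels and the counts are made up, not the StreamFlix data):

```python
from fractions import Fraction

# Hypothetical counts (not the StreamFlix table): tier by device.
table = {
    "Tier 1": {"Phone": 20, "TV": 60},
    "Tier 2": {"Phone": 60, "TV": 60},
}

# P(device | tier): divide each cell by its ROW total.
for tier, row in table.items():
    row_total = sum(row.values())
    shares = {device: Fraction(n, row_total) for device, n in row.items()}
    print(tier, {d: f"{float(s):.0%}" for d, s in shares.items()})

# Same cell, two different denominators.
cell = table["Tier 1"]["TV"]
tv_total = sum(row["TV"] for row in table.values())
p_tv_given_tier1 = Fraction(cell, sum(table["Tier 1"].values()))  # cell / row total
p_tier1_given_tv = Fraction(cell, tv_total)                       # cell / column total

print(p_tv_given_tier1, p_tier1_given_tv)
```

Both conditional probabilities share the numerator 60, but one divides by the Tier 1 row total and the other by the TV column total, so they answer different questions.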

Question 7: Free Trial Conversion

StreamFlix offers a 30-day free trial. Historical data shows:

70% of trial users watch at least 5 hours of content
Of those who watch 5+ hours, 60% convert to paid subscribers
Of those who watch less than 5 hours, only 15% convert to paid subscribers

a. Create a tree diagram

Sketch a tree diagram showing:

First branch: Hours watched (5+ hours vs. <5 hours)
Second branch: Conversion outcome (Convert vs. Don’t Convert)
Label all probabilities on branches
Calculate all four endpoint probabilities

b. Calculate:

What’s the probability a trial user BOTH watches 5+ hours AND converts?
What’s the overall probability that a trial user converts (regardless of viewing)? Use the Law of Total Probability: \(P(\text{Convert}) = P(\text{Convert AND 5+ hours}) + P(\text{Convert AND <5 hours})\)
Given that a user converted to paid, what’s the probability they watched 5+ hours during the trial? Use Bayes’ Theorem and show your work.
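
The tree-diagram calculations above follow a fixed pattern: multiply along each path, add the paths that end in the outcome of interest, then divide to reverse the condition. The sketch below walks through that pattern for a hypothetical two-stage process with events H and C; the probabilities are made up for illustration and are not the StreamFlix trial figures.

```python
from fractions import Fraction

# Hypothetical two-stage process (not the StreamFlix trial figures).
p_h = Fraction(1, 2)              # P(H): first-stage branch
p_c_given_h = Fraction(4, 5)      # P(C | H)
p_c_given_not_h = Fraction(1, 5)  # P(C | not H)

# Endpoint probabilities: multiply along each path of the tree.
endpoints = {
    ("H", "C"): p_h * p_c_given_h,
    ("H", "not C"): p_h * (1 - p_c_given_h),
    ("not H", "C"): (1 - p_h) * p_c_given_not_h,
    ("not H", "not C"): (1 - p_h) * (1 - p_c_given_not_h),
}
assert sum(endpoints.values()) == 1  # the four endpoints cover every outcome

# Law of Total Probability: add the endpoints where C happens.
p_c = endpoints[("H", "C")] + endpoints[("not H", "C")]

# Bayes' Theorem: reverse the conditioning to get P(H | C).
p_h_given_c = endpoints[("H", "C")] / p_c

print(p_c, p_h_given_c)
```

Checking that the four endpoint probabilities sum to 1 is a quick way to catch a mislabeled branch before moving on to the later parts.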

c. Recommendation

Based on these probabilities, what strategy would you recommend to increase conversions?

Question 8: Content Preference Patterns

Jordan provides this data on genre preferences and age groups (1,500 subscribers):

	Action	Romance	Documentary	Total
Ages 18-30	240	180	80	500
Ages 31-50	200	220	180	600
Ages 51+	60	140	200	400
Total	500	540	460	1,500

a. Calculate these probabilities:

\(P(\text{Documentary})\) =
\(P(\text{Ages 18-30})\) =
\(P(\text{Documentary AND Ages 51+})\) =
\(P(\text{Documentary | Ages 51+})\) =
\(P(\text{Ages 51+ | Documentary})\) =

b. Independence check

Are age group and genre preference independent?

Test this by checking if \(P(\text{Action | Ages 18-30}) = P(\text{Action})\)

Show your calculations and state your conclusion.

c. Marketing insight

Based on the conditional probabilities, which genre should StreamFlix promote to each age group? Justify with data.

💭 Question 9: Detective’s Reflection

Reflect on your full investigation — from exploratory analysis to probability (5–7 sentences):

How did the EDA in Question 1 prepare you for the probability work that followed? What’s the relationship between describing a distribution and making probabilistic claims?
How does understanding conditional probability help businesses make better decisions?
What’s the difference between \(P(A|B)\) and \(P(B|A)\), and why does this matter in real applications?
How did contingency tables help you see patterns in the data?
What surprised you about independence vs. mutual exclusivity?
Name one other business or real-world scenario where both EDA and probability concepts would be crucial.

🎉 Excellent work, Statistical Detective! StreamFlix is grateful for your insights — from raw data all the way to probability!