STAT 7 Midterm Practice

Statistical Methods for Biological, Environmental, and Health Sciences

Author

Dr. Marcela Alfaro Córdoba

0.1 Instructions

This comprehensive practice set contains 100 questions to help you prepare for the STAT 7 Midterm Examination:

80 Multiple Choice Questions (Section I)
20 Short Answer Questions (Section II)

How to use this practice:

Work through questions systematically by topic
Check your answers against the solutions provided
Focus on understanding concepts, not just memorizing answers
If you struggle with a topic, review lecture materials and homework
Time yourself: aim for about 2 minutes per multiple choice, 5-6 minutes per short answer

Topics Covered:

Variables and data types
Study design and sampling methods
Descriptive statistics
Probability basics and rules
Contingency tables and independence
Diagnostic testing and Bayes’ Theorem
Binomial distribution
Normal distribution

1 Section I: Multiple Choice Questions (80 Questions)

1.1 Topic 1: Variables and Data Types (Questions 1-10)

1.1.1 Question 1

A researcher records the number of eggs in bird nests. This variable is:

Categorical - Nominal
Categorical - Ordinal
Numerical - Continuous
Numerical - Discrete

Solution

Answer: D) Numerical - Discrete

The number of eggs is countable (0, 1, 2, 3, …) and takes on whole number values only. Discrete variables are numerical but can only take specific, countable values.

1.1.2 Question 2

Blood type (A, B, AB, O) is an example of:

Numerical - Discrete
Numerical - Continuous
Categorical - Nominal
Categorical - Ordinal

Solution

Answer: C) Categorical - Nominal

Blood types are categories with no natural ordering. One blood type is not “more” or “less” than another.

1.1.3 Question 3

A hospital rates patient pain on a scale: None, Mild, Moderate, Severe, Extreme. This is:

Numerical - Discrete
Numerical - Continuous
Categorical - Nominal
Categorical - Ordinal

Solution

Answer: D) Categorical - Ordinal

These categories have a natural order (None < Mild < Moderate < Severe < Extreme), making it ordinal categorical.

1.1.4 Question 4

Water temperature measured in degrees Celsius is:

Categorical - Nominal
Categorical - Ordinal
Numerical - Discrete
Numerical - Continuous

Solution

Answer: D) Numerical - Continuous

Temperature can take any value within a range (e.g., 23.5°C, 23.51°C) and is measured on a continuous scale.

1.1.5 Question 5

Which of the following is a continuous numerical variable?

Number of patients admitted to a hospital
Tumor size measured in millimeters
Number of species in an ecosystem
Survival status (alive/dead)

Solution

Answer: B) Tumor size measured in millimeters

Tumor size can take any value within a range (e.g., 15.3 mm, 15.37 mm). The others are discrete (counts) or categorical.

1.1.6 Question 6

A study records whether each plant is diseased (Yes/No). This variable is:

Numerical - Continuous
Numerical - Discrete
Categorical - Ordinal
Categorical - Nominal

Solution

Answer: D) Categorical - Nominal

This is a binary categorical variable (two categories with no ordering).

1.1.7 Question 7

In a population of 10,000 trees, a researcher measures the diameter of 100 randomly selected trees. The average diameter of these 100 trees is:

A population parameter
A sample statistic
A population
A sample

Solution

Answer: B) A sample statistic

The average is calculated from a sample (100 trees), so it’s a statistic. If it were calculated from all 10,000 trees, it would be a parameter.

1.1.8 Question 8

The true average blood pressure of all adults in California is:

A sample
A statistic
A parameter
A variable

Solution

Answer: C) A parameter

A parameter is a numerical summary of a population. This describes the entire population of adults in California.

1.1.9 Question 9

Reaction time measured in milliseconds is best classified as:

Categorical - Ordinal
Numerical - Discrete
Numerical - Continuous
Categorical - Nominal

Solution

Answer: C) Numerical - Continuous

Time can be measured to arbitrary precision (e.g., 245.673 ms) and is continuous.

1.1.10 Question 10

Which variable is discrete?

Height of plants in centimeters
Weight of fish in grams
Number of mutations in a DNA sequence
pH level of water samples

Solution

Answer: C) Number of mutations in a DNA sequence

Mutations are countable (0, 1, 2, …). The others are continuous measurements.

1.2 Topic 2: Study Design and Sampling (Questions 11-20)

1.2.1 Question 11

A researcher randomly assigns 50 mice to receive a new drug and 50 mice to receive a placebo. This is an example of:

An observational study
A survey
An experiment
Convenience sampling

Solution

Answer: C) An experiment

The researcher actively assigns treatments (random assignment), making this an experiment.

1.2.2 Question 12

Which study design can establish a cause-and-effect relationship?

Observational study
Survey
Experiment with random assignment
Case study

Solution

Answer: C) Experiment with random assignment

Random assignment helps ensure groups are similar, allowing us to attribute differences to the treatment (causation).

1.2.3 Question 13

A researcher surveys people at a gym about their exercise habits. This sampling method is:

Simple random sampling
Stratified sampling
Systematic sampling
Convenience sampling

Solution

Answer: D) Convenience sampling

The researcher samples people who are easily accessible (at the gym), which is convenience sampling. Gym-goers probably exercise more than the general population, so the sample is prone to bias.

1.2.4 Question 14

To study fish populations in a lake, a researcher divides the lake into 5 depth zones and randomly selects 20 fish from each zone. This is:

Simple random sampling
Stratified sampling
Cluster sampling
Convenience sampling

Solution

Answer: B) Stratified sampling

The population is divided into strata (depth zones) and random samples are taken from each stratum.
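The stratified design described above can be sketched in code. This is only an illustration under stated assumptions: the fish IDs, the zone names, and the 200 fish per zone are hypothetical, and `stratified_sample` is a helper name introduced here, not something from the course materials.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a simple random sample of fixed size from every stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(units, n_per_stratum)
        for name, units in strata.items()
    }

# Hypothetical lake: 5 depth zones with 200 tagged fish each (assumed numbers).
zones = {
    f"zone_{z}": [f"fish_{z}_{i}" for i in range(200)]
    for z in range(1, 6)
}

sample = stratified_sample(zones, n_per_stratum=20, seed=7)
for zone, fish in sample.items():
    print(zone, len(fish))  # every zone contributes exactly 20 fish
```

Because every zone is sampled, no depth zone can be left out by chance, which is the point of stratifying.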

1.2.5 Question 15

In an experiment, neither the participants nor the researchers collecting data know who receives which treatment. This is called:

Random assignment
Blinding
Replication
Control

Solution

Answer: B) Blinding

Blinding prevents bias by keeping participants and/or researchers unaware of treatment assignments. When both groups are unaware, as here, the study is double-blind.

1.2.6 Question 16

A control group in an experiment:

Receives the treatment being tested
Provides a comparison for the treatment group
Is always larger than the treatment group
Must consist of volunteers only

Solution

Answer: B) Provides a comparison for the treatment group

The control group provides a baseline for comparison to see if the treatment has an effect.

1.2.7 Question 17

An ecologist wants to estimate mercury levels in all fish in a lake but only samples fish from one small cove. This is an example of:

Random sampling error
Sampling bias
Measurement error
Nonresponse bias

Solution

Answer: B) Sampling bias

The sample (fish from one cove) is not representative of the population (all fish in the lake).

1.2.8 Question 18

Which is TRUE about random assignment in experiments?

It eliminates all bias
It ensures the sample represents the population
It helps create comparable groups at the start
It is the same as random sampling

Solution

Answer: C) It helps create comparable groups at the start

Random assignment distributes potential confounding variables evenly across groups, making them comparable.

1.2.9 Question 19

A study finds that people who take vitamin supplements have better health outcomes. However, people who take supplements also tend to exercise more and eat healthier. This is an example of:

Random sampling
Blinding
Confounding
Replication

Solution

Answer: C) Confounding

Exercise and diet are confounding variables - they’re associated with both supplement use and health outcomes, making it unclear what causes the better health.

1.2.10 Question 20

Replication in experimental design means:

A) Repeating the experiment multiple times
B) Using multiple subjects in the experiment
C) Using a control group
D) Both A and B

Solution

Answer: D) Both A and B

Replication includes both using multiple experimental units (subjects) and repeating the entire experiment.

1.3 Topic 3: Descriptive Statistics - Center (Questions 21-30)

1.3.1 Question 21

For the dataset: 2, 3, 5, 5, 6, 8, the median is:

4
5
5.5
4.83

Solution

Answer: B) 5

With 6 values, the median is the average of the 3rd and 4th values: (5 + 5)/2 = 5.

1.3.2 Question 22

For the dataset: 1, 2, 2, 3, 3, 3, 4, the mode is:

Solution

Answer: C) 3

The mode is the most frequently occurring value. The value 3 appears 3 times, more than any other value.

1.3.3 Question 23

A distribution has mean = 50 and median = 45. This distribution is likely:

Left-skewed
Symmetric
Right-skewed
Uniform

Solution

Answer: C) Right-skewed

When mean > median, the distribution is typically right-skewed. The mean is pulled toward the tail by extreme high values.

1.3.4 Question 24

Recovery times (days) for 5 patients are: 3, 4, 5, 6, 25. Which measure best represents typical recovery time?

Mean (8.6 days)
Median (5 days)
Mode
Range

Solution

Answer: B) Median (5 days)

The median is resistant to outliers. The value 25 is an outlier that pulls the mean up, but most patients recover in 3-6 days.
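A short check of this reasoning, using the five recovery times from the question and Python's standard `statistics` module. Removing the outlier is done here only to show how much the mean moves; it is not a recommendation to discard data.

```python
from statistics import mean, median

recovery_days = [3, 4, 5, 6, 25]

print(mean(recovery_days))    # 8.6
print(median(recovery_days))  # 5

# Drop the single extreme value to see how each measure reacts.
without_outlier = [d for d in recovery_days if d != 25]
print(mean(without_outlier))    # 4.5
print(median(without_outlier))  # 4.5
```

The mean drops by more than 4 days when one value is removed, while the median barely changes.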

1.3.5 Question 25

For a perfectly symmetric distribution:

Mean < Median
Mean > Median
Mean = Median
Cannot determine the relationship

Solution

Answer: C) Mean = Median

In a symmetric distribution, the mean and median are equal because the distribution is balanced.

1.3.6 Question 26

Which statement is TRUE about the mean?

It is always equal to the median
It is not affected by outliers
It is affected by every value in the dataset
It is always a value that appears in the dataset

Solution

Answer: C) It is affected by every value in the dataset

The mean uses all data values in its calculation, so changing any value changes the mean.

1.3.7 Question 27

A dataset has values: 10, 12, 14, 16, 18. If we add 5 to each value, the new mean is:

The same as before
5 more than before
5 times the old mean
Cannot be determined

Solution

Answer: B) 5 more than before

Adding a constant to all values increases the mean by that constant. Original mean = 14, new mean = 19.

1.3.8 Question 28

Plant heights have mean = 30 cm and median = 35 cm. This suggests:

Right-skewed distribution
Left-skewed distribution
Symmetric distribution
No outliers

Solution

Answer: B) Left-skewed distribution

When mean < median, the distribution is typically left-skewed. The mean is pulled down by low outliers.

1.3.9 Question 29

A distribution with two distinct peaks is called:

Uniform
Skewed
Bimodal
Normal

Solution

Answer: C) Bimodal

A bimodal distribution has two modes (peaks), often indicating two different groups in the data.

1.3.10 Question 30

Which is resistant to outliers?

Mean
Range
Standard deviation
Median

Solution

Answer: D) Median

The median only depends on the middle value(s) and is not affected by how extreme the outliers are.

1.4 Topic 4: Descriptive Statistics - Spread (Questions 31-40)

1.4.1 Question 31

For a dataset with Q1 = 10 and Q3 = 20, the IQR is:

Solution

Answer: B) 10

IQR = Q3 - Q1 = 20 - 10 = 10.

1.4.2 Question 32

A value is considered an outlier if it is:

More than 1 IQR from the quartiles
More than 1.5 IQR from the quartiles
More than 2 IQR from the quartiles
More than 3 IQR from the quartiles

Solution

Answer: B) More than 1.5 IQR from the quartiles

Outliers are values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR).

1.4.3 Question 33

Given Q1 = 20, Q3 = 40, which value is an outlier?

Solution

Answer: D) 70

IQR = 40 - 20 = 20. Lower boundary = 20 - 1.5(20) = -10; upper boundary = 40 + 1.5(20) = 70. Values below -10 or above 70 are outliers. The value 70 sits exactly on the upper boundary, so strictly speaking only values beyond it would be flagged.
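The boundary arithmetic can be sketched as a small helper. The function names `outlier_fences` and `is_outlier` and the test value 75 are introduced here for illustration only.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) boundaries of the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3):
    """True if x lies strictly outside the fences."""
    lower, upper = outlier_fences(q1, q3)
    return x < lower or x > upper

lower, upper = outlier_fences(20, 40)
print(lower, upper)            # -10.0 70.0
print(is_outlier(70, 20, 40))  # False: 70 is exactly on the boundary
print(is_outlier(75, 20, 40))  # True
```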

1.4.4 Question 34

If all values in a dataset are identical, the standard deviation is:

Equal to the mean
Equal to 1
Equal to 0
Undefined

Solution

Answer: C) Equal to 0

If there’s no variability (all values the same), the standard deviation is 0.

1.4.5 Question 35

Dataset A has standard deviation 5. Dataset B has standard deviation 10. Both have the same mean. Which is true?

Dataset A has more variability
Dataset B has more variability
Both have the same variability
Cannot compare without seeing the data

Solution

Answer: B) Dataset B has more variability

Larger standard deviation = more spread/variability in the data.

1.4.6 Question 36

The range of a dataset is:

Q3 - Q1
Maximum - Minimum
The most frequent value
The middle value

Solution

Answer: B) Maximum - Minimum

Range is the difference between the largest and smallest values.

1.4.7 Question 37

Which measure of spread is most affected by outliers?

IQR
Range
Median
Q1

Solution

Answer: B) Range

Range uses the extreme values (max and min), so outliers directly affect it. IQR uses quartiles and is more resistant.

1.4.8 Question 38

If we multiply every value in a dataset by 3, the standard deviation:

Stays the same
Is multiplied by 3
Is multiplied by 9
Is divided by 3

Solution

Answer: B) Is multiplied by 3

Multiplying all values by a constant c multiplies the standard deviation by |c|. Here c = 3, so the standard deviation is multiplied by 3.
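To see both transformation rules at once (adding a constant, as in Question 27, and multiplying by a constant, as here), the sketch below reuses the Question 27 dataset. It assumes the sample standard deviation (n - 1 in the denominator), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

values = [10, 12, 14, 16, 18]
shifted = [v + 5 for v in values]  # add a constant to every value
scaled = [3 * v for v in values]   # multiply every value by a constant

print(mean(values), mean(shifted))   # the mean moves from 14 to 19
print(stdev(values), stdev(shifted)) # the spread is unchanged by shifting
print(stdev(scaled) / stdev(values)) # ratio is 3 (up to rounding)
```

Shifting moves the center without changing the spread; scaling changes both.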

1.4.9 Question 39

A small standard deviation indicates:

Values are spread far from the mean
Values are close to the mean
The distribution is skewed
There are outliers present

Solution

Answer: B) Values are close to the mean

Small standard deviation means low variability - data points cluster near the mean.

1.4.10 Question 40

The IQR represents the spread of:

All the data
The middle 50% of the data
The middle 25% of the data
The outer 50% of the data

Solution

Answer: B) The middle 50% of the data

IQR is the range from Q1 (25th percentile) to Q3 (75th percentile), covering the middle half of the data.

1.5 Topic 5: Probability Basics (Questions 41-50)

1.5.1 Question 41

If P(A) = 0.6, then P(not A) equals:

Solution

Answer: B) 0.4

P(not A) = 1 - P(A) = 1 - 0.6 = 0.4.

1.5.2 Question 42

In a study of 100 patients, 60 improved. The probability a randomly selected patient improved is:

0.40
0.50
0.60
1.00

Solution

Answer: C) 0.60

P(improved) = 60/100 = 0.60.

1.5.3 Question 43

Events A and B are mutually exclusive. This means:

P(A AND B) = 0
P(A AND B) = 1
P(A|B) = P(A)
P(A AND B) = P(A) × P(B)

Solution

Answer: A) P(A AND B) = 0

Mutually exclusive events cannot occur together, so their intersection has probability 0.

1.5.4 Question 44

If P(A) = 0.3 and P(B) = 0.4, and A and B are mutually exclusive, then P(A OR B) equals:

0.12
0.30
0.70
0.10

Solution

Answer: C) 0.70

For mutually exclusive events: P(A OR B) = P(A) + P(B) = 0.3 + 0.4 = 0.7.

1.5.5 Question 45

If P(A) = 0.5, P(B) = 0.3, and P(A AND B) = 0.15, then P(A OR B) equals:

0.65
0.80
0.50
0.35

Solution

Answer: A) 0.65

Using the addition rule: P(A OR B) = P(A) + P(B) - P(A AND B) = 0.5 + 0.3 - 0.15 = 0.65.

1.5.6 Question 46

Two events A and B are independent if:

P(A AND B) = 0
P(A|B) = P(A)
P(A OR B) = P(A) + P(B)
P(A) = P(B)

Solution

Answer: B) P(A|B) = P(A)

Events are independent if knowing B occurred doesn’t change the probability of A.

1.5.7 Question 47

If A and B are independent with P(A) = 0.4 and P(B) = 0.5, then P(A AND B) equals:

0.20
0.90
0.10
0.70

Solution

Answer: A) 0.20

For independent events: P(A AND B) = P(A) × P(B) = 0.4 × 0.5 = 0.20.

1.5.8 Question 48

A probability of 0 means:

The event always occurs
The event never occurs
The event occurs half the time
We don’t have enough information

Solution

Answer: B) The event never occurs

A probability of 0 indicates impossibility.

1.5.9 Question 49

Which value CANNOT be a probability?

Solution

Answer: D) 1.5

Probabilities must be between 0 and 1 (inclusive). 1.5 is greater than 1 and thus impossible.

1.5.10 Question 50

In a sample space of equally likely outcomes, if there are 8 favorable outcomes and 40 total outcomes, the probability is:

0.20
0.25
5.0
0.50

Solution

Answer: A) 0.20

P = favorable/total = 8/40 = 0.20.

1.6 Topic 6: Conditional Probability & Independence (Questions 51-60)

1.6.1 Question 51

If P(A) = 0.6, P(B) = 0.4, and P(A AND B) = 0.24, then P(A|B) equals:

0.24
0.40
0.60
0.96

Solution

Answer: C) 0.60

P(A|B) = P(A AND B) / P(B) = 0.24 / 0.4 = 0.60.

1.6.2 Question 52

Using the data from Question 51, are events A and B independent?

Yes, because P(A|B) = P(A)
No, because P(A|B) ≠ P(A)
Cannot determine
Yes, because P(A AND B) = P(A) × P(B)

Solution

Answer: A) Yes, because P(A|B) = P(A)

P(A|B) = 0.60 and P(A) = 0.60, so they’re equal, indicating independence. We can also verify: P(A) × P(B) = 0.6 × 0.4 = 0.24 = P(A AND B).
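Both independence checks can be verified exactly. The sketch below uses `fractions.Fraction` rather than floats, purely so the equality comparisons are not affected by rounding; the variable names are illustrative.

```python
from fractions import Fraction

p_a = Fraction(6, 10)         # P(A) = 0.6
p_b = Fraction(4, 10)         # P(B) = 0.4
p_a_and_b = Fraction(24, 100) # P(A AND B) = 0.24

p_a_given_b = p_a_and_b / p_b  # conditional probability P(A|B)

print(p_a_given_b)               # 3/5, i.e. 0.60
print(p_a_given_b == p_a)        # True: P(A|B) = P(A)
print(p_a_and_b == p_a * p_b)    # True: P(A AND B) = P(A) x P(B)
```

The two conditions are equivalent whenever P(B) > 0, so either one is enough to conclude independence.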

1.6.3 Question 53

In a group of 100 patients, 40 have disease D and 30 have symptom S. If 20 have both D and S, what is P(S|D)?

0.20
0.30
0.50
0.67

Solution

Answer: C) 0.50

P(S|D) = P(S AND D) / P(D) = (20/100) / (40/100) = 20/40 = 0.50.
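The same numbers can be laid out as a 2×2 contingency table. The sketch below fills in the remaining cells from the counts given in the question; the dictionary layout is just one convenient representation.

```python
total = 100   # patients
n_d = 40      # have disease D
n_s = 30      # have symptom S
n_both = 20   # have both D and S

# Fill in the four cells of the contingency table.
table = {
    ("D", "S"): n_both,
    ("D", "not S"): n_d - n_both,
    ("not D", "S"): n_s - n_both,
    ("not D", "not S"): total - n_d - n_s + n_both,
}

p_s_given_d = table[("D", "S")] / n_d  # restrict attention to the D row
p_s = n_s / total

print(table)
print(p_s_given_d, p_s)  # 0.5 0.3
```

Since P(S|D) = 0.5 differs from P(S) = 0.3, the symptom and the disease are not independent in this group.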

1.6.4 Question 54

If two events are independent, then:

They cannot occur together
They must have equal probabilities
Knowing one occurred doesn’t change the probability of the other
P(A OR B) = P(A) + P(B)

Solution

Answer: C) Knowing one occurred doesn’t change the probability of the other

This is the definition of independence: P(A|B) = P(A).

1.6.5 Question 55

Events that cannot both occur at the same time are:

Independent
Mutually exclusive
Conditional
Complementary

Solution

Answer: B) Mutually exclusive

Mutually exclusive (or disjoint) events have no overlap: P(A AND B) = 0.

1.6.6 Question 56

Can two events be both mutually exclusive AND independent (assuming neither has probability 0)?

Yes, always
No, never
Sometimes
Only if P(A) = P(B)

Solution

Answer: B) No, never

If events are mutually exclusive, P(A AND B) = 0. If they’re independent, P(A AND B) = P(A)×P(B). These can only both be true if P(A) = 0 or P(B) = 0.

1.6.7 Question 57

In a study, P(Recovery|Treatment) = 0.8 and P(Recovery|Placebo) = 0.5. This suggests:

Recovery and treatment are independent
Recovery depends on whether treatment was received
Recovery and treatment are mutually exclusive
The probabilities are incorrect

Solution

Answer: B) Recovery depends on whether treatment was received

The probability of recovery differs depending on the treatment, showing dependence (not independence).

1.6.8 Question 58

If P(A|B) = 0.7 and P(B) = 0.2, then P(A AND B) equals:

0.14
0.50
0.90
3.50

Solution

Answer: A) 0.14

From P(A|B) = P(A AND B)/P(B), we get: P(A AND B) = P(A|B) × P(B) = 0.7 × 0.2 = 0.14.

1.6.9 Question 59

In a contingency table, if P(A|B) = P(A), then:

A and B are mutually exclusive
A and B are independent
A and B are dependent
A and B are complementary

Solution

Answer: B) A and B are independent

When P(A|B) = P(A), knowing B doesn’t change the probability of A, which is the definition of independence.

1.6.10 Question 60

The complement of event A is:

All outcomes where A occurs
All outcomes where A does not occur
All outcomes where A and B occur
The same as P(A)

Solution

Answer: B) All outcomes where A does not occur

The complement (not A) consists of all outcomes in the sample space that are not in A.

1.7 Topic 7: Binomial Distribution (Questions 61-70)

1.7.1 Question 61

Which scenario follows a binomial distribution?

Time until a patient recovers
Number of heads in 10 coin flips
Weight of newborn babies
Types of birds observed

Solution

Answer: B) Number of heads in 10 coin flips

This has fixed n (10 trials), two outcomes (heads/tails), the same probability on every flip (0.5 for a fair coin), and independent flips.

1.7.2 Question 62

For a binomial distribution to apply, which is NOT required?

Fixed number of trials
Two possible outcomes per trial
Same probability for each trial
Trials must be dependent

Solution

Answer: D) Trials must be dependent

Binomial requires independence, not dependence.

1.7.3 Question 63

A binomial random variable has n = 20 and p = 0.3. The expected value E(X) is:

Solution

Answer: A) 6

E(X) = np = 20 × 0.3 = 6.

1.7.4 Question 64

For a binomial distribution with n = 50 and p = 0.6, the expected value is:

Solution

Answer: B) 30

E(X) = np = 50 × 0.6 = 30.

1.7.5 Question 65

A researcher plants 100 seeds, each with 0.8 probability of germinating independently. If X = number of seeds that germinate, then X follows:

Normal distribution
Binomial distribution with n=100, p=0.8
Uniform distribution
Binomial distribution with n=0.8, p=100

Solution

Answer: B) Binomial distribution with n=100, p=0.8

Fixed number of trials (100), two outcomes (germinate/don’t), same probability (0.8), independence.

1.7.6 Question 66

For the scenario in Question 65, what is E(X)?

Solution

Answer: C) 80

E(X) = np = 100 × 0.8 = 80 seeds expected to germinate.
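The shortcut E(X) = np can be checked against the definition of expected value, the sum of k × P(X = k) over all possible k. The sketch below writes out the binomial probability formula with `math.comb`; `binomial_pmf` is a helper name introduced here.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.8
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total_probability = sum(pmf)
expected_value = sum(k * prob for k, prob in enumerate(pmf))

print(round(total_probability, 6))  # 1.0
print(round(expected_value, 6))     # 80.0
```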

1.7.7 Question 67

If a binomial random variable has n = 10 and E(X) = 7, what is p?

Solution

Answer: C) 0.7

E(X) = np, so 7 = 10p, therefore p = 7/10 = 0.7.

1.7.8 Question 68

In a binomial distribution, if we increase p (probability of success) while keeping n fixed:

E(X) decreases
E(X) increases
E(X) stays the same
Cannot determine

Solution

Answer: B) E(X) increases

Since E(X) = np, increasing p while n is fixed increases the expected value.

1.7.9 Question 69

A binomial random variable can take values:

Only 0 and 1
Any integer from 0 to n
Any real number
Only positive values

Solution

Answer: B) Any integer from 0 to n

X counts the number of successes in n trials, so X can be 0, 1, 2, …, n.

1.7.10 Question 70

If X ~ Binomial(n=25, p=0.4) and Y = 3X + 5, then E(Y) equals:

Solution

Answer: D) 35

E(X) = np = 25 × 0.4 = 10. Then E(Y) = E(3X + 5) = 3E(X) + 5 = 3(10) + 5 = 35.
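A simulation offers a rough sanity check of this result. The sketch below is an illustration only: the seed and the 20,000 repetitions are arbitrary choices, and the simulated averages will be close to, not exactly equal to, 10 and 35.

```python
import random

def simulate_binomial(n, p, rng):
    """One draw of X ~ Binomial(n, p): count successes in n independent trials."""
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(2024)  # fixed seed so the run is repeatable
draws = [simulate_binomial(25, 0.4, rng) for _ in range(20_000)]
y_values = [3 * x + 5 for x in draws]

mean_x = sum(draws) / len(draws)
mean_y = sum(y_values) / len(y_values)
print(mean_x, mean_y)  # close to 10 and 35
```

Whatever the simulated mean of X turns out to be, the mean of Y equals 3 times it plus 5, which is the linearity rule E(3X + 5) = 3E(X) + 5 at work.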

1.8 Topic 8: Normal Distribution (Questions 71-80)

1.8.1 Question 71

A normal distribution is characterized by:

Skewed to the right
Bell-shaped and symmetric
Uniform across all values
Two distinct peaks

Solution

Answer: B) Bell-shaped and symmetric

The normal distribution has a characteristic bell shape and is symmetric around its mean.

1.8.2 Question 72

According to the 68-95-99.7 rule, approximately what percentage of data falls within 1 standard deviation of the mean?

50%
68%
95%
99.7%

Solution

Answer: B) 68%

About 68% of data in a normal distribution falls within μ ± 1σ.

1.8.3 Question 73

Heights are normally distributed with mean 170 cm and standard deviation 10 cm. Approximately what percentage of heights are between 160 cm and 180 cm?

50%
68%
95%
99.7%

Solution

Answer: B) 68%

160 and 180 are one standard deviation below and above the mean (170 ± 10), so about 68% fall in this range.

1.8.4 Question 74

For the distribution in Question 73, approximately what percentage of heights are between 150 cm and 190 cm?

68%
95%
99.7%
50%

Solution

Answer: B) 95%

150 and 190 are two standard deviations below and above the mean (170 ± 20), so about 95% fall in this range.
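The 68% and 95% figures are rounded. For comparison, the exact normal probabilities for Questions 73 and 74 can be computed with `statistics.NormalDist` from the Python standard library:

```python
from statistics import NormalDist

heights = NormalDist(mu=170, sigma=10)

# Probability of falling between two heights = difference of the CDF values.
within_1_sd = heights.cdf(180) - heights.cdf(160)
within_2_sd = heights.cdf(190) - heights.cdf(150)

print(round(within_1_sd, 4))  # 0.6827
print(round(within_2_sd, 4))  # 0.9545
```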

1.8.5 Question 75

A z-score of 2 means the value is:

2 units above the mean
2 standard deviations above the mean
2% above the mean
Equal to 2

Solution

Answer: B) 2 standard deviations above the mean

The z-score tells us how many standard deviations a value is from the mean.

1.8.6 Question 76

If μ = 100, σ = 15, and x = 130, the z-score is:

Solution

Answer: B) 2

z = (x - μ)/σ = (130 - 100)/15 = 30/15 = 2.
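The z-score formula as a one-line function. The second call uses a hypothetical value (x = 85) that is not part of the question, to show a negative z-score.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

print(z_score(130, 100, 15))  # 2.0
print(z_score(85, 100, 15))   # -1.0
```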

1.8.7 Question 77

A normal distribution with mean 50 and standard deviation 5. Approximately what percentage of values are above 60?

2.5%
16%
50%
84%

Solution

Answer: A) 2.5%

60 is two standard deviations above the mean (50 + 2×5). About 95% fall within 2 SD, leaving 5% outside. By symmetry, 2.5% are above.
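The 2.5% figure comes from the rounded 95% rule. As a sketch, the exact tail area can be compared with the rule-of-thumb value using `statistics.NormalDist`:

```python
from statistics import NormalDist

dist = NormalDist(mu=50, sigma=5)

exact_tail = 1 - dist.cdf(60)   # area above 60 under the normal curve
rule_of_thumb = (1 - 0.95) / 2  # half of the 5% outside 2 SD

print(round(exact_tail, 4))     # 0.0228
print(round(rule_of_thumb, 4))  # 0.025
```

The two values agree to within a quarter of a percentage point, which is why 2.5% is the expected answer.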

1.8.8 Question 78

Which distribution is appropriate for data that is bell-shaped and symmetric?

Binomial
Uniform
Normal
Skewed

Solution

Answer: C) Normal

The normal distribution is used for bell-shaped, symmetric data.

1.8.9 Question 79

For a normal distribution, the mean, median, and mode are:

All different
All equal
Mean > Median > Mode
Mode > Mean > Median

Solution

Answer: B) All equal

The normal distribution is symmetric with a single peak, so mean = median = mode.

1.8.10 Question 80

A negative z-score indicates:

The value is above the mean
The value is below the mean
An error in calculation
The value equals the mean

Solution

Answer: B) The value is below the mean

Negative z-scores occur when x < μ, indicating the value is below the mean.

2 Section II: Short Answer Questions (20 Questions)

2.1 Questions 81-90: Conceptual Understanding

2.1.1 Question 81

Explain the difference between a parameter and a statistic. Give an example of each in a biological research context.

Solution

A parameter is a numerical summary of a population, while a statistic is a numerical summary of a sample.

Example:
- Parameter: The true average wingspan of all monarch butterflies in North America (we’d need to measure every monarch to know this).
- Statistic: The average wingspan calculated from a sample of 100 monarch butterflies captured and measured.

The key difference is that statistics are calculated from data we collect, while parameters are true (but usually unknown) values describing the entire population.

2.1.2 Question 82

A researcher wants to determine if a new fertilizer increases plant growth. Describe how to design this as a proper experiment, including random assignment, control group, and replication.

Solution

Proper Experimental Design:

Random Assignment: Randomly assign plants to two groups - treatment group (receives new fertilizer) and control group (receives standard fertilizer or no fertilizer).
Control Group: The control group provides a baseline for comparison. Without it, we wouldn’t know if observed growth is due to the fertilizer or other factors.
Replication: Use many plants in each group (e.g., 50 per group) to ensure results aren’t due to chance variation. The entire experiment could also be repeated multiple times.
Control Variables: Keep other conditions identical (light, water, temperature, soil type).
Measurement: Measure plant growth after a fixed period for all plants.

This design allows us to establish causation because random assignment creates comparable groups, and any difference in growth can be attributed to the fertilizer.
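The random assignment step could be carried out as sketched below. The plant labels, the group size of 50, and the helper name `randomly_assign` are assumptions made for illustration.

```python
import random

def randomly_assign(units, seed=None):
    """Shuffle the units and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(units)  # copy so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical labels for 100 plants.
plants = [f"plant_{i:03d}" for i in range(100)]
treatment, control = randomly_assign(plants, seed=42)

print(len(treatment), len(control))  # 50 50
```

Because chance alone decides which plant lands in which group, neither the researcher nor the plants' initial condition can systematically favor one group.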

2.1.3 Question 83

Explain why the median is often preferred over the mean when a distribution is skewed. Use an example.

Solution

The median is preferred for skewed distributions because it’s resistant to outliers, while the mean is pulled toward the tail.

Example: Hospital stay lengths (in days): 1, 2, 2, 3, 3, 4, 5, 45

Mean = (1+2+2+3+3+4+5+45)/8 = 65/8 = 8.125 days
Median = (3+3)/2 = 3 days

Most patients (7 out of 8) stayed 1-5 days, but one patient stayed 45 days. The median (3 days) better represents a “typical” stay. The mean (8.125 days) is misleading because it’s pulled up by the one extreme value.

Why this matters: In right-skewed data (like income, hospital stays, recovery times), the median gives a more accurate picture of what’s typical.

2.1.4 Question 84

What does it mean for two events to be independent? Explain using a medical testing example.

Solution

Independence means that knowing one event occurred does not change the probability of the other event occurring. Mathematically: P(A|B) = P(A).

Medical Example: Consider testing for two different diseases - Disease X and Disease Y.

Suppose:
- P(Disease X) = 0.10 (10% of people have it)
- P(Disease Y) = 0.05 (5% of people have it)

If the diseases are independent, then:
- P(Disease X | has Disease Y) = P(Disease X) = 0.10

This means: among people who have Disease Y, still only 10% have Disease X - the same as in the general population. Having Disease Y doesn’t change the probability of having Disease X.

If the diseases were dependent (not independent), knowing someone has Disease Y would change the probability they have Disease X (e.g., if both diseases were caused by the same risk factor).

2.1.5 Question 85

Describe the difference between sampling bias and random sampling error. Why is sampling bias more problematic?

Solution

Random Sampling Error:
- Natural variation that occurs when sampling
- Occurs even with proper random sampling
- Unpredictable - sometimes sample mean is too high, sometimes too low
- Decreases with larger sample size
- Example: Randomly sampling 100 fish from a lake - by chance, you might get slightly larger fish than the true population average

Sampling Bias:
- Systematic error due to poor sampling method
- Sample consistently over- or under-represents certain groups
- Predictable direction - always pulls results the same way
- Does NOT decrease with larger sample size
- Example: Only sampling fish from one cove (easier to access) when fish in that cove are systematically larger than in other parts of the lake

Why bias is worse: You can reduce random error by taking larger samples, but no amount of data will fix bias. A biased sampling method will give you the wrong answer even with huge samples.
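A small simulation can make the contrast concrete. Everything in the sketch below is hypothetical: an invented lake with 9,000 open-water fish (mean length 30 cm) and 1,000 larger fish in one cove (mean length 40 cm), with an arbitrary seed. It compares the error of a random sample with the error of a cove-only sample as the sample size grows.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical lake (assumed numbers): cove fish are systematically larger.
open_water = [rng.gauss(30, 5) for _ in range(9_000)]
cove = [rng.gauss(40, 5) for _ in range(1_000)]
population = open_water + cove
true_mean = mean(population)

for n in (25, 100, 400):
    random_sample = rng.sample(population, n)  # proper random sample
    cove_sample = rng.sample(cove, n)          # convenient but biased sample
    random_error = mean(random_sample) - true_mean
    bias_error = mean(cove_sample) - true_mean
    print(n, round(random_error, 2), round(bias_error, 2))
```

In runs like this, the random-sample error tends to shrink toward 0 as n grows, while the cove-only error stays near 9 cm regardless of sample size.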

2.1.6 Question 86

A study finds that people who drink coffee have lower rates of heart disease. Can we conclude that coffee prevents heart disease? Why or why not?

Solution

No, we cannot conclude that coffee prevents heart disease from this observational study alone.

Reasons:

Correlation ≠ Causation: This is an observational study showing an association, not an experiment showing causation.
Potential Confounding: There may be other variables associated with both coffee drinking and heart disease:
- Coffee drinkers might exercise more
- Coffee drinkers might have higher income (better healthcare)
- Coffee drinkers might have lower stress levels
- Many other lifestyle factors could differ
Reverse Causation: Maybe people with heart disease are advised to avoid coffee, creating the observed pattern.

To establish causation, we’d need a randomized experiment where people are randomly assigned to drink coffee or not, and all other factors are controlled. Only then could we attribute differences in heart disease to the coffee itself.

The observational study suggests a relationship worth investigating, but doesn’t prove coffee is the cause.

2.1.7 Question 87

Explain what the 68-95-99.7 rule tells us about a normal distribution. Why is this useful?

Solution

The 68-95-99.7 Rule (Empirical Rule) states that for a normal distribution:
- Approximately 68% of data falls within 1 standard deviation of the mean (μ ± σ)
- Approximately 95% of data falls within 2 standard deviations of the mean (μ ± 2σ)
- Approximately 99.7% of data falls within 3 standard deviations of the mean (μ ± 3σ)

Why this is useful:

Quick estimates: We can quickly estimate what percentage of data falls in a range without complex calculations
Identify unusual values: Values more than 2 standard deviations from the mean are uncommon (about 5% of values), and values more than 3 standard deviations from the mean are rare (about 0.3%)
Real-world application: Many biological measurements (height, blood pressure, birth weight) approximately follow a normal distribution

Example: If adult male heights are normally distributed with mean 175 cm and SD 7 cm:
- About 68% of men are between 168 and 182 cm (175 ± 7)
- About 95% of men are between 161 and 189 cm (175 ± 14)
- A man who is 196 cm (175 + 3×7 = 196) is 3 SD above the mean; only about 0.15% of men are taller (very unusual)

2.1.8 Question 88

What is a confounding variable? Give an example from environmental or health research.

Solution

A confounding variable is a variable that:
1. Is associated with both the independent variable (exposure) and the dependent variable (outcome)
2. Creates a false or misleading association between the exposure and outcome
3. Makes it difficult to determine if the exposure actually causes the outcome

Example from Health Research:

Question: Does vitamin supplement use improve health outcomes?

Observed: People who take vitamins have better health

Confounding Variables:
- Exercise: People who take vitamins may also exercise more. Exercise is associated with both taking vitamins (health-conscious behavior) and better health.
- Income: Wealthier people may both take vitamins AND have better healthcare access
- Diet: People who take vitamins may eat healthier overall

The Problem: Without accounting for these confounders, we can’t tell if better health is due to:

The vitamins themselves, OR
Exercise, income, diet, or other factors

Solution: Random assignment in an experiment tends to balance these confounding variables across groups (on average, including confounders we did not measure). In observational studies, we can measure known confounders and control for them statistically.

2.1.9 Question 89

Describe what sensitivity and specificity mean for a medical diagnostic test. Why are both important?

Solution

Sensitivity = P(Test Positive | Disease Present)

The probability that the test correctly identifies someone who HAS the disease
Measures how good the test is at catching disease (detecting true positives)
High sensitivity means few false negatives

Specificity = P(Test Negative | Disease Absent)

The probability that the test correctly identifies someone who DOES NOT have the disease
Measures how good the test is at ruling out disease in healthy people (avoiding false positives)
High specificity means few false positives

Why both are important:

High Sensitivity is crucial when:

Missing disease has serious consequences (e.g., cancer screening)
Disease is treatable if caught early
We want to catch all possible cases (even if we get some false alarms)

High Specificity is crucial when:

False positives cause harm (unnecessary treatment, anxiety, cost)
Follow-up tests are invasive or expensive
Disease is rare (without high specificity, false positives from the many healthy people would swamp the true positives)

Ideal: High sensitivity AND high specificity, but there’s often a trade-off. Tests are designed based on which type of error is more costly in that particular situation.

Example: HIV screening uses high-sensitivity tests (don’t want to miss anyone infected), with positive results confirmed by high-specificity tests (ensure they truly have HIV before diagnosis).

2.1.10 Question 90

Explain what expected value E(X) means for a binomial random variable. What does it tell us in practical terms?

Solution

For a binomial random variable, E(X) = np represents the expected value or mean of the distribution.

What it means:

Long-run average: If we repeated the random process many times, the average number of successes would be close to E(X)
Most likely neighborhood: Values near E(X) are more probable than values far from E(X)
NOT a prediction for a single run of the experiment: E(X) might not even be a possible outcome!

Practical Example:

A biologist plants 100 seeds, each with 0.70 probability of germinating.

X = number of seeds that germinate
E(X) = np = 100 × 0.70 = 70 seeds

This tells us:

On average, we expect 70 seeds to germinate
If we planted 100 seeds many times, the average number germinating would be about 70
Values near 70 (like 68, 69, 71, 72) are more likely than values far from 70 (like 50 or 90)
We won’t necessarily get exactly 70 seeds germinating in any particular planting of 100 seeds due to random variation

Important: E(X) is the center of the distribution, but actual outcomes will vary around this value according to the binomial distribution.
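The long-run average idea can be checked by simulation. The sketch below is an illustration only, assuming Python's `random` module with an arbitrary fixed seed and 2,000 repeated plantings (both are illustrative choices, not part of the question). The function name `seeds_germinated` is hypothetical.

```python
import random

random.seed(7)  # arbitrary fixed seed so the run is repeatable

n, p = 100, 0.70   # seeds per planting, germination probability
repeats = 2_000    # number of simulated plantings (illustrative choice)

def seeds_germinated(n, p):
    """Simulate one planting: count how many of n seeds germinate."""
    return sum(random.random() < p for _ in range(n))

counts = [seeds_germinated(n, p) for _ in range(repeats)]
long_run_average = sum(counts) / repeats

print(f"np = {n * p:.0f}")
print(f"average over {repeats} plantings = {long_run_average:.2f}")
print(f"plantings with exactly 70 germinating: {counts.count(70)} of {repeats}")
```

The average across plantings should land close to 70, while individual plantings hit exactly 70 only some of the time, which is the distinction between E(X) and a single outcome.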

2.2 Questions 91-100: Calculations and Applications

2.2.1 Question 91

A wildlife researcher captures and measures 50 randomly selected deer from a forest of 2000 deer. The average weight of the 50 deer is 68 kg.

Is 68 kg a parameter or a statistic?
What population is the researcher interested in?
What is the sample in this study?

Solution

Statistic - It’s calculated from a sample (50 deer), not the entire population (2000 deer).
Population: All 2000 deer in the forest. The researcher wants to learn about the average weight of all deer in this forest.
Sample: The 50 deer that were captured and measured.

Key distinction: The true average weight of all 2000 deer is a parameter (unknown). The 68 kg is our sample statistic, an estimate of that parameter.

2.2.2 Question 92

Recovery times (in weeks) for 8 patients after surgery are: 2, 3, 3, 4, 4, 5, 6, 14

Calculate the mean and median.
Which measure better represents typical recovery time? Why?
Is the value 14 an outlier? Show your work.

Solution

Calculations:

Mean = (2+3+3+4+4+5+6+14)/8 = 41/8 = 5.125 weeks
Median: Middle values are 4 and 4, so median = (4+4)/2 = 4 weeks

Median (4 weeks) is better because:

Most patients (7 out of 8) recovered in 2-6 weeks
One patient took 14 weeks (an outlier)
The mean is pulled up by this outlier
The median better represents what’s “typical”

Check for outlier using IQR method:

Q1 = 3 (between 2nd and 3rd values)
Q3 = 5.5 (between 6th and 7th values)
IQR = Q3 - Q1 = 5.5 - 3 = 2.5
Upper boundary = Q3 + 1.5(IQR) = 5.5 + 1.5(2.5) = 5.5 + 3.75 = 9.25
Since 14 > 9.25, yes, 14 is an outlier
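The IQR check above can be reproduced in a few lines. This is a minimal sketch that assumes the hand method used in the solution (Q1 and Q3 as medians of the lower and upper halves). Software defaults often use other quartile conventions and may return slightly different values. The function name `quartiles_by_halves` is made up for illustration.

```python
from statistics import median

def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    data = sorted(values)
    half = len(data) // 2      # for odd n, the middle value is left out of both halves
    return median(data[:half]), median(data[-half:])

recovery_weeks = [2, 3, 3, 4, 4, 5, 6, 14]

q1, q3 = quartiles_by_halves(recovery_weeks)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in recovery_weeks if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences: {lower_fence} to {upper_fence}")
print(f"outliers: {outliers}")
```

With the recovery times from this question, the only value outside the fences is 14.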

2.2.3 Question 93

In a study of 200 patients: 120 received Treatment A, 80 received Treatment B. Among Treatment A patients, 90 improved. Among Treatment B patients, 40 improved.

What is P(Improved)?
What is P(Improved | Treatment A)?
Are “Improved” and “Treatment A” independent? Justify your answer.

Solution

P(Improved) = Total improved / Total patients = (90+40)/200 = 130/200 = 0.65 or 65%
P(Improved | Treatment A) = Improved in A / Total in A = 90/120 = 0.75 or 75%
Check independence: Are P(Improved|Treatment A) and P(Improved) equal?

P(Improved|Treatment A) = 0.75
P(Improved) = 0.65
Since 0.75 ≠ 0.65, NO, they are NOT independent

Interpretation: The probability of improvement depends on which treatment was received. Treatment A patients have a higher improvement rate (75%) than the overall rate (65%), showing that improvement and treatment type are related (dependent).
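The independence check compares a conditional probability with an unconditional one. A small sketch using exact fractions, so that the comparison is not affected by rounding (the dictionary layout is an illustrative choice):

```python
from fractions import Fraction

# Counts from the question
improved = {"A": 90, "B": 40}
treated = {"A": 120, "B": 80}

p_improved = Fraction(sum(improved.values()), sum(treated.values()))
p_improved_given_a = Fraction(improved["A"], treated["A"])
p_improved_given_b = Fraction(improved["B"], treated["B"])

# Independence would require P(Improved | A) == P(Improved)
independent = p_improved_given_a == p_improved

print(f"P(Improved)     = {float(p_improved):.2f}")
print(f"P(Improved | A) = {float(p_improved_given_a):.2f}")
print(f"P(Improved | B) = {float(p_improved_given_b):.2f}")
print(f"independent: {independent}")
```

The Treatment B rate (40/80 = 0.50) is included for contrast. It falls below the overall rate, just as the Treatment A rate sits above it.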

2.2.4 Question 94

A screening test for a disease has sensitivity = 0.92 and specificity = 0.88. The disease prevalence is 3% in the population.

If a person has the disease, what is the probability they test positive?
If a person does NOT have the disease, what is the probability they test negative?
Explain why a positive test doesn’t necessarily mean the person has the disease.

Solution

P(Test + | Disease) = Sensitivity = 0.92 or 92%
P(Test - | No Disease) = Specificity = 0.88 or 88%
Why positive ≠ definitely has disease:

Even though the test is fairly accurate, the disease is rare (3% prevalence). This means:

Out of 1000 people: About 30 have disease, 970 don’t
Of the 30 with disease: ~28 test positive (sensitivity 92%)
Of the 970 without disease: ~116 test positive (1-specificity = 12% false positive rate)

So we’d expect about 28 + 116 = 144 total positive tests, but only 28 actually have the disease!

P(Disease|Test+) = 28/144 ≈ 19%

This is called the positive predictive value (PPV) - only about 19% of positive tests actually indicate disease. The low prevalence combined with imperfect specificity leads to many false positives.

This is why positive screening tests usually require confirmatory testing!
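The counting argument above is Bayes' Theorem applied to 1000 people. A minimal sketch of the same calculation without rounding the counts (the function name `positive_predictive_value` is illustrative):

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(Disease | Test+) from Bayes' Theorem."""
    true_positive = sensitivity * prevalence                # P(Test+ and Disease)
    false_positive = (1 - specificity) * (1 - prevalence)   # P(Test+ and No Disease)
    return true_positive / (true_positive + false_positive)

ppv = positive_predictive_value(sensitivity=0.92, specificity=0.88, prevalence=0.03)
print(f"PPV = {ppv:.3f}")

# Same test, more common disease: the PPV rises with prevalence
ppv_common = positive_predictive_value(sensitivity=0.92, specificity=0.88, prevalence=0.30)
print(f"PPV at 30% prevalence = {ppv_common:.3f}")
```

The unrounded value is about 0.192, consistent with the 19% found from whole-person counts. The second call uses a hypothetical 30% prevalence to show how strongly the PPV depends on how common the disease is.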

2.2.5 Question 95

A geneticist studies inheritance of a trait. Each offspring has 0.25 probability of expressing the trait, independent of siblings. A mating pair has 6 offspring.

What distribution models the number of offspring expressing the trait? State the parameters.
What is the expected number of offspring expressing the trait?
What is the probability that NO offspring express the trait? (Write the formula, you don’t need to compute the final number.)

Solution

Binomial distribution with parameters:

n = 6 (number of offspring)
p = 0.25 (probability of expressing trait)

Justification:

Fixed number of trials (6 offspring)
Two outcomes (expresses trait or not)
Same probability for each (0.25)
Independence (siblings inherit independently)

E(X) = np = 6 × 0.25 = 1.5 offspring

On average, we expect 1.5 out of 6 offspring to express the trait.

P(X = 0) formula:

\[P(X = 0) = \binom{6}{0}(0.25)^0(0.75)^6 = 1 \times 1 \times (0.75)^6\]

This gives us P(X = 0) ≈ 0.178 or about 17.8% chance that no offspring express the trait.
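The same formula gives the probability for every possible count, not only X = 0. A short sketch that tabulates the whole distribution, assuming Python's `math.comb` for the binomial coefficient:

```python
from math import comb

n, p = 6, 0.25  # offspring, probability of expressing the trait

# Binomial formula: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
pmf = {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

for k, prob in pmf.items():
    print(f"P(X = {k}) = {prob:.4f}")

total = sum(pmf.values())                            # should be 1
expected = sum(k * prob for k, prob in pmf.items())  # should equal np
print(f"total = {total}, E(X) = {expected}")
```

The probabilities sum to 1, and the weighted average of the counts reproduces E(X) = np = 1.5.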

2.2.6 Question 96

Birth weights are normally distributed with mean 3400 grams and standard deviation 500 grams.

Between what two weights do approximately 95% of births fall?
What percentage of babies weigh more than 4400 grams?
A baby weighing 2400 grams has what z-score?

Solution

95% fall within 2 SD of mean:

Lower bound: μ - 2σ = 3400 - 2(500) = 3400 - 1000 = 2400 grams
Upper bound: μ + 2σ = 3400 + 2(500) = 3400 + 1000 = 4400 grams
Answer: Between 2400 and 4400 grams

4400 grams is 2 SD above mean:

95% fall within 2 SD, leaving 5% outside
By symmetry: 2.5% below 2400g and 2.5% above 4400g
Answer: Approximately 2.5%

Z-score calculation: \[z = \frac{x - \mu}{\sigma} = \frac{2400 - 3400}{500} = \frac{-1000}{500} = -2\]

Interpretation: A 2400g baby is 2 standard deviations below the mean.
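The answers above use the 68-95-99.7 rule, which is an approximation. As a rough comparison, the sketch below computes the exact normal values with the standard library's `statistics.NormalDist` (the variable names are illustrative):

```python
from statistics import NormalDist

birth_weight = NormalDist(mu=3400, sigma=500)  # grams

# z-score for a 2400 g baby
z_2400 = (2400 - birth_weight.mean) / birth_weight.stdev

# Exact share of babies above 4400 g (the rule says about 2.5%)
exact_above_4400 = 1 - birth_weight.cdf(4400)

# Exact central 95% interval (the rule says 2400 to 4400 g)
low = birth_weight.inv_cdf(0.025)
high = birth_weight.inv_cdf(0.975)

print(f"z = {z_2400}")
print(f"P(X > 4400) = {exact_above_4400:.4f}")
print(f"central 95%: {low:.0f} g to {high:.0f} g")
```

The exact tail is about 2.3% rather than 2.5%, and the exact central 95% runs from roughly 2420 g to 4380 g (μ ± 1.96σ). The rule's answers are close enough for quick estimates.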

2.2.7 Question 97

Consider the following contingency table for 500 patients:

	Has Fever	No Fever	Total
Flu	60	20	80
No Flu	80	340	420
Total	140	360	500

What is P(Flu)?
What is P(Fever | Flu)?
What is P(Fever | No Flu)?
Does having fever indicate flu is more likely? Explain.

Solution

P(Flu) = 80/500 = 0.16 or 16%
P(Fever | Flu) = 60/80 = 0.75 or 75%
P(Fever | No Flu) = 80/420 ≈ 0.19 or 19%
Yes, fever indicates flu is more likely:

Let’s compare P(Flu | Fever) to P(Flu):

P(Flu | Fever) = 60/140 ≈ 0.43 or 43%
P(Flu) = 0.16 or 16%

Among patients with fever, 43% have flu, compared to only 16% of all 500 patients. So knowing someone has a fever increases the probability they have flu from 16% to 43%.

Also note: 75% of flu patients have fever, but only 19% of non-flu patients have fever, showing fever is associated with flu.
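P(Flu | Fever) was read directly from the fever column, but it can also be obtained from Bayes' Theorem using the answers to the first two parts. A small sketch with exact fractions confirms the two routes agree:

```python
from fractions import Fraction

# Counts from the contingency table
flu_fever, flu_total = 60, 80
fever_total, n_patients = 140, 500

p_flu = Fraction(flu_total, n_patients)
p_fever = Fraction(fever_total, n_patients)
p_fever_given_flu = Fraction(flu_fever, flu_total)

# Bayes' Theorem: P(Flu | Fever) = P(Fever | Flu) * P(Flu) / P(Fever)
p_flu_given_fever_bayes = p_fever_given_flu * p_flu / p_fever

# Direct reading from the table
p_flu_given_fever_table = Fraction(flu_fever, fever_total)

print(p_flu_given_fever_bayes, p_flu_given_fever_table)
print(f"{float(p_flu_given_fever_bayes):.2f}")
```

Both routes give 3/7, or about 0.43.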

2.2.8 Question 98

A researcher monitors 30 bird nests. Historically, 60% of nests successfully fledge at least one chick. Let X = number of successful nests.

What are the parameters n and p?
Calculate E(X) and interpret it.
If monitoring costs $15 per nest plus a fixed $50 cost, and Y = total cost, what is E(Y)?

Solution

Parameters:

n = 30 (number of nests monitored)
p = 0.60 (probability of success per nest)

E(X) = np = 30 × 0.60 = 18 nests

Interpretation: On average, we expect 18 out of 30 nests to successfully fledge at least one chick. If we monitored many sets of 30 nests, the average number of successful nests would be close to 18.

Cost calculation:

Y = 15X + 50 (where X is number of successful nests)
E(Y) = 15·E(X) + 50 (using property of expected value)
E(Y) = 15(18) + 50 = 270 + 50 = $320

Interpretation: On average, the total monitoring cost is expected to be $320.

Note: This reading treats the $15 as a cost per successful nest, so that Y depends on X. If the $15 applies to every nest monitored regardless of success, the cost is not random: Y = 15(30) + 50 = $500 always.
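The property of expected value used in the cost calculation, E(aX + b) = a·E(X) + b, follows from the definition of expected value for a discrete variable. A short derivation, with a and b standing for any constants:

\[E(aX + b) = \sum_{k} (ak + b)\,P(X = k) = a\sum_{k} k\,P(X = k) + b\sum_{k} P(X = k) = a\,E(X) + b\]

The last step uses the fact that the probabilities sum to 1. With a = 15, b = 50 and E(X) = 18, this gives 15(18) + 50 = 320.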

2.2.9 Question 99

Data on plant heights (cm): 10, 12, 12, 15, 18, 20, 22, 35

Calculate Q1, Q3, and IQR.
Determine the outlier boundaries.
Are there any outliers?
Describe the shape of this distribution.

Solution

Quartiles (with 8 values):

Q1 is between 2nd and 3rd values: Q1 = (12+12)/2 = 12 cm
Q3 is between 6th and 7th values: Q3 = (20+22)/2 = 21 cm
IQR = Q3 - Q1 = 21 - 12 = 9 cm

Outlier boundaries:

Lower: Q1 - 1.5(IQR) = 12 - 1.5(9) = 12 - 13.5 = -1.5 cm
Upper: Q3 + 1.5(IQR) = 21 + 1.5(9) = 21 + 13.5 = 34.5 cm

Check for outliers:

All values ≥ -1.5, so no low outliers
35 > 34.5, so 35 is an outlier

Distribution shape:

Most values clustered 10-22 cm (relatively symmetric)
One extreme value (35) to the right
Right-skewed (positive skew) due to the high outlier

We’d also expect mean > median for right-skewed data.
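The mean-versus-median comparison can be checked directly. A minimal sketch using the standard library's `statistics` module, which also shows how much of the gap comes from the single value 35:

```python
from statistics import mean, median

plant_heights_cm = [10, 12, 12, 15, 18, 20, 22, 35]

avg = mean(plant_heights_cm)
mid = median(plant_heights_cm)
print(f"mean = {avg}, median = {mid}")

# Same comparison with the outlier (35) removed
without_outlier = plant_heights_cm[:-1]
avg_trimmed = mean(without_outlier)
mid_trimmed = median(without_outlier)
print(f"without 35: mean = {avg_trimmed:.2f}, median = {mid_trimmed}")
```

With all eight values the mean (18) exceeds the median (16.5). Without the 35, the two are much closer, which fits the description of a roughly symmetric cluster plus one high outlier.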

2.2.10 Question 100

In a clinical trial, patients are randomly assigned to receive either Drug X or placebo. Neither the patients nor the doctors evaluating outcomes know who received which treatment.

What experimental design feature is described by “neither patients nor doctors know”?
Why is this important?
What role does random assignment play?
If 75% of Drug X patients improve vs. 50% of placebo patients, can we conclude Drug X causes improvement? Why?

Solution

Double-blinding (or just “blinding”)

Patients don’t know = single blind
Doctors also don’t know = double blind

Why blinding is important:

Without blinding:

Placebo effect: Patients who know they’re getting the drug might improve due to expectation/belief
Observer bias: Doctors who know which patients got the drug might unconsciously evaluate them more favorably
Treatment bias: Doctors might treat groups differently if they know who got what

With blinding:

Differences in outcomes can be attributed to the drug itself rather than to expectations
Psychological and observational biases are greatly reduced

Random assignment’s role:

Creates comparable groups at the start
Distributes potential confounding variables evenly across groups
Ensures any pre-existing differences are due to chance, not systematic bias
Makes groups similar, on average, in all ways except the treatment
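How random assignment balances a characteristic across groups can be illustrated with made-up data. The sketch below is hypothetical throughout: the 200 patients, their ages, and the seed are invented for illustration and are not part of the trial described in the question.

```python
import random
from statistics import mean

random.seed(100)  # arbitrary seed so the illustration is repeatable

# Hypothetical patients: only an age is recorded (made-up data)
ages = [random.randint(20, 80) for _ in range(200)]

# Random assignment: shuffle the patient indices, then split the list in half
patients = list(range(len(ages)))
random.shuffle(patients)
drug_group, placebo_group = patients[:100], patients[100:]

mean_age_drug = mean(ages[i] for i in drug_group)
mean_age_placebo = mean(ages[i] for i in placebo_group)

print(f"mean age, Drug X group:  {mean_age_drug:.1f}")
print(f"mean age, placebo group: {mean_age_placebo:.1f}")
```

The two group means should be close, even though age played no part in the assignment. The same balancing applies, on average, to characteristics that were never measured.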

Yes, we can conclude Drug X causes improvement, provided the difference is too large to be explained by chance:

This is a well-designed experiment with:

1. Random assignment → comparable groups
2. Control group (placebo) → proper comparison
3. Blinding → reduces bias
4. Clear difference (75% vs 50%)

All these features together allow us to establish causation, not just correlation. The 25 percentage point difference can be attributed to Drug X itself, not to confounding variables or bias.

However, we should note:

Results apply to patients similar to those in the trial
Statistical significance should be verified (the question does not give the sample sizes)
Clinical significance should be considered (is a 25 percentage point improvement large enough to matter?)

2.3 Summary

You’ve completed all 100 practice questions! Here’s how they break down:

Multiple Choice (80 questions):

Variables and Data Types: 10 questions
Study Design and Sampling: 10 questions
Descriptive Statistics - Center: 10 questions
Descriptive Statistics - Spread: 10 questions
Probability Basics: 10 questions
Conditional Probability & Independence: 10 questions
Binomial Distribution: 10 questions
Normal Distribution: 10 questions

Short Answer (20 questions):

Conceptual Understanding: 10 questions
Calculations and Applications: 10 questions

Study Recommendations:

Review any topics where you struggled - go back to lecture notes and homework
Practice explaining concepts - can you teach them to someone else?
Time yourself - can you work efficiently under exam conditions?
Focus on interpretation - don’t just calculate, understand what results mean
Use the actual exam format - 15 MC + 2 FR questions in 90 minutes

Good luck on your midterm!