STAT 7 Midterm Practice
Statistical Methods for Biological, Environmental, and Health Sciences
0.1 Instructions
This comprehensive practice set contains 100 questions to help you prepare for the STAT 7 Midterm Examination:
- 80 Multiple Choice Questions (Section I)
- 20 Short Answer Questions (Section II)
How to use this practice:
- Work through questions systematically by topic
- Check your answers against the solutions provided
- Focus on understanding concepts, not just memorizing answers
- If you struggle with a topic, review lecture materials and homework
- Time yourself: aim for about 2 minutes per multiple-choice question and 5-6 minutes per short-answer question
Topics Covered:
- Variables and data types
- Study design and sampling methods
- Descriptive statistics
- Probability basics and rules
- Contingency tables and independence
- Diagnostic testing and Bayes’ Theorem
- Binomial distribution
- Normal distribution
1 Section I: Multiple Choice Questions (80 Questions)
1.1 Topic 1: Variables and Data Types (Questions 1-10)
1.1.1 Question 1
A researcher records the number of eggs in bird nests. This variable is:
- Categorical - Nominal
- Categorical - Ordinal
- Numerical - Continuous
- Numerical - Discrete
Answer: D) Numerical - Discrete
The number of eggs is countable (0, 1, 2, 3, …) and takes on whole number values only. Discrete variables are numerical but can only take specific, countable values.
1.1.2 Question 2
Blood type (A, B, AB, O) is an example of:
- Numerical - Discrete
- Numerical - Continuous
- Categorical - Nominal
- Categorical - Ordinal
Answer: C) Categorical - Nominal
Blood types are categories with no natural ordering. One blood type is not “more” or “less” than another.
1.1.3 Question 3
A hospital rates patient pain on a scale: None, Mild, Moderate, Severe, Extreme. This is:
- Numerical - Discrete
- Numerical - Continuous
- Categorical - Nominal
- Categorical - Ordinal
Answer: D) Categorical - Ordinal
These categories have a natural order (None < Mild < Moderate < Severe < Extreme), making it ordinal categorical.
1.1.4 Question 4
Water temperature measured in degrees Celsius is:
- Categorical - Nominal
- Categorical - Ordinal
- Numerical - Discrete
- Numerical - Continuous
Answer: D) Numerical - Continuous
Temperature can take any value within a range (e.g., 23.5°C, 23.51°C) and is measured on a continuous scale.
1.1.5 Question 5
Which of the following is a continuous numerical variable?
- Number of patients admitted to a hospital
- Tumor size measured in millimeters
- Number of species in an ecosystem
- Survival status (alive/dead)
Answer: B) Tumor size measured in millimeters
Tumor size can take any value within a range (e.g., 15.3 mm, 15.37 mm). The others are discrete (counts) or categorical.
1.1.6 Question 6
A study records whether each plant is diseased (Yes/No). This variable is:
- Numerical - Continuous
- Numerical - Discrete
- Categorical - Ordinal
- Categorical - Nominal
Answer: D) Categorical - Nominal
This is a binary categorical variable: it has two categories (Yes/No) with no natural ordering, so it is nominal.
1.1.7 Question 7
In a population of 10,000 trees, a researcher measures the diameter of 100 randomly selected trees. The average diameter of these 100 trees is:
- A population parameter
- A sample statistic
- A population
- A sample
Answer: B) A sample statistic
The average is calculated from a sample (100 trees), so it’s a statistic. If it were calculated from all 10,000 trees, it would be a parameter.
1.1.8 Question 8
The true average blood pressure of all adults in California is:
- A sample
- A statistic
- A parameter
- A variable
Answer: C) A parameter
A parameter is a numerical summary of a population. This describes the entire population of adults in California.
1.1.9 Question 9
Reaction time measured in milliseconds is best classified as:
- Categorical - Ordinal
- Numerical - Discrete
- Numerical - Continuous
- Categorical - Nominal
Answer: C) Numerical - Continuous
Time can be measured to arbitrary precision (e.g., 245.673 ms) and is continuous.
1.1.10 Question 10
Which variable is discrete?
- Height of plants in centimeters
- Weight of fish in grams
- Number of mutations in a DNA sequence
- pH level of water samples
Answer: C) Number of mutations in a DNA sequence
Mutations are countable (0, 1, 2, …). The others are continuous measurements.
1.2 Topic 2: Study Design and Sampling (Questions 11-20)
1.2.1 Question 11
A researcher randomly assigns 50 mice to receive a new drug and 50 mice to receive a placebo. This is an example of:
- An observational study
- A survey
- An experiment
- Convenience sampling
Answer: C) An experiment
The researcher actively assigns treatments (random assignment), making this an experiment.
1.2.2 Question 12
Which study design can establish a cause-and-effect relationship?
- Observational study
- Survey
- Experiment with random assignment
- Case study
Answer: C) Experiment with random assignment
Random assignment helps ensure groups are similar, allowing us to attribute differences to the treatment (causation).
1.2.3 Question 13
A researcher surveys people at a gym about their exercise habits. This sampling method is:
- Simple random sampling
- Stratified sampling
- Systematic sampling
- Convenience sampling
Answer: D) Convenience sampling
The researcher samples people who are easily accessible (those at the gym), which is convenience sampling. It is prone to bias because gym-goers likely exercise more than the general population.
1.2.4 Question 14
To study fish populations in a lake, a researcher divides the lake into 5 depth zones and randomly selects 20 fish from each zone. This is:
- Simple random sampling
- Stratified sampling
- Cluster sampling
- Convenience sampling
Answer: B) Stratified sampling
The population is divided into strata (depth zones) and random samples are taken from each stratum.
1.2.5 Question 15
In an experiment, neither the participants nor the researchers collecting data know who receives which treatment. This is called:
- Random assignment
- Blinding
- Replication
- Control
Answer: B) Blinding
Blinding prevents bias by keeping participants and/or researchers unaware of treatment assignments. When both are unaware, as in this question, the study is double-blind.
1.2.6 Question 16
A control group in an experiment:
- Receives the treatment being tested
- Provides a comparison for the treatment group
- Is always larger than the treatment group
- Must consist of volunteers only
Answer: B) Provides a comparison for the treatment group
The control group provides a baseline for comparison to see if the treatment has an effect.
1.2.7 Question 17
An ecologist wants to estimate mercury levels in all fish in a lake but only samples fish from one small cove. This is an example of:
- Random sampling error
- Sampling bias
- Measurement error
- Nonresponse bias
Answer: B) Sampling bias
The sample (fish from one cove) is not representative of the population (all fish in the lake).
1.2.8 Question 18
Which is TRUE about random assignment in experiments?
- It eliminates all bias
- It ensures the sample represents the population
- It helps create comparable groups at the start
- It is the same as random sampling
Answer: C) It helps create comparable groups at the start
Random assignment tends to balance potential confounding variables across groups, making the groups comparable on average.
1.2.9 Question 19
A study finds that people who take vitamin supplements have better health outcomes. However, people who take supplements also tend to exercise more and eat healthier. This is an example of:
- Random sampling
- Blinding
- Confounding
- Replication
Answer: C) Confounding
Exercise and diet are confounding variables - they’re associated with both supplement use and health outcomes, making it unclear what causes the better health.
1.2.10 Question 20
Replication in experimental design means:
- Repeating the experiment multiple times
- Using multiple subjects in the experiment
- Using a control group
- Both A and B
Answer: D) Both A and B
Replication includes both using multiple experimental units (subjects) and repeating the entire experiment.
1.3 Topic 3: Descriptive Statistics - Center (Questions 21-30)
1.3.1 Question 21
For the dataset: 2, 3, 5, 5, 6, 8, the median is:
- 4
- 5
- 5.5
- 4.83
Answer: B) 5
With 6 values, the median is the average of the 3rd and 4th values: (5 + 5)/2 = 5.
1.3.2 Question 22
For the dataset: 1, 2, 2, 3, 3, 3, 4, the mode is:
- 1
- 2
- 3
- 4
Answer: C) 3
The mode is the most frequently occurring value. The value 3 appears 3 times, more than any other value.
1.3.3 Question 23
A distribution has mean = 50 and median = 45. This distribution is likely:
- Left-skewed
- Symmetric
- Right-skewed
- Uniform
Answer: C) Right-skewed
When mean > median, the distribution is typically right-skewed. The mean is pulled toward the tail by extreme high values.
1.3.4 Question 24
Recovery times (days) for 5 patients are: 3, 4, 5, 6, 25. Which measure best represents typical recovery time?
- Mean (8.6 days)
- Median (5 days)
- Mode
- Range
Answer: B) Median (5 days)
The median is resistant to outliers. The value 25 is an outlier that pulls the mean up, but most patients recover in 3-6 days.
1.3.5 Question 25
For a perfectly symmetric distribution:
- Mean < Median
- Mean > Median
- Mean = Median
- Cannot determine the relationship
Answer: C) Mean = Median
In a symmetric distribution, the mean and median are equal because the distribution is balanced.
1.3.6 Question 26
Which statement is TRUE about the mean?
- It is always equal to the median
- It is not affected by outliers
- It is affected by every value in the dataset
- It is always a value that appears in the dataset
Answer: C) It is affected by every value in the dataset
The mean uses all data values in its calculation, so changing any value changes the mean.
1.3.7 Question 27
A dataset has values: 10, 12, 14, 16, 18. If we add 5 to each value, the new mean is:
- The same as before
- 5 more than before
- 5 times the old mean
- Cannot be determined
Answer: B) 5 more than before
Adding a constant to all values increases the mean by that constant. Original mean = 14, new mean = 19.
1.3.8 Question 28
Plant heights have mean = 30 cm and median = 35 cm. This suggests:
- Right-skewed distribution
- Left-skewed distribution
- Symmetric distribution
- No outliers
Answer: B) Left-skewed distribution
When mean < median, the distribution is typically left-skewed. The mean is pulled down by low outliers.
1.3.9 Question 29
A distribution with two distinct peaks is called:
- Uniform
- Skewed
- Bimodal
- Normal
Answer: C) Bimodal
A bimodal distribution has two modes (peaks), often indicating two different groups in the data.
1.3.10 Question 30
Which is resistant to outliers?
- Mean
- Range
- Standard deviation
- Median
Answer: D) Median
The median only depends on the middle value(s) and is not affected by how extreme the outliers are.
1.4 Topic 4: Descriptive Statistics - Spread (Questions 31-40)
1.4.1 Question 31
For a dataset with Q1 = 10 and Q3 = 20, the IQR is:
- 5
- 10
- 15
- 20
Answer: B) 10
IQR = Q3 - Q1 = 20 - 10 = 10.
1.4.2 Question 32
A value is considered an outlier if it is:
- More than 1 IQR from the quartiles
- More than 1.5 IQR from the quartiles
- More than 2 IQR from the quartiles
- More than 3 IQR from the quartiles
Answer: B) More than 1.5 IQR from the quartiles
Outliers are values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR).
1.4.3 Question 33
Given Q1 = 20, Q3 = 40, which value is an outlier?
- 10
- 15
- 50
- 70
Answer: D) 70
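IQR = 40 - 20 = 20. Lower boundary = 20 - 1.5(20) = -10; upper boundary = 40 + 1.5(20) = 70. The values 10, 15, and 50 lie well inside these boundaries. The value 70 sits exactly on the upper boundary, so it is the only option that reaches the outlier threshold and is the best answer; under the strict rule from Question 32, only values above 70 would be flagged.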
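The boundary arithmetic for Question 33 can be checked with a short Python sketch. The helper names `iqr_fences` and `is_outlier` are illustrative only, and the strict inequality is an assumption that follows the rule as worded in Question 32.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier boundaries for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(x, lower, upper):
    """Strict rule: a value is flagged only if it lies beyond a boundary."""
    return x < lower or x > upper


lower, upper = iqr_fences(20, 40)  # Q1 = 20, Q3 = 40 from Question 33
for value in (10, 15, 50, 70):
    print(value, is_outlier(value, lower, upper))
```

With these quartiles the boundaries are -10 and 70, so a value of exactly 70 is not flagged by the strict rule, while any larger value would be.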
1.4.4 Question 34
If all values in a dataset are identical, the standard deviation is:
- Equal to the mean
- Equal to 1
- Equal to 0
- Undefined
Answer: C) Equal to 0
If there’s no variability (all values the same), the standard deviation is 0.
1.4.5 Question 35
Dataset A has standard deviation 5. Dataset B has standard deviation 10. Both have the same mean. Which is true?
- Dataset A has more variability
- Dataset B has more variability
- Both have the same variability
- Cannot compare without seeing the data
Answer: B) Dataset B has more variability
Larger standard deviation = more spread/variability in the data.
1.4.6 Question 36
The range of a dataset is:
- Q3 - Q1
- Maximum - Minimum
- The most frequent value
- The middle value
Answer: B) Maximum - Minimum
Range is the difference between the largest and smallest values.
1.4.7 Question 37
Which measure of spread is most affected by outliers?
- IQR
- Range
- Median
- Q1
Answer: B) Range
Range uses the extreme values (max and min), so outliers directly affect it. IQR uses quartiles and is more resistant.
1.4.8 Question 38
If we multiply every value in a dataset by 3, the standard deviation:
- Stays the same
- Is multiplied by 3
- Is multiplied by 9
- Is divided by 3
Answer: B) Is multiplied by 3
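Multiplying all values by a constant multiplies the standard deviation by the absolute value of that constant (here, 3). It is the variance, not the standard deviation, that is multiplied by 9.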
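The effects of adding and multiplying by a constant (Questions 27 and 38) can be verified numerically. This is a minimal sketch using Python's standard `statistics` module; it reuses the dataset from Question 27, and the choice of the sample standard deviation (`stdev`) is an assumption, since the same scaling holds for the population version.

```python
from statistics import mean, stdev

values = [10, 12, 14, 16, 18]       # dataset from Question 27
shifted = [x + 5 for x in values]   # add a constant to every value
scaled = [3 * x for x in values]    # multiply every value by a constant

print(mean(values), mean(shifted))     # the mean moves by the added constant
print(stdev(values), stdev(shifted))   # a shift leaves the spread unchanged
print(stdev(scaled) / stdev(values))   # scaling by 3 scales the SD by about 3
```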
1.4.9 Question 39
A small standard deviation indicates:
- Values are spread far from the mean
- Values are close to the mean
- The distribution is skewed
- There are outliers present
Answer: B) Values are close to the mean
Small standard deviation means low variability - data points cluster near the mean.
1.4.10 Question 40
The IQR represents the spread of:
- All the data
- The middle 50% of the data
- The middle 25% of the data
- The outer 50% of the data
Answer: B) The middle 50% of the data
IQR is the range from Q1 (25th percentile) to Q3 (75th percentile), covering the middle half of the data.
1.5 Topic 5: Probability Basics (Questions 41-50)
1.5.1 Question 41
If P(A) = 0.6, then P(not A) equals:
- 0.6
- 0.4
- 0.3
- 1.0
Answer: B) 0.4
P(not A) = 1 - P(A) = 1 - 0.6 = 0.4.
1.5.2 Question 42
In a study of 100 patients, 60 improved. The probability a randomly selected patient improved is:
- 0.40
- 0.50
- 0.60
- 1.00
Answer: C) 0.60
P(improved) = 60/100 = 0.60.
1.5.3 Question 43
Events A and B are mutually exclusive. This means:
- P(A AND B) = 0
- P(A AND B) = 1
- P(A|B) = P(A)
- P(A AND B) = P(A) × P(B)
Answer: A) P(A AND B) = 0
Mutually exclusive events cannot occur together, so their intersection has probability 0.
1.5.4 Question 44
If P(A) = 0.3 and P(B) = 0.4, and A and B are mutually exclusive, then P(A OR B) equals:
- 0.12
- 0.30
- 0.70
- 0.10
Answer: C) 0.70
For mutually exclusive events: P(A OR B) = P(A) + P(B) = 0.3 + 0.4 = 0.7.
1.5.5 Question 45
If P(A) = 0.5, P(B) = 0.3, and P(A AND B) = 0.15, then P(A OR B) equals:
- 0.65
- 0.80
- 0.50
- 0.35
Answer: A) 0.65
Using the addition rule: P(A OR B) = P(A) + P(B) - P(A AND B) = 0.5 + 0.3 - 0.15 = 0.65.
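The addition rule used in Questions 44 and 45 can be written as one small function. This is an illustrative sketch; the name `prob_union` is hypothetical, and the default of 0 for the overlap encodes the mutually exclusive case.

```python
def prob_union(p_a, p_b, p_a_and_b=0.0):
    """General addition rule: P(A or B) = P(A) + P(B) - P(A and B).

    For mutually exclusive events P(A and B) = 0, so the default applies.
    """
    return p_a + p_b - p_a_and_b


print(round(prob_union(0.3, 0.4), 2))        # Question 44: mutually exclusive
print(round(prob_union(0.5, 0.3, 0.15), 2))  # Question 45: overlapping events
```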
1.5.6 Question 46
Two events A and B are independent if:
- P(A AND B) = 0
- P(A|B) = P(A)
- P(A OR B) = P(A) + P(B)
- P(A) = P(B)
Answer: B) P(A|B) = P(A)
Events are independent if knowing B occurred doesn’t change the probability of A.
1.5.7 Question 47
If A and B are independent with P(A) = 0.4 and P(B) = 0.5, then P(A AND B) equals:
- 0.20
- 0.90
- 0.10
- 0.70
Answer: A) 0.20
For independent events: P(A AND B) = P(A) × P(B) = 0.4 × 0.5 = 0.20.
1.5.8 Question 48
A probability of 0 means:
- The event always occurs
- The event never occurs
- The event occurs half the time
- We don’t have enough information
Answer: B) The event never occurs
A probability of 0 indicates impossibility.
1.5.9 Question 49
Which value CANNOT be a probability?
- 0
- 0.5
- 1.0
- 1.5
Answer: D) 1.5
Probabilities must be between 0 and 1 (inclusive). 1.5 is greater than 1 and thus impossible.
1.5.10 Question 50
In a sample space of equally likely outcomes, if there are 8 favorable outcomes and 40 total outcomes, the probability is:
- 0.20
- 0.25
- 5.0
- 0.50
Answer: A) 0.20
P = favorable/total = 8/40 = 0.20.
1.6 Topic 6: Conditional Probability & Independence (Questions 51-60)
1.6.1 Question 51
If P(A) = 0.6, P(B) = 0.4, and P(A AND B) = 0.24, then P(A|B) equals:
- 0.24
- 0.40
- 0.60
- 0.96
Answer: C) 0.60
P(A|B) = P(A AND B) / P(B) = 0.24 / 0.4 = 0.60.
1.6.2 Question 52
Using the data from Question 51, are events A and B independent?
- Yes, because P(A|B) = P(A)
- No, because P(A|B) ≠ P(A)
- Cannot determine
- Yes, because P(A AND B) = P(A) × P(B)
Answer: A) Yes, because P(A|B) = P(A)
P(A|B) = 0.60 and P(A) = 0.60, so they’re equal, indicating independence. The product check named in the last option is an equivalent criterion and also holds here: P(A) × P(B) = 0.6 × 0.4 = 0.24 = P(A AND B).
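Both independence checks from Questions 51 and 52 can be sketched in Python. The function names `conditional` and `are_independent` are hypothetical, and `math.isclose` is used because floating-point products are rarely exactly equal.

```python
from math import isclose


def conditional(p_a_and_b, p_b):
    """P(A|B) = P(A and B) / P(B); requires P(B) > 0."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_a_and_b / p_b


def are_independent(p_a, p_b, p_a_and_b):
    """Product check: independent when P(A and B) = P(A) * P(B)."""
    return isclose(p_a_and_b, p_a * p_b)


p_a, p_b, p_a_and_b = 0.6, 0.4, 0.24   # values from Question 51
print(conditional(p_a_and_b, p_b))     # compare with P(A) = 0.6
print(are_independent(p_a, p_b, p_a_and_b))
```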
1.6.3 Question 53
In a group of 100 patients, 40 have disease D and 30 have symptom S. If 20 have both D and S, what is P(S|D)?
- 0.20
- 0.30
- 0.50
- 0.67
Answer: C) 0.50
P(S|D) = P(S AND D) / P(D) = (20/100) / (40/100) = 20/40 = 0.50.
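The counts in Question 53 can also be laid out as a 2×2 contingency table, which makes each conditional probability a ratio of two counts. This is a sketch under the question's numbers; the dictionary layout and variable names are illustrative choices, and `Fraction` keeps the results exact.

```python
from fractions import Fraction

# Counts from Question 53: 100 patients, 40 with D, 30 with S, 20 with both.
total, n_d, n_s, n_both = 100, 40, 30, 20

# Fill in the 2x2 table; keys are (has_disease, has_symptom).
table = {
    (True, True): n_both,
    (True, False): n_d - n_both,
    (False, True): n_s - n_both,
    (False, False): total - n_d - n_s + n_both,
}

p_s = Fraction(n_s, total)                        # P(S)
p_s_given_d = Fraction(table[(True, True)], n_d)  # P(S|D)
p_d_given_s = Fraction(table[(True, True)], n_s)  # P(D|S)

print(table)
print(p_s, p_s_given_d, p_d_given_s)
print(p_s_given_d == p_s)  # False here, so S and D are not independent
```

Note that P(S|D) and P(D|S) differ: the first divides by the number with the disease, the second by the number with the symptom.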
1.6.4 Question 54
If two events are independent, then:
- They cannot occur together
- They must have equal probabilities
- Knowing one occurred doesn’t change the probability of the other
- P(A OR B) = P(A) + P(B)
Answer: C) Knowing one occurred doesn’t change the probability of the other
This is the definition of independence: P(A|B) = P(A).
1.6.5 Question 55
Events that cannot both occur at the same time are:
- Independent
- Mutually exclusive
- Conditional
- Complementary
Answer: B) Mutually exclusive
Mutually exclusive (or disjoint) events have no overlap: P(A AND B) = 0.
1.6.6 Question 56
Can two events be both mutually exclusive AND independent (assuming neither has probability 0)?
- Yes, always
- No, never
- Sometimes
- Only if P(A) = P(B)
Answer: B) No, never
If events are mutually exclusive, P(A AND B) = 0. If they’re independent, P(A AND B) = P(A)×P(B). These can only both be true if P(A) = 0 or P(B) = 0. Since neither probability is 0 here, the two conditions cannot hold at once.
1.6.7 Question 57
In a study, P(Recovery|Treatment) = 0.8 and P(Recovery|Placebo) = 0.5. This suggests:
- Recovery and treatment are independent
- Recovery depends on whether treatment was received
- Recovery and treatment are mutually exclusive
- The probabilities are incorrect
Answer: B) Recovery depends on whether treatment was received
The probability of recovery differs depending on the treatment, showing dependence (not independence).
1.6.8 Question 58
If P(A|B) = 0.7 and P(B) = 0.2, then P(A AND B) equals:
- 0.14
- 0.50
- 0.90
- 3.50
Answer: A) 0.14
From P(A|B) = P(A AND B)/P(B), we get: P(A AND B) = P(A|B) × P(B) = 0.7 × 0.2 = 0.14.
1.6.9 Question 59
In a contingency table, if P(A|B) = P(A), then:
- A and B are mutually exclusive
- A and B are independent
- A and B are dependent
- A and B are complementary
Answer: B) A and B are independent
When P(A|B) = P(A), knowing B doesn’t change the probability of A, which is the definition of independence.
1.6.10 Question 60
The complement of event A is:
- All outcomes where A occurs
- All outcomes where A does not occur
- All outcomes where A and B occur
- The same as P(A)
Answer: B) All outcomes where A does not occur
The complement (not A) consists of all outcomes in the sample space that are not in A.
1.7 Topic 7: Binomial Distribution (Questions 61-70)
1.7.1 Question 61
Which scenario follows a binomial distribution?
- Time until a patient recovers
- Number of heads in 10 coin flips
- Weight of newborn babies
- Types of birds observed
Answer: B) Number of heads in 10 coin flips
This has fixed n (10 trials), two outcomes (heads/tails), same probability (0.5), and independence.
1.7.2 Question 62
For a binomial distribution to apply, which is NOT required?
- Fixed number of trials
- Two possible outcomes per trial
- Same probability for each trial
- Trials must be dependent
Answer: D) Trials must be dependent
Binomial requires independence, not dependence.
1.7.3 Question 63
A binomial random variable has n = 20 and p = 0.3. The expected value E(X) is:
- 6
- 10
- 15
- 20
Answer: A) 6
E(X) = np = 20 × 0.3 = 6.
1.7.4 Question 64
For a binomial distribution with n = 50 and p = 0.6, the expected value is:
- 20
- 30
- 40
- 50
Answer: B) 30
E(X) = np = 50 × 0.6 = 30.
1.7.5 Question 65
A researcher plants 100 seeds, each with 0.8 probability of germinating independently. If X = number of seeds that germinate, then X follows:
- Normal distribution
- Binomial distribution with n=100, p=0.8
- Uniform distribution
- Binomial distribution with n=0.8, p=100
Answer: B) Binomial distribution with n=100, p=0.8
Fixed number of trials (100), two outcomes (germinate/don’t), same probability (0.8), independence.
1.7.6 Question 66
For the scenario in Question 65, what is E(X)?
- 20
- 50
- 80
- 100
Answer: C) 80
E(X) = np = 100 × 0.8 = 80 seeds expected to germinate.
1.7.7 Question 67
If a binomial random variable has n = 10 and E(X) = 7, what is p?
- 0.3
- 0.5
- 0.7
- 1.0
Answer: C) 0.7
E(X) = np, so 7 = 10p, therefore p = 7/10 = 0.7.
1.7.8 Question 68
In a binomial distribution, if we increase p (probability of success) while keeping n fixed:
- E(X) decreases
- E(X) increases
- E(X) stays the same
- Cannot determine
Answer: B) E(X) increases
Since E(X) = np, increasing p while n is fixed increases the expected value.
1.7.9 Question 69
A binomial random variable can take values:
- Only 0 and 1
- Any integer from 0 to n
- Any real number
- Only positive values
Answer: B) Any integer from 0 to n
X counts the number of successes in n trials, so X can be 0, 1, 2, …, n.
1.7.10 Question 70
If X ~ Binomial(n=25, p=0.4) and Y = 3X + 5, then E(Y) equals:
- 10
- 25
- 30
- 35
Answer: D) 35
E(X) = np = 25 × 0.4 = 10. Then E(Y) = E(3X + 5) = 3E(X) + 5 = 3(10) + 5 = 35.
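The shortcut E(3X + 5) = 3E(X) + 5 can be checked against the definition of expected value by summing over every possible outcome. This is a minimal sketch; `binom_pmf` is a hypothetical helper that implements the standard binomial probability formula with `math.comb`.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 25, 0.4  # values from Question 70

# Expected values from the definition E[g(X)] = sum of g(k) * P(X = k).
e_x = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
e_y = sum((3 * k + 5) * binom_pmf(k, n, p) for k in range(n + 1))

print(round(e_x, 6), n * p)          # agrees with the shortcut E(X) = np
print(round(e_y, 6), 3 * n * p + 5)  # agrees with 3E(X) + 5
```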
1.8 Topic 8: Normal Distribution (Questions 71-80)
1.8.1 Question 71
A normal distribution is characterized by:
- Skewed to the right
- Bell-shaped and symmetric
- Uniform across all values
- Two distinct peaks
Answer: B) Bell-shaped and symmetric
The normal distribution has a characteristic bell shape and is symmetric around its mean.
1.8.2 Question 72
According to the 68-95-99.7 rule, approximately what percentage of data falls within 1 standard deviation of the mean?
- 50%
- 68%
- 95%
- 99.7%
Answer: B) 68%
About 68% of data in a normal distribution falls within μ ± 1σ.
1.8.3 Question 73
Heights are normally distributed with mean 170 cm and standard deviation 10 cm. Approximately what percentage of heights are between 160 cm and 180 cm?
- 50%
- 68%
- 95%
- 99.7%
Answer: B) 68%
160 and 180 are one standard deviation below and above the mean (170 ± 10), so about 68% fall in this range.
1.8.4 Question 74
For the distribution in Question 73, approximately what percentage of heights are between 150 cm and 190 cm?
- 68%
- 95%
- 99.7%
- 50%
Answer: B) 95%
150 and 190 are two standard deviations below and above the mean (170 ± 20), so about 95% fall in this range.
1.8.5 Question 75
A z-score of 2 means the value is:
- 2 units above the mean
- 2 standard deviations above the mean
- 2% above the mean
- Equal to 2
Answer: B) 2 standard deviations above the mean
The z-score tells us how many standard deviations a value is from the mean.
1.8.6 Question 76
If μ = 100, σ = 15, and x = 130, the z-score is:
- 30
- 2
- 1.5
- 0.5
Answer: B) 2
z = (x - μ)/σ = (130 - 100)/15 = 30/15 = 2.
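The z-score formula translates directly into code. The sketch below uses a hypothetical helper `z_score` and reuses numbers from Questions 73 and 76.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


print(z_score(130, 100, 15))  # Question 76: two SDs above the mean
print(z_score(160, 170, 10))  # a height of 160 cm from Question 73: one SD below
```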
1.8.7 Question 77
A normal distribution has mean 50 and standard deviation 5. Approximately what percentage of values are above 60?
- 2.5%
- 16%
- 50%
- 84%
Answer: A) 2.5%
60 is two standard deviations above the mean (50 + 2×5). About 95% fall within 2 SD, leaving 5% outside. By symmetry, 2.5% are above.
1.8.8 Question 78
Which distribution is appropriate for data that is bell-shaped and symmetric?
- Binomial
- Uniform
- Normal
- Skewed
Answer: C) Normal
The normal distribution is used for bell-shaped, symmetric data.
1.8.9 Question 79
For a normal distribution, the mean, median, and mode are:
- All different
- All equal
- Mean > Median > Mode
- Mode > Mean > Median
Answer: B) All equal
In a symmetric, single-peaked distribution like the normal, mean = median = mode.
1.8.10 Question 80
A negative z-score indicates:
- The value is above the mean
- The value is below the mean
- An error in calculation
- The value equals the mean
Answer: B) The value is below the mean
Negative z-scores occur when x < μ, indicating the value is below the mean.
2 Section II: Short Answer Questions (20 Questions)
2.1 Questions 81-90: Conceptual Understanding
2.1.1 Question 81
Explain the difference between a parameter and a statistic. Give an example of each in a biological research context.
A parameter is a numerical summary of a population, while a statistic is a numerical summary of a sample.
Example:
- Parameter: The true average wingspan of all monarch butterflies in North America (we’d need to measure every monarch to know this).
- Statistic: The average wingspan calculated from a sample of 100 monarch butterflies captured and measured.
The key difference is that statistics are calculated from data we collect, while parameters are true (but usually unknown) values describing the entire population.
2.1.2 Question 82
A researcher wants to determine if a new fertilizer increases plant growth. Describe how to design this as a proper experiment, including random assignment, control group, and replication.
Proper Experimental Design:
Random Assignment: Randomly assign plants to two groups - treatment group (receives new fertilizer) and control group (receives standard fertilizer or no fertilizer).
Control Group: The control group provides a baseline for comparison. Without it, we wouldn’t know if observed growth is due to the fertilizer or other factors.
Replication: Use many plants in each group (e.g., 50 per group) so that results are less likely to reflect chance variation. The entire experiment could also be repeated multiple times.
Control Variables: Keep other conditions identical (light, water, temperature, soil type).
Measurement: Measure plant growth after a fixed period for all plants.
This design allows us to establish causation because random assignment creates comparable groups, and any difference in growth can be attributed to the fertilizer.
2.1.3 Question 83
Explain why the median is often preferred over the mean when a distribution is skewed. Use an example.
The median is preferred for skewed distributions because it’s resistant to outliers, while the mean is pulled toward the tail.
Example: Hospital stay lengths (in days): 1, 2, 2, 3, 3, 4, 5, 45
- Mean = (1+2+2+3+3+4+5+45)/8 = 65/8 = 8.125 days
- Median = (3+3)/2 = 3 days
Most patients (7 out of 8) stayed 1-5 days, but one patient stayed 45 days. The median (3 days) better represents a “typical” stay. The mean (8.125 days) is misleading because it’s pulled up by the one extreme value.
Why this matters: In right-skewed data (like income, hospital stays, recovery times), the median gives a more accurate picture of what’s typical.
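The hospital-stay example can be reproduced with Python's standard `statistics` module. The second list, which drops the 45-day stay, is an added illustration (not part of the original example) showing how much a single value moves the mean.

```python
from statistics import mean, median

stays = [1, 2, 2, 3, 3, 4, 5, 45]  # hospital stays (days) from the example
typical = [1, 2, 2, 3, 3, 4, 5]    # same data without the 45-day stay

print(mean(stays), median(stays))      # mean 8.125 days, median 3 days
print(mean(typical), median(typical))  # without the extreme value, mean and median nearly agree
```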
2.1.4 Question 84
What does it mean for two events to be independent? Explain using a medical testing example.
Independence means that knowing one event occurred does not change the probability of the other event occurring. Mathematically: P(A|B) = P(A).
Medical Example: Consider testing for two different diseases - Disease X and Disease Y.
Suppose:
- P(Disease X) = 0.10 (10% of people have it)
- P(Disease Y) = 0.05 (5% of people have it)
If the diseases are independent, then:
- P(Disease X | has Disease Y) = P(Disease X) = 0.10
This means: among people who have Disease Y, still only 10% have Disease X - the same as in the general population. Having Disease Y doesn’t change the probability of having Disease X.
If the diseases were dependent (not independent), knowing someone has Disease Y would change the probability they have Disease X (e.g., if both diseases were caused by the same risk factor).
2.1.5 Question 85
Describe the difference between sampling bias and random sampling error. Why is sampling bias more problematic?
Random Sampling Error:
- Natural variation that occurs when sampling
- Occurs even with proper random sampling
- Unpredictable - sometimes sample mean is too high, sometimes too low
- Decreases with larger sample size
- Example: Randomly sampling 100 fish from a lake - by chance, you might get slightly larger fish than the true population average
Sampling Bias:
- Systematic error due to poor sampling method
- Sample consistently over- or under-represents certain groups
- Predictable direction - always pulls results the same way
- Does NOT decrease with larger sample size
- Example: Only sampling fish from one cove (easier to access) when fish in that cove are systematically larger than in other parts of the lake
Why bias is worse: You can reduce random error by taking larger samples, but no amount of data will fix bias. A biased sampling method will give you the wrong answer even with huge samples.
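A small simulation can illustrate why larger samples help with random error but not with bias. Everything in this sketch is hypothetical: the lake composition, the fish lengths (30 cm and 40 cm averages), and the sample sizes are invented for illustration, and the fixed seed only makes the run reproducible.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical lake: 9,000 open-water fish plus 1,000 larger fish in one cove.
open_water = [rng.gauss(30, 5) for _ in range(9000)]  # lengths in cm
cove = [rng.gauss(40, 5) for _ in range(1000)]
lake = open_water + cove
true_mean = mean(lake)

errors = {}
for n in (50, 500):
    random_sample = rng.sample(lake, n)  # every fish equally likely to be chosen
    cove_sample = rng.sample(cove, n)    # convenient, but only cove fish
    errors[n] = (mean(random_sample) - true_mean, mean(cove_sample) - true_mean)
    print(n, [round(e, 2) for e in errors[n]])
```

In a setup like this, the random-sample error tends to shrink as n grows, while the cove-sample error stays near the built-in gap of about 9 cm regardless of n.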
2.1.6 Question 86
A study finds that people who drink coffee have lower rates of heart disease. Can we conclude that coffee prevents heart disease? Why or why not?
No, we cannot conclude that coffee prevents heart disease from this observational study alone.
Reasons:
Correlation ≠ Causation: This is an observational study showing an association, not an experiment showing causation.
Potential Confounding: There may be other variables associated with both coffee drinking and heart disease:
- Coffee drinkers might exercise more
- Coffee drinkers might have higher income (better healthcare)
- Coffee drinkers might have lower stress levels
- Many other lifestyle factors could differ
Reverse Causation: Maybe people with heart disease are advised to avoid coffee, creating the observed pattern.
To establish causation, we’d need a randomized experiment where people are randomly assigned to drink coffee or not, and all other factors are controlled. Only then could we attribute differences in heart disease to the coffee itself.
The observational study suggests a relationship worth investigating, but doesn’t prove coffee is the cause.
2.1.7 Question 87
Explain what the 68-95-99.7 rule tells us about a normal distribution. Why is this useful?
The 68-95-99.7 Rule (Empirical Rule) states that for a normal distribution:
- Approximately 68% of data falls within 1 standard deviation of the mean (μ ± σ)
- Approximately 95% of data falls within 2 standard deviations of the mean (μ ± 2σ)
- Approximately 99.7% of data falls within 3 standard deviations of the mean (μ ± 3σ)
Why this is useful:
Quick estimates: We can quickly estimate what percentage of data falls in a range without complex calculations
Identify unusual values: Values more than 2 standard deviations from the mean are uncommon (about 5% of values), and values more than 3 standard deviations away are rare (about 0.3%)
Real-world application: Many biological measurements (height, blood pressure, birth weight) approximately follow a normal distribution
Example: If adult male heights are normally distributed with mean 175 cm and SD 7 cm:
- About 68% of men are between 168-182 cm (175 ± 7)
- About 95% of men are between 161-189 cm (175 ± 14)
- A man who is 196 cm (175 + 21 = 175 + 3×7) is in the tallest 0.15% (very unusual)
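The rounded percentages in the rule can be compared with exact normal probabilities using `statistics.NormalDist` from Python's standard library. This sketch assumes the height distribution from the example (mean 175 cm, SD 7 cm); the variable names are illustrative.

```python
from statistics import NormalDist

heights = NormalDist(mu=175, sigma=7)  # adult male heights from the example

shares = {}
for k in (1, 2, 3):
    low, high = 175 - k * 7, 175 + k * 7
    shares[k] = heights.cdf(high) - heights.cdf(low)
    print(f"within {k} SD ({low}-{high} cm): {shares[k]:.4f}")

# Share of men taller than 196 cm (3 SD above the mean)
tail = 1 - heights.cdf(196)
print(f"above 196 cm: {tail:.5f}")
```

The exact shares are close to 68%, 95%, and 99.7%. The exact upper tail beyond 3 SD is about 0.13%, slightly below the 0.15% obtained by halving the rule's rounded 0.3%.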
2.1.8 Question 88
What is a confounding variable? Give an example from environmental or health research.
A confounding variable is a variable that:
1. Is associated with both the independent variable (exposure) and the dependent variable (outcome)
2. Creates a false or misleading association between the exposure and outcome
3. Makes it difficult to determine if the exposure actually causes the outcome
Example from Health Research:
Question: Does vitamin supplement use improve health outcomes?
Observed: People who take vitamins have better health
Confounding Variables:
- Exercise: People who take vitamins may also exercise more. Exercise is associated with both taking vitamins (health-conscious behavior) AND better health
- Income: Wealthier people may both take vitamins AND have better healthcare access
- Diet: People who take vitamins may eat healthier overall
The Problem: Without accounting for these confounders, we can’t tell if better health is due to:
- The vitamins themselves, OR
- Exercise, income, diet, or other factors
Solution: Random assignment in an experiment would distribute these confounding variables evenly across groups, or we can measure and control for them statistically in observational studies.
2.1.9 Question 89
Describe what sensitivity and specificity mean for a medical diagnostic test. Why are both important?
Sensitivity = P(Test Positive | Disease Present)
- The probability that the test correctly identifies someone who HAS the disease
- Measures how good the test is at catching disease (detecting true positives)
- High sensitivity means few false negatives
Specificity = P(Test Negative | Disease Absent)
- The probability that the test correctly identifies someone who DOES NOT have the disease
- Measures how good the test is at ruling out disease in healthy people (avoiding false positives)
- High specificity means few false positives
Why both are important:
High Sensitivity is crucial when:
- Missing disease has serious consequences (e.g., cancer screening)
- Disease is treatable if caught early
- We want to catch all possible cases (even if we get some false alarms)
High Specificity is crucial when:
- False positives cause harm (unnecessary treatment, anxiety, cost)
- Follow-up tests are invasive or expensive
- Disease is rare (otherwise, we’d get too many false positives)
Ideal: High sensitivity AND high specificity, but there’s often a trade-off. Tests are designed based on which type of error is more costly in that particular situation.
Example: HIV screening uses high-sensitivity tests (don’t want to miss anyone infected), with positive results confirmed by high-specificity tests (ensure they truly have HIV before diagnosis).
2.1.10 Question 90
Explain what expected value E(X) means for a binomial random variable. What does it tell us in practical terms?
For a binomial random variable with n trials and success probability p on each trial, E(X) = np is the expected value, or mean, of the distribution.
What it means:
Long-run average: If we repeated the random process many times, the average number of successes would be close to E(X)
Most likely neighborhood: Values near E(X) are more probable than values far from E(X)
NOT a prediction for a single trial: E(X) might not even be a possible outcome!
Practical Example:
A biologist plants 100 seeds, each with 0.70 probability of germinating.
- X = number of seeds that germinate
- E(X) = np = 100 × 0.70 = 70 seeds
This tells us:
- On average, we expect 70 seeds to germinate
- If we planted 100 seeds many times, the average number germinating would be about 70
- Values near 70 (like 68, 69, 71, 72) are more likely than values far from 70 (like 50 or 90)
- We won’t necessarily get exactly 70 seeds germinating in any particular trial due to random variation
Important: E(X) is the center of the distribution, but actual outcomes will vary around this value according to the binomial distribution.
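The "long-run average" idea can be checked with a small simulation. The sketch below is illustrative only: the function name `simulate_germination`, the seed, and the 5000 repetitions are arbitrary choices, not part of the course material. It repeats the 100-seed planting many times and averages the germination counts.

```python
import random

def simulate_germination(n_seeds=100, p_germinate=0.70, rng=random):
    """Count how many of n_seeds germinate in one simulated planting."""
    return sum(rng.random() < p_germinate for _ in range(n_seeds))

rng = random.Random(7)  # fixed seed (arbitrary) so the run is repeatable
counts = [simulate_germination(rng=rng) for _ in range(5000)]

# Average count across all simulated plantings; should sit close to np = 70
long_run_average = sum(counts) / len(counts)
# Fraction of plantings that produced exactly 70 germinated seeds
share_exactly_70 = counts.count(70) / len(counts)

print(f"Average over {len(counts)} plantings: {long_run_average:.2f}")
print(f"Share of plantings with exactly 70: {share_exactly_70:.3f}")
```

The average should land close to 70, while only a small share of individual plantings hit exactly 70. That is the difference between the expected value and a prediction for a single trial.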
2.2 Questions 91-100: Calculations and Applications
2.2.1 Question 91
A wildlife researcher captures and measures 50 randomly selected deer from a forest of 2000 deer. The average weight of the 50 deer is 68 kg.
- Is 68 kg a parameter or a statistic?
- What population is the researcher interested in?
- What is the sample in this study?
Statistic - It’s calculated from a sample (50 deer), not the entire population (2000 deer).
Population: All 2000 deer in the forest. The researcher wants to learn about the average weight of all deer in this forest.
Sample: The 50 deer that were captured and measured.
Key distinction: The true average weight of all 2000 deer is a parameter (unknown). The 68 kg is our sample statistic, an estimate of that parameter.
2.2.2 Question 92
Recovery times (in weeks) for 8 patients after surgery are: 2, 3, 3, 4, 4, 5, 6, 14
- Calculate the mean and median.
- Which measure better represents typical recovery time? Why?
- Is the value 14 an outlier? Show your work.
- Calculations:
  - Mean = (2+3+3+4+4+5+6+14)/8 = 41/8 = 5.125 weeks
  - Median: Middle values are 4 and 4, so median = (4+4)/2 = 4 weeks
- Median (4 weeks) is better because:
  - Most patients (7 out of 8) recovered in 2-6 weeks
  - One patient took 14 weeks (an outlier)
  - The mean is pulled up by this outlier
  - The median better represents what’s “typical”
- Check for outlier using IQR method:
  - Q1 = 3 (between 2nd and 3rd values)
  - Q3 = 5.5 (between 6th and 7th values)
  - IQR = Q3 - Q1 = 5.5 - 3 = 2.5
  - Upper boundary = Q3 + 1.5(IQR) = 5.5 + 1.5(2.5) = 5.5 + 3.75 = 9.25
  - Since 14 > 9.25, yes, 14 is an outlier
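The calculation above can be reproduced in a few lines of Python. This is a sketch that assumes the quartile rule used in this solution (Q1 and Q3 are the medians of the lower and upper halves of the sorted data). The helper names `quartiles_by_halves` and `iqr_fences` are made up for illustration.

```python
from statistics import mean, median

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    With an odd number of values the middle value is left out of both halves.
    Needs at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[len(data) - half:])

def iqr_fences(values):
    """Return the (lower, upper) 1.5 x IQR outlier boundaries."""
    q1, q3 = quartiles_by_halves(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

recovery_weeks = [2, 3, 3, 4, 4, 5, 6, 14]
q1, q3 = quartiles_by_halves(recovery_weeks)
lower, upper = iqr_fences(recovery_weeks)
outliers = [x for x in recovery_weeks if x < lower or x > upper]

print("mean:", mean(recovery_weeks))
print("median:", median(recovery_weeks))
print("Q1, Q3:", q1, q3)
print("fences:", lower, upper)
print("outliers:", outliers)
```

Statistical software often uses a different quartile rule by default, so its Q1 and Q3 can differ slightly from the hand calculation.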
2.2.3 Question 93
In a study of 200 patients: 120 received Treatment A, 80 received Treatment B. Among Treatment A patients, 90 improved. Among Treatment B patients, 40 improved.
- What is P(Improved)?
- What is P(Improved | Treatment A)?
- Are “Improved” and “Treatment A” independent? Justify your answer.
P(Improved) = Total improved / Total patients = (90+40)/200 = 130/200 = 0.65 or 65%
P(Improved | Treatment A) = Improved in A / Total in A = 90/120 = 0.75 or 75%
Check independence: Are P(Improved|Treatment A) and P(Improved) equal?
- P(Improved|Treatment A) = 0.75
- P(Improved) = 0.65
- Since 0.75 ≠ 0.65, NO, they are NOT independent
Interpretation: The probability of improvement depends on which treatment was received. Treatment A patients have a higher improvement rate (75%) than the overall rate (65%), showing that improvement and treatment type are related (dependent).
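An equivalent way to check independence is the multiplication rule: the events are independent only if P(Improved and A) = P(Improved) × P(A). The sketch below uses exact fractions so the comparison is not affected by rounding. The variable names are illustrative.

```python
from fractions import Fraction

total = 200
treated_a = 120
improved_a = 90
improved_b = 40

p_improved = Fraction(improved_a + improved_b, total)  # 130/200
p_a = Fraction(treated_a, total)                       # 120/200
p_improved_and_a = Fraction(improved_a, total)         # 90/200
p_improved_given_a = p_improved_and_a / p_a            # conditional probability

# Independence requires P(Improved and A) == P(Improved) * P(A)
independent = p_improved_and_a == p_improved * p_a

print("P(Improved and A)     =", float(p_improved_and_a))
print("P(Improved) * P(A)    =", float(p_improved * p_a))
print("P(Improved | A)       =", float(p_improved_given_a))
print("Independent?", independent)
```

Both checks agree: 0.45 differs from 0.65 × 0.60 = 0.39, just as 0.75 differs from 0.65.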
2.2.4 Question 94
A screening test for a disease has sensitivity = 0.92 and specificity = 0.88. The disease prevalence is 3% in the population.
- If a person has the disease, what is the probability they test positive?
- If a person does NOT have the disease, what is the probability they test negative?
- Explain why a positive test doesn’t necessarily mean the person has the disease.
P(Test + | Disease) = Sensitivity = 0.92 or 92%
P(Test - | No Disease) = Specificity = 0.88 or 88%
Why positive ≠ definitely has disease:
Even though the test is fairly accurate, the disease is rare (3% prevalence). This means:
- Out of 1000 people: About 30 have disease, 970 don’t
- Of the 30 with disease: ~28 test positive (sensitivity 92%)
- Of the 970 without disease: ~116 test positive (1-specificity = 12% false positive rate)
So we’d expect about 28 + 116 = 144 total positive tests, but only 28 actually have the disease!
P(Disease|Test+) = 28/144 ≈ 19%
This is called the positive predictive value (PPV) - only about 19% of positive tests actually indicate disease. The low prevalence combined with imperfect specificity leads to many false positives.
This is why positive screening tests usually require confirmatory testing!
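The same result follows directly from Bayes' Theorem without rounding to whole people. The function below is a minimal sketch, and its name and parameter names are illustrative.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(Disease | Test+) computed with Bayes' Theorem."""
    true_positive = sensitivity * prevalence               # P(Test+ and Disease)
    false_positive = (1 - specificity) * (1 - prevalence)  # P(Test+ and No Disease)
    return true_positive / (true_positive + false_positive)

ppv = positive_predictive_value(sensitivity=0.92, specificity=0.88, prevalence=0.03)
print(f"PPV = {ppv:.3f}")

# Raising the prevalence while keeping the same test shows how strongly PPV depends on it
ppv_common = positive_predictive_value(0.92, 0.88, 0.30)
print(f"PPV at 30% prevalence = {ppv_common:.3f}")
```

The exact value is about 0.192. The 28/144 ≈ 0.194 figure above differs slightly because the counts were rounded to whole people.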
2.2.5 Question 95
A geneticist studies inheritance of a trait. Each offspring has 0.25 probability of expressing the trait, independent of siblings. A mating pair has 6 offspring.
- What distribution models the number of offspring expressing the trait? State the parameters.
- What is the expected number of offspring expressing the trait?
- What is the probability that NO offspring express the trait? (Write the formula, you don’t need to compute the final number.)
- Binomial distribution with parameters:
  - n = 6 (number of offspring)
  - p = 0.25 (probability of expressing trait)
Justification:
- Fixed number of trials (6 offspring)
- Two outcomes (expresses trait or not)
- Same probability for each (0.25)
- Independence (siblings inherit independently)
- E(X) = np = 6 × 0.25 = 1.5 offspring
On average, we expect 1.5 out of 6 offspring to express the trait.
- P(X = 0) formula:
\[P(X = 0) = \binom{6}{0}(0.25)^0(0.75)^6 = 1 \times 1 \times (0.75)^6\]
This gives us P(X = 0) ≈ 0.178 or about 17.8% chance that no offspring express the trait.
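For checking answers like this one, the binomial formula translates directly into code. This is a small sketch using only the standard library, and `binomial_pmf` is an illustrative name.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 6, 0.25
p_none = binomial_pmf(0, n, p)          # (0.75)^6
p_at_least_one = 1 - p_none             # complement rule
total_probability = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(f"P(X = 0)  = {p_none:.4f}")
print(f"P(X >= 1) = {p_at_least_one:.4f}")
print(f"Sum over all k = {total_probability:.4f}")
```

The probabilities over k = 0, 1, ..., 6 sum to 1, which is a quick sanity check on any binomial calculation.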
2.2.6 Question 96
Birth weights are normally distributed with mean 3400 grams and standard deviation 500 grams.
- Between what two weights do approximately 95% of births fall?
- What percentage of babies weigh more than 4400 grams?
- A baby weighing 2400 grams has what z-score?
- 95% fall within 2 SD of mean:
  - Lower bound: μ - 2σ = 3400 - 2(500) = 3400 - 1000 = 2400 grams
  - Upper bound: μ + 2σ = 3400 + 2(500) = 3400 + 1000 = 4400 grams
  - Answer: Between 2400 and 4400 grams
- 4400 grams is 2 SD above mean:
  - 95% fall within 2 SD, leaving 5% outside
  - By symmetry: 2.5% below 2400g and 2.5% above 4400g
  - Answer: Approximately 2.5%
- Z-score calculation: \[z = \frac{x - \mu}{\sigma} = \frac{2400 - 3400}{500} = \frac{-1000}{500} = -2\]
Interpretation: A 2400g baby is 2 standard deviations below the mean.
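The empirical-rule answers are approximations. As a rough check, Python's standard-library `statistics.NormalDist` gives the exact normal probabilities for the same mean and standard deviation. The variable names below are illustrative.

```python
from statistics import NormalDist

birth_weight = NormalDist(mu=3400, sigma=500)

# Exact share of births between 2400 g and 4400 g (mean ± 2 SD)
within_2sd = birth_weight.cdf(4400) - birth_weight.cdf(2400)
# Exact share of births above 4400 g
above_4400 = 1 - birth_weight.cdf(4400)
# z-score for a 2400 g baby, computed from the definition
z = (2400 - birth_weight.mean) / birth_weight.stdev

print(f"Within 2 SD: {within_2sd:.4f}")
print(f"Above 4400 g: {above_4400:.4f}")
print(f"z-score: {z}")
```

The exact values are about 95.45% and 2.28%, which is why the empirical rule says "approximately" 95% and 2.5%.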
2.2.7 Question 97
Consider the following contingency table for 500 patients:
| | Has Fever | No Fever | Total |
|---|---|---|---|
| Flu | 60 | 20 | 80 |
| No Flu | 80 | 340 | 420 |
| Total | 140 | 360 | 500 |
- What is P(Flu)?
- What is P(Fever | Flu)?
- What is P(Fever | No Flu)?
- Does having fever indicate flu is more likely? Explain.
P(Flu) = 80/500 = 0.16 or 16%
P(Fever | Flu) = 60/80 = 0.75 or 75%
P(Fever | No Flu) = 80/420 ≈ 0.19 or 19%
Yes, fever indicates flu is more likely:
Let’s compare P(Flu | Fever) to P(Flu):
- P(Flu | Fever) = 60/140 ≈ 0.43 or 43%
- P(Flu) = 0.16 or 16%
Among people with fever, 43% have flu, compared to only 16% in the general population. So knowing someone has a fever increases the probability they have flu from 16% to 43%.
Also note: 75% of flu patients have fever, but only 19% of non-flu patients have fever, showing fever is associated with flu.
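Every probability in this answer is a cell or margin count divided by the right total. The sketch below stores the four inner cells of the table and derives the rest. The dictionary layout and the `count` helper are illustrative choices.

```python
from fractions import Fraction

# Inner cells of the contingency table: (flu status, fever status) -> patients
table = {
    ("Flu", "Fever"): 60,
    ("Flu", "No Fever"): 20,
    ("No Flu", "Fever"): 80,
    ("No Flu", "No Fever"): 340,
}

def count(row=None, col=None):
    """Sum the cells matching the given row and/or column (None matches all)."""
    return sum(
        n for (r, c), n in table.items()
        if (row is None or r == row) and (col is None or c == col)
    )

p_flu = Fraction(count(row="Flu"), count())
p_fever_given_flu = Fraction(count("Flu", "Fever"), count(row="Flu"))
p_fever_given_no_flu = Fraction(count("No Flu", "Fever"), count(row="No Flu"))
p_flu_given_fever = Fraction(count("Flu", "Fever"), count(col="Fever"))

for label, value in [
    ("P(Flu)", p_flu),
    ("P(Fever | Flu)", p_fever_given_flu),
    ("P(Fever | No Flu)", p_fever_given_no_flu),
    ("P(Flu | Fever)", p_flu_given_fever),
]:
    print(f"{label} = {value} ≈ {float(value):.2f}")
```

The denominator is always the size of the group named after the "|" symbol, which is the most common place to slip on conditional probability questions.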
2.2.8 Question 98
A researcher monitors 30 bird nests. Historically, 60% of nests successfully fledge at least one chick. Let X = number of successful nests.
- What are the parameters n and p?
- Calculate E(X) and interpret it.
- If monitoring costs $15 per nest plus a fixed $50 cost, and Y = total cost, what is E(Y)?
- Parameters:
  - n = 30 (number of nests monitored)
  - p = 0.60 (probability of success per nest)
- E(X) = np = 30 × 0.60 = 18 nests
Interpretation: On average, we expect 18 out of 30 nests to successfully fledge at least one chick. If we monitored many sets of 30 nests, the average number of successful nests would be close to 18.
- Cost calculation:
  - Y = 15X + 50 (where X is number of successful nests)
  - E(Y) = 15·E(X) + 50 (using property of expected value)
  - E(Y) = 15(18) + 50 = 270 + 50 = $320
Interpretation: On average, the total monitoring cost is expected to be $320.
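Note: This calculation reads the $15 as a cost per successful nest, so the total cost depends on X. If the $15 is instead charged for every nest monitored regardless of success, the cost is not random: Y = 15(30) + 50 = $500 every time, so E(Y) = $500.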
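The property E(aX + b) = a·E(X) + b can be verified by computing E(Y) directly from the binomial probabilities. The sketch below follows the reading used in this solution ($15 per successful nest, an assumption about the question's intent) and uses exact fractions so the two answers can be compared exactly.

```python
from fractions import Fraction
from math import comb

n = 30
p = Fraction(3, 5)  # 0.60 as an exact fraction

# P(X = k) for k = 0, 1, ..., 30
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

# E(X) from the definition: sum of k * P(X = k)
e_x = sum(k * prob for k, prob in enumerate(pmf))
# E(Y) from the definition, with Y = 15X + 50 (assumed cost model)
e_y_direct = sum((15 * k + 50) * prob for k, prob in enumerate(pmf))
# E(Y) from the linearity shortcut
e_y_shortcut = 15 * e_x + 50

print("E(X) =", e_x)
print("E(Y) directly =", e_y_direct)
print("E(Y) via 15*E(X) + 50 =", e_y_shortcut)
```

Both routes give the same value, so the shortcut saves summing 31 terms by hand.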
2.2.9 Question 99
Data on plant heights (cm): 10, 12, 12, 15, 18, 20, 22, 35
- Calculate Q1, Q3, and IQR.
- Determine the outlier boundaries.
- Are there any outliers?
- Describe the shape of this distribution.
- Quartiles (with 8 values):
  - Q1 is between 2nd and 3rd values: Q1 = (12+12)/2 = 12 cm
  - Q3 is between 6th and 7th values: Q3 = (20+22)/2 = 21 cm
  - IQR = Q3 - Q1 = 21 - 12 = 9 cm
- Outlier boundaries:
  - Lower: Q1 - 1.5(IQR) = 12 - 1.5(9) = 12 - 13.5 = -1.5 cm
  - Upper: Q3 + 1.5(IQR) = 21 + 1.5(9) = 21 + 13.5 = 34.5 cm
- Check for outliers:
  - All values ≥ -1.5, so no low outliers
  - 35 > 34.5, so 35 is an outlier
- Distribution shape:
  - Most values clustered 10-22 cm (relatively symmetric)
  - One extreme value (35) to the right
  - Right-skewed (positive skew) due to the high outlier
We’d also expect mean > median for right-skewed data.
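The mean-versus-median claim is easy to confirm for this data set. The short sketch below also recomputes both measures without the 35 cm plant to show how much a single value moves each one. The variable names are illustrative.

```python
from statistics import mean, median

plant_heights_cm = [10, 12, 12, 15, 18, 20, 22, 35]

center_mean = mean(plant_heights_cm)      # 144 / 8
center_median = median(plant_heights_cm)  # average of 15 and 18

# Same measures with the outlier (35 cm) removed
without_outlier = [h for h in plant_heights_cm if h != 35]
mean_without = mean(without_outlier)
median_without = median(without_outlier)

print(f"All data:        mean = {center_mean}, median = {center_median}")
print(f"Without 35 cm:   mean = {mean_without:.2f}, median = {median_without}")
```

Removing the outlier shifts the mean by more than the median, which is the sense in which the median is resistant to extreme values.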
2.2.10 Question 100
In a clinical trial, patients are randomly assigned to receive either Drug X or placebo. Neither the patients nor the doctors evaluating outcomes know who received which treatment.
- What experimental design feature is described by “neither patients nor doctors know”?
- Why is this important?
- What role does random assignment play?
- If 75% of Drug X patients improve vs. 50% of placebo patients, can we conclude Drug X causes improvement? Why?
- Double-blinding (or just “blinding”)
  - Patients don’t know = single blind
  - Doctors also don’t know = double blind
- Why blinding is important:
Without blinding:
- Placebo effect: Patients who know they’re getting the drug might improve due to expectation/belief
- Observer bias: Doctors who know which patients got the drug might unconsciously evaluate them more favorably
- Treatment bias: Doctors might treat groups differently if they know who got what
With blinding:
- Differences in outcomes can be attributed to the drug itself rather than to expectations or unequal treatment
- Psychological and observational biases are minimized
- Random assignment’s role:
  - Creates comparable groups at the start
  - Distributes potential confounding variables evenly across groups
  - Ensures any pre-existing differences are due to chance, not systematic bias
  - Makes groups similar in all ways except the treatment
- Yes, we can conclude Drug X causes improvement because:
This is a well-designed experiment with:
1. Random assignment → comparable groups
2. Control group (placebo) → proper comparison
3. Blinding → reduces bias
4. Clear difference (75% vs 50%)
All these features together allow us to establish causation, not just correlation. Provided the difference is larger than chance alone would explain, the 25 percentage point difference can be attributed to Drug X itself, not to confounding variables or bias.
However, we should note:
- Results apply to patients similar to those in the trial
- Statistical significance should be verified
- Clinical significance should be considered (is a 25 percentage point improvement large enough to matter?)
2.3 Summary
You’ve completed all 100 practice questions! Here’s how they break down:
Multiple Choice (80 questions):
- Variables and Data Types: 10 questions
- Study Design and Sampling: 10 questions
- Descriptive Statistics - Center: 10 questions
- Descriptive Statistics - Spread: 10 questions
- Probability Basics: 10 questions
- Conditional Probability & Independence: 10 questions
- Binomial Distribution: 10 questions
- Normal Distribution: 10 questions
Short Answer (20 questions):
- Conceptual Understanding: 10 questions
- Calculations and Applications: 10 questions
Study Recommendations:
- Review any topics where you struggled - go back to lecture notes and homework
- Practice explaining concepts - can you teach them to someone else?
- Time yourself - can you work efficiently under exam conditions?
- Focus on interpretation - don’t just calculate, understand what results mean
- Use the actual exam format - 15 MC + 2 FR questions in 90 minutes
Good luck on your midterm!
