Study Guide

AP Statistics

Unit 1
Exploring One-Variable Data
  • Types of Variables: Categorical (groups/labels) vs. Quantitative (numbers). Discrete = countable; Continuous = measurable.
  • Center: Mean = average (sensitive to outliers). Median = middle value (resistant to outliers). Use median when data is skewed.
  • Spread: Range, IQR (Q3 − Q1), Standard Deviation (s), Variance (s²). IQR is resistant; std dev is not.
  • Shape of Distribution: Symmetric, skewed left (tail left), skewed right (tail right). Look for outliers, clusters, and gaps.
  • 5-Number Summary: Min, Q1, Median, Q3, Max. Used to build boxplots. Outlier rule: < Q1 − 1.5×IQR or > Q3 + 1.5×IQR.
  • z-Score (Standardizing): Tells how many standard deviations a value is from the mean. Allows comparison across different distributions.
  • Normal Distribution: Bell-shaped, symmetric. Empirical Rule: 68% within ±1σ, 95% within ±2σ, 99.7% within ±3σ.
Key Formulas
  • Mean: x̄ = Σxᵢ / n
  • Sample Std Dev: s = √[ Σ(xᵢ − x̄)² / (n − 1) ]
  • z-Score: z = (x − μ) / σ
  • IQR: IQR = Q3 − Q1
Skewed right → mean > median. Skewed left → mean < median. Always match summary stats to the shape!
dotplot · histogram · boxplot · stemplot · normal curve
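Worked example for Unit 1: a minimal Python sketch of the 5-number summary, the 1.5×IQR outlier rule, and a z-score. The data set `scores` and the function names are made up for illustration. Quartiles are taken as the medians of the lower and upper halves (excluding the middle value when n is odd), which is one common classroom convention; software may use a slightly different rule.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # For odd n, the middle value is left out of both halves.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

def outliers(data):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd

# Hypothetical right-skewed data set.
scores = [2, 4, 5, 6, 7, 8, 9, 10, 30]

summary = five_number_summary(scores)   # (2, 4.5, 7, 9.5, 30)
flagged = outliers(scores)              # [30]
mean = statistics.mean(scores)          # 9, larger than the median of 7
sd = statistics.stdev(scores)           # sample standard deviation (n - 1)
z_max = z_score(30, mean, sd)

print(summary, flagged, mean, round(sd, 3), round(z_max, 3))
```

Here the mean (9) sits above the median (7), which matches the "skewed right → mean > median" tip.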
Unit 2
Exploring Two-Variable Data
  • Scatterplot: Shows the relationship between two quantitative variables. Describe: Direction, Form (linear/curved), Strength, Outliers.
  • Correlation (r): Measures strength and direction of a linear relationship. Range: −1 to +1. r = 0 means no linear association. Sensitive to outliers!
  • Least-Squares Regression Line (LSRL): ŷ = a + bx. Minimizes the sum of squared residuals. The line always passes through (x̄, ȳ).
  • Slope (b): For each 1-unit increase in x, the predicted y (ŷ) changes by b units. Must include units and context in interpretation.
  • y-intercept (a): Predicted value of y when x = 0. Only meaningful if x = 0 makes sense in context.
  • Residual: Residual = observed y − predicted ŷ. Positive residual = point above line. Residual plots check linearity.
  • r² (Coefficient of Determination): Percent of the variation in y explained by the linear relationship with x. Example: r² = 0.81 → 81% of the variation in y is explained by the linear relationship with x.
  • Categorical Associations: Use two-way tables. Compare conditional distributions (row % or column %). Look for an association.
Key Formulas
  • Slope: b = r · (s_y / s_x)
  • y-intercept: a = ȳ − b·x̄
  • Residual: e = y − ŷ
Correlation ≠ Causation! A strong r only means association, not that one variable causes the other.
scatterplot · LSRL · residual plot · two-way table
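Worked example for Unit 2: a minimal Python sketch that builds the LSRL from the Key Formulas (b = r · s_y / s_x and a = ȳ − b·x̄) and then computes residuals and r². The `hours` and `points` data and the function name `lsrl` are hypothetical.

```python
import statistics

def lsrl(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    # r is the average product of z-scores, dividing by n - 1.
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x          # slope
    a = y_bar - b * x_bar      # y-intercept
    return a, b, r

# Hypothetical data: hours studied vs. quiz points.
hours = [1, 2, 3, 4, 5]
points = [2, 4, 5, 4, 5]

a, b, r = lsrl(hours, points)
predicted = [a + b * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(points, predicted)]  # observed - predicted
r_squared = r ** 2

print(f"y-hat = {a:.2f} + {b:.2f}x, r = {r:.3f}, r^2 = {r_squared:.2f}")
```

For this data the line is ŷ = 2.2 + 0.6x: each additional hour is associated with a predicted increase of 0.6 points, and the residuals sum to zero, as they do for any least-squares line.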
Unit 3
Collecting Data
  • Census vs. Sample: Census = entire population. Sample = subset. Good samples are representative and avoid bias.
  • Sampling Methods: SRS (every set of n has equal chance), Stratified (divide into groups, SRS from each), Cluster (pick whole groups), Systematic (every kth), Convenience (avoid!).
  • Sources of Bias: Undercoverage, nonresponse bias, response bias, voluntary response bias. These distort results, and a larger n cannot fix them.
  • Observational Study vs. Experiment: Observational: observe without imposing a treatment; can show association only. Experiment: impose treatments; can show causation.
  • Experimental Design Principles: Randomization (assign to treatments randomly), Control (keep other variables constant), Replication (enough subjects), Blocking (group similar subjects together first, then randomize within each block).
  • Confounding Variable: A variable that is related to both the explanatory and response variable, making it hard to isolate the true effect.
  • Placebo & Blinding: Placebo: fake treatment to control the placebo effect. Single-blind: subjects don't know group. Double-blind: neither subjects nor evaluators know.
Only a well-designed randomized experiment with control can establish cause-and-effect. Observational studies cannot!
SRS · stratified · cluster · experiment · blocking · bias
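Worked example for Unit 3: a minimal Python sketch of an SRS, a stratified sample, a systematic sample, and random assignment to treatment groups. The roster, the strata names, and the sample sizes are all hypothetical; a fixed seed is used only so the run is repeatable.

```python
import random

rng = random.Random(42)  # seeded for repeatability only

# Hypothetical population of 100 labeled students.
population = [f"student_{i:03d}" for i in range(100)]

# SRS: every set of 10 students is equally likely to be chosen.
srs = rng.sample(population, 10)

# Stratified: split into groups, then take an SRS from each group.
strata = {"juniors": population[:60], "seniors": population[60:]}
stratified = {name: rng.sample(members, 5) for name, members in strata.items()}

# Systematic: random starting point, then every k-th individual.
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# Random assignment in an experiment: shuffle, then split into groups.
subjects = population[:20]
shuffled = subjects[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:10], shuffled[10:]

print(len(srs), {g: len(s) for g, s in stratified.items()},
      len(systematic), len(treatment), len(control))
```

Random sampling supports generalizing to the population; random assignment is what supports a cause-and-effect conclusion.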
Unit 4
Probability, Random Variables & Distributions
  • Basic Probability Rules: P(A) is between 0 and 1. P(A') = 1 − P(A). P(A ∪ B) = P(A) + P(B) − P(A ∩ B). If mutually exclusive: P(A ∩ B) = 0.
  • Independent Events: A and B are independent if P(A|B) = P(A), or equivalently P(A ∩ B) = P(A)·P(B). Don't confuse with mutually exclusive!
  • Conditional Probability: P(A|B) = P(A ∩ B) / P(B). The probability of A given that B has already occurred.
  • Random Variables: Discrete (countable outcomes) vs. Continuous (range of values). Each has its own mean (μ) and standard deviation (σ).
  • Combining Random Variables: E(X ± Y) = E(X) ± E(Y). Var(X ± Y) = Var(X) + Var(Y), but only if X and Y are INDEPENDENT. Always add variances, never std devs!
  • Binomial Distribution: Fixed n trials, 2 outcomes (success/failure), constant p, independent trials. X ~ B(n, p).
  • Geometric Distribution: Count trials until the first success. P(X = k) = (1−p)^(k−1) · p. Mean = 1/p.
Key Formulas
  • Binomial Probability: P(X=k) = C(n,k)·pᵏ·(1−p)ⁿ⁻ᵏ
  • Binomial Mean: μ = np
  • Binomial Std Dev: σ = √[np(1−p)]
  • Geometric Mean: μ = 1/p
Standard deviations never add directly. To combine independent random variables, add the variances first, then take the square root.
binomial · geometric · P(A|B) · independence · expected value
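Worked example for Unit 4: a minimal Python sketch of the binomial and geometric formulas and of combining standard deviations through variances. The parameter values (n = 10, p = 0.3, and so on) and the function names are chosen only for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def geometric_pmf(k, p):
    """P(first success occurs on trial k)."""
    return (1 - p) ** (k - 1) * p

n, p = 10, 0.3
binom_mean = n * p                       # mu = np
binom_sd = math.sqrt(n * p * (1 - p))    # sigma = sqrt(np(1 - p))
p_three = binomial_pmf(3, n, p)          # about 0.2668
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))  # probabilities sum to 1

p_first_on_third = geometric_pmf(3, 0.2)  # 0.8 * 0.8 * 0.2 = 0.128

# Combining independent X and Y with sd 3 and sd 4:
# add variances (9 + 16 = 25), then take the square root.
sd_x, sd_y = 3, 4
sd_sum = math.sqrt(sd_x ** 2 + sd_y ** 2)  # 5, not 3 + 4 = 7

print(round(p_three, 4), round(p_first_on_third, 3), sd_sum)
```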
Unit 5
Sampling Distributions
  • Sampling Distribution: The distribution of a statistic (like x̄ or p̂) from all possible samples of size n from a population. Not the same as the data distribution!
  • Sampling Distribution of p̂ (Proportions): Mean = p, Std Dev = √[p(1−p)/n]. Approx. normal when np ≥ 10 and n(1−p) ≥ 10 (Large Counts condition).
  • Sampling Distribution of x̄ (Means): Mean = μ, Std Dev = σ/√n. Normal if the population is normal; approximately normal if n ≥ 30 (Central Limit Theorem).
  • Central Limit Theorem (CLT): For large enough n (rule of thumb: n ≥ 30), the sampling distribution of x̄ is approximately normal, regardless of the shape of the population.
  • Standard Error: The standard deviation of a sampling distribution. For x̄ it is σ/√n, estimated by s/√n when σ is unknown. As n increases, spread decreases.
  • Bias vs. Variability: Bias = center of sampling dist ≠ parameter (bad!). Variability = spread. Larger n reduces variability, but not bias.
Key Formulas
  • SE of p̂: σ_p̂ = √[ p(1−p) / n ]
  • SE of x̄: σ_x̄ = σ / √n
The 10% condition: n must be ≤ 10% of the population for independence to be assumed when sampling without replacement.
CLT · standard error · large counts
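Worked example for Unit 5: a minimal Python simulation sketch that checks σ_x̄ = σ/√n for a strongly right-skewed population. The population (exponential with mean 1 and standard deviation 1), the sample size, the number of repetitions, and the function names are all assumptions made for illustration.

```python
import math
import random
import statistics

def large_counts_ok(n, p):
    """Large Counts condition for the sampling distribution of p-hat."""
    return n * p >= 10 and n * (1 - p) >= 10

def se_proportion(p, n):
    return math.sqrt(p * (1 - p) / n)

def se_mean(sigma, n):
    return sigma / math.sqrt(n)

rng = random.Random(1)   # seeded for repeatability only
n, reps = 40, 5000

# Draw many samples of size n from a skewed population (mu = 1, sigma = 1)
# and record each sample mean.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

center = statistics.fmean(sample_means)   # should be close to mu = 1
spread = statistics.stdev(sample_means)   # should be close to 1 / sqrt(40)

print(round(center, 3), round(spread, 3), round(se_mean(1.0, n), 3))
```

A histogram of `sample_means` would look roughly bell-shaped even though the population is skewed, which is the Central Limit Theorem at work.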
Unit 6
Inference for Categorical Data: Proportions
  • Confidence Interval for p: Estimate the true proportion with a range. CI = p̂ ± z*·SE. A higher confidence level gives a wider interval: more confidence, less precision.
  • Interpreting a CI: "We are 95% confident the true proportion of [context] is between [L] and [U]." NOT "there's a 95% chance the parameter is in this interval."
  • Conditions for Inference (Proportions): Random sample, 10% condition (n ≤ 10% of N), Large Counts (np̂ ≥ 10 and n(1−p̂) ≥ 10).
  • One-sample z-test for p: H₀: p = p₀. Test statistic: z = (p̂ − p₀) / √[p₀(1−p₀)/n]. Use p₀ in the denominator, not p̂!
  • Two-sample z-test/CI for p₁ − p₂: Compare two proportions. For test, use pooled p̂ in SE. For CI, use p̂₁ and p̂₂ separately.
  • p-value: Probability of getting a result at least as extreme as the one observed, assuming H₀ is true. Small p-value → evidence against H₀.
  • Type I & Type II Errors: Type I: Reject H₀ when it's true (probability α). Type II: Fail to reject H₀ when it's false (probability β). Power = 1 − β.
Key Formulas
  • CI for p: p̂ ± z* · √[ p̂(1−p̂)/n ]
  • z-test for p: z = (p̂ − p₀) / √[ p₀(1−p₀)/n ]
  • Margin of Error: ME = z* · SE
Always check ALL three conditions (Random, 10%, Large Counts) and state them explicitly on the AP exam!
z-test · CI · p-value · Type I/II error · power
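Worked example for Unit 6: a minimal Python sketch of a one-sample z-interval and a two-sided one-sample z-test for a proportion. The survey numbers (220 successes out of 400) and the function names are hypothetical, and the conditions (Random, 10%, Large Counts) are assumed to have been checked.

```python
import math
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence=0.95):
    """CI for p: p-hat +/- z* * sqrt(p-hat(1 - p-hat) / n)."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def one_prop_z_test(successes, n, p0):
    """Two-sided test of H0: p = p0. Uses p0 (not p-hat) in the SE."""
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical survey: 220 of 400 respondents say yes.
successes, n = 220, 400
low, high = one_prop_z_interval(successes, n)      # roughly (0.501, 0.599)
z, p_value = one_prop_z_test(successes, n, 0.5)    # z is about 2.0

print(f"95% CI: ({low:.3f}, {high:.3f}); z = {z:.2f}, p-value = {p_value:.4f}")
```

The interval uses p̂ in its standard error while the test uses p₀, matching the two formulas above. With a p-value near 0.046, H₀: p = 0.5 would be rejected at α = 0.05, which agrees with 0.5 falling just outside the 95% interval.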
Unit 7
Inference for Quantitative Data: Means
  • t-distribution: Used when population σ is unknown (almost always). Has heavier tails than normal. Degrees of freedom = n − 1. As df → ∞, t → z.
  • Conditions for Inference (Means): Random, 10% condition, Normal/Large Sample (population normal OR n ≥ 30; if n < 30, no strong skew or outliers in sample).
  • One-sample t-interval: Estimate a population mean. CI = x̄ ± t*·(s/√n). Use t* with df = n−1.
  • One-sample t-test: H₀: μ = μ₀. Test stat: t = (x̄ − μ₀)/(s/√n). Find p-value using t-distribution with df = n−1.
  • Paired t-test: For matched pairs data, compute differences (d = x₁ − x₂), then do a one-sample t-test on the differences. H₀: μ_d = 0.
  • Two-sample t-test: Compare two independent group means. H₀: μ₁ − μ₂ = 0. Use conservative df = smaller(n₁, n₂) − 1, or calculator's Welch df.
Key Formulas
  • t-test statistic: t = (x̄ − μ₀) / (s / √n)
  • CI for μ: x̄ ± t* · (s / √n)
  • Two-sample t-stat: t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
Use paired t when data is matched (before/after, twin studies). Use two-sample t when groups are independent. Getting this wrong is a common mistake!
t-test · t-interval · paired t · two-sample t · df
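Worked example for Unit 7: a minimal Python sketch of the paired and two-sample t statistics with their degrees of freedom. The before/after scores and the group summary statistics are hypothetical. The Welch df line uses the standard Welch-Satterthwaite formula, which is an assumption about what a calculator reports. The p-value itself still comes from a t table or calculator, since Python's standard library has no t-distribution.

```python
import math
import statistics

def one_sample_t(data, mu0):
    """Return (t, df) for H0: mu = mu0."""
    n = len(data)
    se = statistics.stdev(data) / math.sqrt(n)
    return (statistics.mean(data) - mu0) / se, n - 1

def paired_t(first, second):
    """Paired t-test: one-sample t on the differences d = first - second."""
    diffs = [a - b for a, b in zip(first, second)]
    return one_sample_t(diffs, 0)

def two_sample_t(mean1, s1, n1, mean2, s2, n2):
    """Return (t, conservative df, Welch df) for H0: mu1 - mu2 = 0."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df_conservative = min(n1, n2) - 1
    df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df_conservative, df_welch

# Hypothetical matched pairs: the same five students before and after tutoring.
before = [72, 75, 68, 80, 77]
after = [75, 79, 70, 82, 84]
t_paired, df_paired = paired_t(after, before)   # differences: 3, 4, 2, 2, 7

# Hypothetical independent groups, given as summary statistics.
t_two, df_cons, df_welch = two_sample_t(52, 6, 36, 49, 8, 64)

print(round(t_paired, 3), df_paired, round(t_two, 3), df_cons, round(df_welch, 1))
```

Treating the before/after scores as two independent groups would ignore the pairing and typically give a much smaller t statistic.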
Unit 8
Inference for Categorical Data: Chi-Square
  • Chi-Square Goodness of Fit: Tests if one categorical variable's observed distribution matches a hypothesized distribution. H₀: The population distribution is [stated]. df = k − 1.
  • Chi-Square Test for Homogeneity: Compares the distribution of a single categorical variable across two or more independent groups/populations. df = (r−1)(c−1).
  • Chi-Square Test for Independence: Tests if two categorical variables are associated within a single population. H₀: The two variables are independent. df = (r−1)(c−1).
  • Expected Counts: For two-way tables, E = (row total × column total) / grand total. For goodness of fit, E = n × hypothesized proportion. All expected counts must be ≥ 5 (Large Counts condition).
  • Chi-Square Statistic: χ² = Σ (O − E)² / E. A larger χ² = more evidence against H₀. Always a right-tail test. Never negative!
  • Conditions: Random sample(s), independence (10% condition), Large Counts (all expected ≥ 5).
Key Formulas
  • Chi-Square Statistic: χ² = Σ [ (O − E)² / E ]
  • Expected Count: E = (row total × col total) / n
  • Goodness of Fit df: df = k − 1
  • Two-way table df: df = (rows−1)(cols−1)
Homogeneity = multiple populations, one variable. Independence = one population, two variables. The math is the same, but the context and hypotheses differ!
χ² GOF · homogeneity · independence · expected counts · right-tail
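Worked example for Unit 8: a minimal Python sketch that computes expected counts, the χ² statistic, and df for a two-way table and for a goodness-of-fit test. The observed counts and the function names are hypothetical; the p-value (right-tail area) would still come from a χ² table or calculator.

```python
def expected_counts(table):
    """E = (row total * column total) / grand total for each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_two_way(table):
    """Return (chi-square, df) for a homogeneity or independence test."""
    expected = expected_counts(table)
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def chi_square_gof(observed, proportions):
    """Return (chi-square, df) for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in proportions]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

# Hypothetical 2x2 table (rows: two groups, columns: yes/no).
table = [[30, 20],
         [20, 30]]
expected = expected_counts(table)                     # every cell is 25.0
large_counts = all(e >= 5 for row in expected for e in row)
chi2_table, df_table = chi_square_two_way(table)      # 4.0 with df = 1

# Hypothetical goodness of fit: 60 rolls of a three-sided spinner, claimed fair.
chi2_gof, df_gof = chi_square_gof([18, 22, 20], [1 / 3, 1 / 3, 1 / 3])

print(chi2_table, df_table, round(chi2_gof, 3), df_gof, large_counts)
```

The same `chi_square_two_way` arithmetic serves both homogeneity and independence; only the hypotheses and the way the data were collected differ.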
Unit 9
Inference for Quantitative Data: Slopes
  • Population Regression Model: y = α + βx + ε. β is the true population slope. b (from sample) is the estimate of β. The error terms ε are assumed to be normal with mean 0.
  • Conditions for Regression Inference (LINER): Linear (scatterplot looks linear), Independent residuals, Normal residual distribution, Equal variance (residual plot shows no fan shape), Random sample.
  • t-test for slope: H₀: β = 0 (no linear relationship). Test stat: t = b / SE_b. df = n − 2. A significant result → useful linear model.
  • Confidence Interval for β: Estimates the true slope. CI = b ± t*·SE_b, with df = n − 2. If the CI doesn't include 0, there's evidence of a real linear relationship.
  • SE of the Slope (SE_b): Measures how much the sample slope b would vary from sample to sample. Smaller SE_b → more precise estimate of β.
  • Reading Computer Output: The AP exam often gives regression output tables. Find: Coef (b), SE Coef (SE_b), T (t-stat), P (p-value). Use these to build tests and intervals.
Key Formulas
  • t-test for slope: t = b / SE_b
  • CI for β: b ± t* · SE_b
  • Degrees of freedom: df = n − 2
Check LINER conditions using the residual plot (random scatter) and a histogram or normal prob. plot of residuals. Always state each condition clearly!
t-test for β · CI for slope · LINER · residual plot · computer output
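Worked example for Unit 9: a minimal Python sketch of the t statistic and confidence interval for a slope, computed from raw data instead of computer output. The data set and function name are hypothetical. The formula SE_b = s / (s_x · √(n − 1)), with s = √[Σ residuals² / (n − 2)], is the standard one and is an addition not stated in this guide. The critical value t* = 3.182 (df = 3, 95% confidence) is taken from a t table.

```python
import math
import statistics

def slope_inference(xs, ys):
    """Return (b, SE_b, t, df) for H0: beta = 0."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                       # sample slope
    a = y_bar - b * x_bar               # sample intercept
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    # s: standard deviation of the residuals, with n - 2 degrees of freedom.
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    se_b = s / (statistics.stdev(xs) * math.sqrt(n - 1))
    return b, se_b, b / se_b, n - 2

# Hypothetical data: hours studied vs. quiz points.
hours = [1, 2, 3, 4, 5]
points = [2, 4, 5, 4, 5]

b, se_b, t, df = slope_inference(hours, points)

T_STAR = 3.182  # table value for df = 3, 95% confidence
ci_low, ci_high = b - T_STAR * se_b, b + T_STAR * se_b

print(f"b = {b:.2f}, SE_b = {se_b:.4f}, t = {t:.3f}, df = {df}")
print(f"95% CI for beta: ({ci_low:.3f}, {ci_high:.3f})")
```

With only five points the interval runs from about −0.30 to 1.50 and includes 0, so this small sample does not give convincing evidence of a linear relationship even though b = 0.6.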