AP Statistics – Concept Guide

Unit 1

Exploring One-Variable Data

Core Concepts

Types of Variables: Categorical (groups/labels) vs. Quantitative (numbers). Discrete = countable; Continuous = measurable.
Center: Mean = average (sensitive to outliers). Median = middle value (resistant to outliers). Use median when data is skewed.
Spread: Range, IQR (Q3 − Q1), Standard Deviation (s), Variance (s²). IQR is resistant; std dev is not.
Shape of Distribution: Symmetric, skewed left (tail left), skewed right (tail right). Look for outliers, clusters, and gaps.
5-Number Summary: Min, Q1, Median, Q3, Max. Used to build boxplots. Outlier rule: < Q1 − 1.5×IQR or > Q3 + 1.5×IQR.
z-Score (Standardizing): Tells how many standard deviations a value is from the mean. Allows comparison across different distributions.
Normal Distribution: Bell-shaped, symmetric. Empirical Rule: 68% within ±1σ, 95% within ±2σ, 99.7% within ±3σ.

Key Formulas

Mean: x̄ = Σxᵢ / n
Sample Std Dev: s = √[ Σ(xᵢ − x̄)² / (n−1) ]
z-Score: z = (x − μ) / σ
IQR: IQR = Q3 − Q1
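
As a rough check on these formulas, the sketch below computes the mean and sample standard deviation with Python's statistics module, then uses z-scores to compare values from two different distributions. All numbers are invented for illustration, and z_score is a hypothetical helper name.

```python
from statistics import mean, stdev

data = [4, 8, 6, 5, 7]      # made-up sample
x_bar = mean(data)          # sum of values divided by n
s = stdev(data)             # sample standard deviation (divides by n - 1)

def z_score(x, center, spread):
    """Number of standard deviations x lies from the center."""
    return (x - center) / spread

# Comparing across distributions (made-up means and standard deviations):
z_test_a = z_score(1300, 1050, 200)   # 1.25
z_test_b = z_score(28, 21, 5)         # 1.4
```

The second score is the more unusual one relative to its own distribution, even though the raw numbers are on very different scales.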

Skewed right → mean > median. Skewed left → mean < median. Always match summary stats to the shape!

dotplot, histogram, boxplot, stemplot, normal curve

Unit 2

Exploring Two-Variable Data

Core Concepts

Scatterplot: Shows the relationship between two quantitative variables. Describe: Direction, Form (linear/curved), Strength, Outliers.
Correlation (r): Measures strength and direction of a linear relationship. Range: −1 to +1. r = 0 means no linear association. Sensitive to outliers!
Least-Squares Regression Line (LSRL): ŷ = a + bx. Minimizes the sum of squared residuals. The line always passes through (x̄, ȳ).
Slope (b): For each 1-unit increase in x, the predicted y (ŷ) changes by b units. Must include units and context in interpretation.
y-intercept (a): Predicted value of y when x = 0. Only meaningful if x = 0 makes sense in context.
Residual: Residual = observed y − predicted ŷ. Positive residual = point above line. Residual plots check linearity.
r² (Coefficient of Determination): % of variation in y explained by the linear relationship with x. Example: r² = 0.81 → 81% of the variation in y is explained by the linear relationship with x.
Categorical Associations: Use two-way tables. Compare conditional distributions (row % or column %). Look for an association.

Key Formulas

Slope: b = r · (Sᵧ / Sₓ)
y-intercept: a = ȳ − b·x̄
Residual: e = y − ŷ
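
One way to see how these formulas fit together is to build the LSRL by hand on a tiny made-up data set. The sketch below computes r as the average product of z-scores (dividing by n − 1), then applies the slope and intercept formulas; predict is a hypothetical helper name.

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]     # made-up explanatory values
y = [2, 4, 5, 4, 5]     # made-up response values

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Correlation: sum of z-score products divided by n - 1
r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y for xi, yi in zip(x, y)) / (n - 1)

b = r * s_y / s_x        # slope, about 0.6 for this data
a = y_bar - b * x_bar    # y-intercept, about 2.2 for this data

def predict(x_value):
    """Predicted y from the fitted line y-hat = a + b*x."""
    return a + b * x_value

residuals = [yi - predict(xi) for xi, yi in zip(x, y)]   # observed minus predicted
r_squared = r ** 2                                        # about 0.6
```

Two quick sanity checks: the residuals sum to (approximately) zero, and predict(x_bar) returns y_bar, which is the "line passes through (x̄, ȳ)" fact.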

Correlation ≠ Causation! A strong r only means association, not that one variable causes the other.

scatterplot, LSRL, r², residual plot, two-way table

Unit 3

Collecting Data

Core Concepts

Census vs. Sample: Census = entire population. Sample = subset. Good samples are representative and avoid bias.
Sampling Methods: SRS (every possible sample of size n is equally likely), Stratified (divide into similar groups, SRS from each), Cluster (randomly pick whole groups), Systematic (random start, then every kth), Convenience (avoid!).
Sources of Bias: Undercoverage, nonresponse bias, response bias, voluntary response bias. These make results systematically wrong and can't be fixed by a larger n.
Observational Study vs. Experiment: Observational: observe without imposing a treatment; can show association only. Experiment: impose treatments; can show causation.
Experimental Design Principles: Randomization (assign to treatments randomly), Control (keep other variables constant), Replication (enough subjects in each group), Blocking (group similar subjects together first, then randomize within each block).
Confounding Variable: A variable that is related to both the explanatory and response variable, making it hard to isolate the true effect.
Placebo & Blinding: Placebo: fake treatment to control for the placebo effect. Single-blind: subjects don't know which group they are in. Double-blind: neither subjects nor evaluators know.
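
The sampling methods listed above can be sketched with Python's random module. Everything here is hypothetical: the population is just ID numbers 1 to 100, the two strata and ten clusters are arbitrary splits, and the seed is fixed only so the sketch is repeatable.

```python
import random

rng = random.Random(1)                  # fixed seed for repeatability
population = list(range(1, 101))        # hypothetical IDs 1..100

# SRS: every possible set of 10 IDs is equally likely
srs = rng.sample(population, 10)

# Stratified: split into strata, then take an SRS from each stratum
strata = {"low": population[:50], "high": population[50:]}
stratified = [unit for group in strata.values() for unit in rng.sample(group, 5)]

# Cluster: split into groups of 10, then take every unit in 2 randomly chosen groups
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [unit for group in rng.sample(clusters, 2) for unit in group]

# Systematic: random starting point, then every kth unit
k = 10
start = rng.randrange(k)
systematic = population[start::k]
```

Note the difference in what gets randomly selected: individuals (SRS), individuals within every group (stratified), or whole groups (cluster).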

Only a well-designed randomized experiment with control can establish cause-and-effect. Observational studies cannot!

SRS, stratified, cluster, experiment, blocking, bias

Unit 4

Probability, Random Variables & Distributions

Core Concepts

Basic Probability Rules: P(A) is between 0 and 1. P(A') = 1 − P(A). P(A ∪ B) = P(A) + P(B) − P(A ∩ B). If mutually exclusive: P(A ∩ B) = 0.
Independent Events: A and B are independent if P(A|B) = P(A), or equivalently P(A ∩ B) = P(A)·P(B). Don't confuse with mutually exclusive!
Conditional Probability: P(A|B) = P(A ∩ B) / P(B). The probability of A given that B has occurred.
Random Variables: Discrete (countable outcomes) vs. Continuous (range of values). Each has its own mean (μ) and standard deviation (σ).
Combining Random Variables: E(X ± Y) = E(X) ± E(Y) always. Var(X ± Y) = Var(X) + Var(Y) only if X and Y are INDEPENDENT. Always add variances, never std devs!
Binomial Distribution: Fixed n trials, 2 outcomes (success/failure), constant p, independent trials. X ~ B(n, p), where X counts the successes.
Geometric Distribution: Count trials until the first success. P(X = k) = (1−p)^(k−1) · p. Mean = 1/p.
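
The addition rule, conditional probability, and the independence check can all be worked from one small table of counts. The sketch below uses made-up counts for 100 students (A = plays a sport, B = has a job) and exact fractions so the independence comparison is not affected by rounding.

```python
from fractions import Fraction

# Made-up counts for 100 students
both, sport_only, job_only, neither = 12, 28, 18, 42
total = both + sport_only + job_only + neither

p_sport = Fraction(both + sport_only, total)   # P(A) = 2/5
p_job = Fraction(both + job_only, total)       # P(B) = 3/10
p_both = Fraction(both, total)                 # P(A ∩ B) = 3/25

p_either = p_sport + p_job - p_both            # general addition rule: 29/50
p_sport_given_job = p_both / p_job             # P(A|B) = 2/5
independent = p_both == p_sport * p_job        # True for these counts
```

For these counts P(A|B) equals P(A), so the events are independent. They are not mutually exclusive, since 12 students are in both groups.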

Key Formulas

Binomial Probability: P(X=k) = C(n,k)·pᵏ·(1−p)ⁿ⁻ᵏ
Binomial Mean: μ = np
Binomial Std Dev: σ = √[np(1−p)]
Geometric Mean: μ = 1/p
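
These formulas translate almost directly into code. The sketch below is a minimal version using only the standard library; the function names and the values n = 10, p = 0.3 are chosen purely for illustration.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def geometric_pmf(k, p):
    """P(first success occurs on trial k)."""
    return (1 - p)**(k - 1) * p

n, p = 10, 0.3                            # made-up parameters
prob_three = binomial_pmf(3, n, p)        # about 0.2668
mu = n * p                                # binomial mean, 3
sigma = sqrt(n * p * (1 - p))             # binomial std dev, about 1.449
first_on_fourth = geometric_pmf(4, p)     # about 0.1029
geometric_mean = 1 / p                    # expected trials until first success
```

A useful check is that the binomial probabilities for k = 0, 1, ..., n add up to 1.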

To find the standard deviation of a sum or difference of independent random variables, add the variances first, then take the square root. Never add standard deviations directly.
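
A short numeric illustration, assuming X and Y are independent and using made-up means and standard deviations:

```python
from math import sqrt

mu_x, sigma_x = 50, 3    # made-up mean and std dev of X
mu_y, sigma_y = 30, 4    # made-up mean and std dev of Y

mean_sum = mu_x + mu_y                      # 80
mean_diff = mu_x - mu_y                     # 20

# Variances add for both the sum and the difference (independence assumed)
sd_sum = sqrt(sigma_x**2 + sigma_y**2)      # 5.0
sd_diff = sqrt(sigma_x**2 + sigma_y**2)     # 5.0

wrong_sd = sigma_x + sigma_y                # 7, the common mistake
```

The standard deviation of both X + Y and X − Y is 5, not 7 and not 1.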

binomial, geometric, P(A|B), independence, expected value

Unit 5

Sampling Distributions

Core Concepts

Sampling Distribution: The distribution of a statistic (like x̄ or p̂) from all possible samples of size n from a population. Not the same as the data distribution!
Sampling Distribution of p̂ (Proportions): Mean = p, Std Dev = √[p(1−p)/n]. Approx. normal when np ≥ 10 and n(1−p) ≥ 10 (Large Counts condition).
Sampling Distribution of x̄ (Means): Mean = μ, Std Dev = σ/√n. Normal if the population is normal; approximately normal if n ≥ 30 (Central Limit Theorem).
Central Limit Theorem (CLT): For large enough n (rule of thumb: n ≥ 30), the sampling distribution of x̄ is approximately normal, regardless of the shape of the population.
Standard Error: The standard deviation of a sampling distribution, usually estimated from sample data. For x̄ it is σ/√n (s/√n when σ is unknown). As n increases, spread decreases.
Bias vs. Variability: Bias = center of sampling dist ≠ parameter (bad!). Variability = spread. Larger n reduces variability, but not bias.

Key Formulas

SE of p̂: σ_p̂ = √[ p(1−p) / n ]
SE of x̄: σ_x̄ = σ / √n
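
The sketch below wraps these two formulas, plus the Large Counts and 10% checks, in small helper functions. The function names and all the numbers are hypothetical; the point is to show how the spread shrinks as n grows.

```python
from math import sqrt

def sd_p_hat(p, n):
    """Standard deviation of the sampling distribution of p-hat."""
    return sqrt(p * (1 - p) / n)

def sd_x_bar(sigma, n):
    """Standard deviation of the sampling distribution of x-bar."""
    return sigma / sqrt(n)

def large_counts_ok(p, n):
    """Large Counts condition: np >= 10 and n(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10

def ten_percent_ok(n, population_size):
    """10% condition: sample is at most 10% of the population."""
    return n <= 0.10 * population_size

sd_p = sd_p_hat(0.4, 150)          # about 0.04
sd_mean_36 = sd_x_bar(12, 36)      # 2.0
sd_mean_144 = sd_x_bar(12, 144)    # 1.0
```

Quadrupling the sample size from 36 to 144 cuts the standard deviation of x̄ in half, because n sits under a square root.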

The 10% condition: n must be ≤ 10% of the population for independence to be assumed when sampling without replacement.

CLT, standard error, p̂, x̄, large counts

Unit 6

Inference for Categorical Data: Proportions

Core Concepts

Confidence Interval for p: Estimate the true proportion with a range. CI = p̂ ± z*·SE. For a fixed sample size, a wider interval = less precision but more confidence.
Interpreting a CI: "We are 95% confident the true proportion of [context] is between [L] and [U]." NOT "there's a 95% chance the parameter is in this interval."
Conditions for Inference (Proportions): Random sample, 10% condition (n ≤ 10% of N), Large Counts (np̂ ≥ 10 and n(1−p̂) ≥ 10).
One-sample z-test for p: H₀: p = p₀. Test statistic: z = (p̂ − p₀) / √[p₀(1−p₀)/n]. Use p₀ in the denominator, not p̂!
Two-sample z-test/CI for p₁ − p₂: Compare two proportions. For the test, use the pooled p̂ in the SE. For the CI, use p̂₁ and p̂₂ separately.
p-value: Probability of getting a result at least as extreme as the one observed, assuming H₀ is true. Small p-value → evidence against H₀.
Type I & Type II Errors: Type I: Reject H₀ when it's true (probability α). Type II: Fail to reject H₀ when it's false (probability β). Power = 1 − β.
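
The pooled-versus-unpooled distinction for two proportions is easier to see with numbers. The sketch below uses made-up counts (60 of 200 versus 45 of 200), a two-sided alternative, and a 95% confidence level; statistics.NormalDist supplies the normal CDF and its inverse.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 60, 200     # made-up successes and sample size, group 1
x2, n2 = 45, 200     # made-up successes and sample size, group 2
p1, p2 = x1 / n1, x2 / n2

# Test of H0: p1 - p2 = 0 uses the pooled proportion in the SE
pooled = (x1 + x2) / (n1 + n2)
se_test = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se_test                              # about 1.70
p_value = 2 * (1 - NormalDist().cdf(abs(z)))         # two-sided, about 0.088

# Confidence interval uses each sample proportion separately
se_ci = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_star = NormalDist().inv_cdf(0.975)                 # about 1.96 for 95% confidence
ci = (p1 - p2 - z_star * se_ci, p1 - p2 + z_star * se_ci)
```

With these counts the p-value is above 0.05 and the 95% interval contains 0, so the test and the interval tell the same story.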

Key Formulas

CI for p: p̂ ± z* · √[ p̂(1−p̂)/n ]
z-test for p: z = (p̂ − p₀) / √[ p₀(1−p₀)/n ]
Margin of Error: ME = z* · SE
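
Here is a minimal sketch of a one-sample interval and test for a proportion, assuming the Random, 10%, and Large Counts conditions have already been checked. The counts (220 successes out of 400) and the null value p₀ = 0.5 are made up.

```python
from math import sqrt
from statistics import NormalDist

n, successes = 400, 220      # made-up sample
p_hat = successes / n        # 0.55
p0 = 0.5                     # hypothesized proportion under H0

# 95% confidence interval: SE uses p-hat
z_star = NormalDist().inv_cdf(0.975)
se = sqrt(p_hat * (1 - p_hat) / n)
margin = z_star * se                                   # about 0.049
interval = (p_hat - margin, p_hat + margin)

# Test of H0: p = p0: the denominator uses p0, not p-hat
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)             # about 2.0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))           # two-sided, about 0.046
```

Notice that the interval and the test use different standard errors: p̂ for the interval, p₀ for the test.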

Always check ALL three conditions (Random, 10%, Large Counts) and state them explicitly on the AP exam!

z-test, CI, p-value, Type I/II error, power

Unit 7

Inference for Quantitative Data: Means

Core Concepts

t-distribution: Used when population σ is unknown (almost always). Has heavier tails than normal. Degrees of freedom = n − 1. As df → ∞, t → z.
Conditions for Inference (Means): Random, 10% condition, Normal/Large Sample (population normal OR n ≥ 30; if n < 30, no strong skew or outliers in sample).
One-sample t-interval: Estimate a population mean. CI = x̄ ± t*·(s/√n). Use t* with df = n−1.
One-sample t-test: H₀: μ = μ₀. Test stat: t = (x̄ − μ₀)/(s/√n). Find p-value using t-distribution with df = n−1.
Paired t-test: For matched pairs data, compute differences (d = x₁ − x₂), then do a one-sample t-test on the differences. H₀: μ_d = 0.
Two-sample t-test: Compare two independent group means. H₀: μ₁ − μ₂ = 0. Use the conservative df = min(n₁, n₂) − 1, or the df from a calculator (Welch's approximation).

Key Formulas

t-test statistic: t = (x̄ − μ₀) / (s / √n)
CI for μ: x̄ ± t* · (s / √n)
Two-sample t-stat: t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
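
The two test statistics can be computed with the standard library alone, as sketched below on made-up samples. The function names are hypothetical, and the two-sample version returns the conservative df. Python's standard library has no t-distribution, so the p-value would still come from a t table or a calculator.

```python
from math import sqrt
from statistics import mean, stdev, variance

def one_sample_t(sample, mu0):
    """Return (t statistic, df) for H0: mu = mu0."""
    n = len(sample)
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    return t, n - 1

def two_sample_t(sample1, sample2):
    """Return (t statistic, conservative df) for H0: mu1 - mu2 = 0."""
    n1, n2 = len(sample1), len(sample2)
    se = sqrt(variance(sample1) / n1 + variance(sample2) / n2)
    t = (mean(sample1) - mean(sample2)) / se
    return t, min(n1, n2) - 1

group_a = [12, 15, 11, 14, 13]     # made-up data
group_b = [10, 9, 12, 11, 8]       # made-up data

t_one, df_one = one_sample_t(group_a, 10)        # t about 4.24, df = 4
t_two, df_two = two_sample_t(group_a, group_b)   # t about 3.0, df = 4
```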

Use paired t when data is matched (before/after, twin studies). Use two-sample t when groups are independent. Getting this wrong is a common mistake!
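
For matched data, the whole procedure reduces to a one-sample t calculation on the differences. A minimal sketch with made-up before/after scores for five subjects, taking each difference as after minus before:

```python
from math import sqrt
from statistics import mean, stdev

before = [72, 80, 65, 90, 78]     # made-up scores
after = [75, 84, 66, 95, 80]      # same five subjects, measured again

# One difference per subject; the order of subtraction must be stated
diffs = [a - b for a, b in zip(after, before)]     # [3, 4, 1, 5, 2]

n = len(diffs)
d_bar = mean(diffs)                # mean difference, 3
s_d = stdev(diffs)                 # std dev of the differences
t = (d_bar - 0) / (s_d / sqrt(n))  # H0: mu_d = 0, t about 4.24
df = n - 1                         # 4
```

The sample size here is the number of pairs (5), not the number of individual measurements (10).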

t-test, t-interval, paired t, two-sample t, df

Unit 8

Inference for Categorical Data: Chi-Square

Core Concepts

Chi-Square Goodness of Fit: Tests if one categorical variable's observed distribution matches a hypothesized distribution. H₀: The population distribution is [stated]. df = k − 1, where k is the number of categories.
Chi-Square Test for Homogeneity: Compares the distribution of a single categorical variable across two or more independent groups/populations. df = (r−1)(c−1).
Chi-Square Test for Independence: Tests if two categorical variables are associated within a single population. H₀: The two variables are independent. df = (r−1)(c−1).
Expected Counts: For two-way tables, E = (row total × column total) / grand total. For goodness of fit, E = n × hypothesized proportion. All expected counts must be ≥ 5 (Large Counts condition).
Chi-Square Statistic: χ² = Σ (O − E)² / E. A larger χ² = more evidence against H₀. Always a right-tail test. Never negative!
Conditions: Random sample(s), independence (10% condition), Large Counts (all expected ≥ 5).

Key Formulas

Chi-Square Statistic: χ² = Σ [ (O − E)² / E ]
Expected Count: E = (row total × col total) / n
Goodness of Fit df: df = k − 1
Two-way table df: df = (rows−1)(cols−1)
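
For a two-way table, the expected counts, the χ² statistic, and the degrees of freedom can be computed as sketched below. The observed counts are made up (2 rows, 3 columns); the p-value would still come from a χ² table or a calculator.

```python
observed = [[20, 30, 50],      # made-up counts, group 1
            [30, 30, 40]]      # made-up counts, group 2

row_totals = [sum(row) for row in observed]           # [100, 100]
col_totals = [sum(col) for col in zip(*observed)]     # [50, 60, 90]
n = sum(row_totals)                                   # 200

# Expected count for each cell: (row total * column total) / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Sum (O - E)^2 / E over every cell of the table
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)                                                     # about 3.11

df = (len(observed) - 1) * (len(observed[0]) - 1)     # 2
large_counts_ok = all(e >= 5 for row in expected for e in row)
```

Each row here has expected counts 25, 30, and 45, so the Large Counts condition is met.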

Homogeneity = multiple populations, one variable. Independence = one population, two variables. The math is the same, but the context and hypotheses differ!

χ² GOF, homogeneity, independence, expected counts, right-tail

Unit 9

Inference for Quantitative Data: Slopes

Core Concepts

Population Regression Model: y = α + βx + ε. β is the true population slope. b (from sample) is the estimate of β. The errors ε (deviations from the true line) are assumed to be normally distributed with mean 0.
Conditions for Regression Inference (LINER): Linear (scatterplot looks linear), Independent residuals, Normal residual distribution, Equal variance (residual plot shows no fan shape), Random sample.
t-test for slope: H₀: β = 0 (no linear relationship). Test stat: t = b / SE_b. df = n − 2. A significant result → convincing evidence of a linear relationship between x and y.
Confidence Interval for β: Estimates the true slope. CI = b ± t*·SE_b, with df = n − 2. If the CI doesn't include 0, there's evidence of a real linear relationship.
SE of the Slope (SE_b): Measures how much the sample slope b would vary from sample to sample. Smaller SE_b → more precise estimate of β.
Reading Computer Output: The AP exam often gives regression output tables. Find: Coef (b), SE Coef (SE_b), T (t-stat), P (p-value). Use these to build tests and intervals.

Key Formulas

t-test for slope: t = b / SE_b
CI for β: b ± t* · SE_b
Degrees of freedom: df = n − 2
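
Given the slope and its standard error from computer output, the test statistic and interval take only a few lines. All values below are made up, and t_star is the 95% critical value for df = 18 as read from a t table, since Python's standard library has no t-distribution.

```python
b = 2.5            # made-up "Coef" for the x variable
se_b = 0.8         # made-up "SE Coef"
n = 20             # made-up number of (x, y) pairs

df = n - 2                   # 18
t = b / se_b                 # test statistic for H0: beta = 0, 3.125

t_star = 2.101               # 95% critical value for df = 18, from a t table
ci = (b - t_star * se_b, b + t_star * se_b)   # about (0.82, 4.18)
```

The interval does not contain 0, which matches what a small p-value for the slope test would indicate.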

Check LINER conditions using the residual plot (random scatter) and a histogram or normal prob. plot of residuals. Always state each condition clearly!

t-test for β, CI for slope, LINER, residual plot, computer output