# Understanding P-Values
P-values are widely used but often misunderstood. This guide explains what they actually mean and how to interpret them correctly.
## What a P-Value Is
The p-value is the probability of seeing results at least as extreme as yours if there were no real difference between control and variant.
> **In Plain English**
>
> "If the variant were actually identical to control, how likely would I be to see this result just by chance?"
## Interpreting P-Values
| P-value | Interpretation |
|---|---|
| < 0.01 | Very strong evidence of a real difference |
| 0.01 - 0.05 | Strong evidence (significant at 95% confidence) |
| 0.05 - 0.10 | Weak evidence, consider more data |
| > 0.10 | Not enough evidence to conclude there's a difference |
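If you prefer the table as code, a small hypothetical helper can encode the same conventional cutoffs (the labels and thresholds come straight from the table above):

```python
# Hypothetical helper mirroring the interpretation table above.
def interpret_p_value(p: float) -> str:
    if p < 0.01:
        return "Very strong evidence of a real difference"
    if p < 0.05:
        return "Strong evidence (significant at 95% confidence)"
    if p < 0.10:
        return "Weak evidence, consider more data"
    return "Not enough evidence to conclude there's a difference"

print(interpret_p_value(0.03))  # Strong evidence (significant at 95% confidence)
```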
## Example
```python
from expstats import conversion

result = conversion.analyze(
    control_visitors=10000,
    control_conversions=500,  # 5.0%
    variant_visitors=10000,
    variant_conversions=600,  # 6.0%
)
print(f"P-value: {result.p_value:.4f}")
```
Output:

```
P-value: 0.0019
```
**Interpretation:** there's only a 0.19% chance of seeing a 1 percentage point difference (or larger, in either direction) if the variant were actually the same as control. That's very unlikely under pure chance, so we conclude the variant really is different.
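You can reproduce this number by hand. The sketch below uses a pooled two-proportion z-test via scipy; the exact test expstats runs internally may differ slightly, but it lands on the same value here.

```python
# Cross-check with a pooled two-proportion z-test (scipy, not expstats).
from math import sqrt
from scipy.stats import norm

n = 10_000
p_control, p_variant = 500 / n, 600 / n
pooled = (500 + 600) / (2 * n)  # rate under "no real difference"

se = sqrt(pooled * (1 - pooled) * 2 / n)
z = (p_variant - p_control) / se
p_value = 2 * norm.sf(abs(z))  # two-tailed
print(f"z = {z:.2f}, p = {p_value:.4f}")  # z ≈ 3.10, p ≈ 0.0019
```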
## What P-Values Do NOT Mean
❌ Wrong: "There's a 0.03% chance the variant isn't better"
❌ Wrong: "The variant is 99.97% likely to be better"
❌ Wrong: "The effect is large"
✅ Right: "If there was no real difference, we'd only see results this extreme 0.03% of the time"
## The 0.05 Threshold
The conventional threshold is p < 0.05 (5%), which corresponds to 95% confidence.
```python
if result.p_value < 0.05:
    print("Statistically significant at 95% confidence")
else:
    print("Not statistically significant")
```
> **0.05 is arbitrary**
>
> The 0.05 threshold is a convention, not a law of nature. A p-value of 0.051 isn't meaningfully different from 0.049.
## Relationship to Confidence Level
| Confidence Level | P-value Threshold |
|---|---|
| 90% | < 0.10 |
| 95% | < 0.05 |
| 99% | < 0.01 |
```python
# 95% confidence (default)
result_95 = conversion.analyze(..., confidence=95)
# Significant if p < 0.05

# 99% confidence
result_99 = conversion.analyze(..., confidence=99)
# Significant if p < 0.01
```
## Common Mistakes
### Mistake 1: Peeking and Stopping Early
**Problem:** checking results daily and stopping as soon as p < 0.05.

**Why it's wrong:** every peek is another chance for random noise to dip below 0.05, so the more often you check, the more likely you are to cross the threshold by chance alone. The simulation below shows how large the inflation gets.

**Solution:** calculate the required sample size before starting, and don't peek.
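Here is a minimal simulation of the problem (numpy/scipy, not expstats; the traffic numbers are made up for illustration). Both arms share the same true rate, so every "significant" stop is a false positive:

```python
# Simulation sketch: daily peeking inflates the false-positive rate.
# Both arms convert at the SAME 5% rate, so every "win" is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, daily_visitors, n_days, rate = 2000, 500, 20, 0.05

false_positives = 0
for _ in range(n_experiments):
    control = rng.binomial(daily_visitors, rate, n_days).cumsum()
    variant = rng.binomial(daily_visitors, rate, n_days).cumsum()
    visitors = daily_visitors * np.arange(1, n_days + 1)
    for day in range(n_days):  # "peek" at the end of each day
        pooled = (control[day] + variant[day]) / (2 * visitors[day])
        se = np.sqrt(pooled * (1 - pooled) * 2 / visitors[day])
        if se == 0:
            continue
        z = (variant[day] - control[day]) / visitors[day] / se
        if 2 * stats.norm.sf(abs(z)) < 0.05:  # stop at first "significant" peek
            false_positives += 1
            break

print(f"False-positive rate with daily peeking: {false_positives / n_experiments:.1%}")
```

With 20 peeks, this typically lands well above the nominal 5%.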
### Mistake 2: Ignoring Effect Size
**Problem:** a test shows p = 0.01 with a 0.1% lift.

**Why it's wrong:** statistical significance doesn't mean business significance; with enough traffic, even a trivially small effect becomes "significant".

**Solution:** always look at the confidence interval and ask whether the effect is worth implementing, as in the sketch below.
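One way to act on this is a shipping guardrail that requires both statistical and practical significance. `MIN_LIFT_PERCENT` below is a hypothetical, business-chosen number, not an expstats setting:

```python
# Hypothetical guardrail: require statistical AND practical significance.
from expstats import conversion

result = conversion.analyze(...)  # your experiment's counts go here

MIN_LIFT_PERCENT = 1.0  # smallest relative lift worth the engineering cost

if result.p_value < 0.05 and result.lift_percent >= MIN_LIFT_PERCENT:
    print("Ship it: significant and large enough to matter")
elif result.p_value < 0.05:
    print("Significant, but the lift may be too small to justify shipping")
else:
    print("Not statistically significant")
```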
### Mistake 3: Multiple Comparisons
**Problem:** running 20 tests and declaring the one with p = 0.04 a winner.

**Why it's wrong:** at a 0.05 threshold you expect roughly one false positive per 20 tests even when no variant does anything.

**Solution:** use a Bonferroni correction for multi-variant tests, as sketched below.
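A Bonferroni correction divides the significance threshold by the number of comparisons, so each individual test must clear a stricter bar. A sketch:

```python
# Bonferroni sketch: with k comparisons, test each at alpha / k.
alpha = 0.05
k = 20                       # number of tests run
adjusted_alpha = alpha / k   # 0.0025

p_value = 0.04               # the lone "winner" from the example above
print(f"Per-test threshold: {adjusted_alpha:.4f}")
if p_value < adjusted_alpha:
    print("Still significant after correction")
else:
    print("Not significant after correction")  # 0.04 > 0.0025: doesn't survive
```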
## P-Value vs. Confidence Interval
P-values tell you: "Is there a difference?"
Confidence intervals tell you: "How big is the difference?"
```python
result = conversion.analyze(...)

# P-value approach
if result.p_value < 0.05:
    print("Significant!")

# CI approach (more informative)
print(f"Lift: {result.lift_percent:+.1f}%")
print(f"CI: [{result.confidence_interval_lower:.4f}, {result.confidence_interval_upper:.4f}]")
```
> **Best practice**
>
> Report both the p-value AND the confidence interval. The CI tells stakeholders the likely range of the true effect.
## One-Tailed vs. Two-Tailed
expstats uses two-tailed tests by default, which is appropriate when you want to detect effects in either direction.
| Test Type | Detects | Use When |
|---|---|---|
| Two-tailed | Effects in either direction | Most A/B tests |
| One-tailed | Effects in only one direction | Rarely appropriate |
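For a z-test, the relationship between the two is simple: when the observed effect lies in the hypothesized direction, the one-tailed p-value is half the two-tailed one. A quick illustration with scipy (this is arithmetic about the test statistic, not an expstats option):

```python
# One-tailed vs. two-tailed p-values for the same z statistic (scipy).
from scipy.stats import norm

z = 3.10  # roughly the z from the earlier 5.0% vs 6.0% example
print(f"Two-tailed p: {2 * norm.sf(abs(z)):.4f}")  # ≈ 0.0019
print(f"One-tailed p: {norm.sf(z):.4f}")           # ≈ 0.0010
```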
## Summary
- **P-value** = the probability of seeing results at least as extreme as yours if there's no real difference
- **p < 0.05** is the conventional threshold for "significant"
- **Don't peek**: calculate sample size first
- **Look at CIs**: they're more informative than p-values alone
- **Consider business impact**: statistical significance ≠ business significance