Statistical Significance for PMs: The No-BS Guide

You don't need a statistics degree to run A/B tests. Here's what actually matters—and what you can safely ignore.

Statistical significance tells you whether your A/B test result is likely real or just random noise. A p-value of 0.05 means that if your “winning” variant actually changed nothing, you’d still see a result this strong about 5% of the time purely by chance. That’s it. Everything else is details.

You don’t need to understand the math. You need to understand the decision.

Key takeaways:

  • P-value = the odds you’d see this result if nothing were really going on
  • 95% confidence is a convention, not a law
  • Small sample + big claim = probably wrong
  • When in doubt, run longer

What You Actually Need to Know

The One Number: P-Value

Your A/B testing tool shows a p-value. Here’s how to read it:

P-Value         What It Means       Should You Ship?
< 0.01          Very likely real    Yes, confidently
0.01 - 0.05     Probably real       Yes, with monitoring
0.05 - 0.10     Might be real       Maybe, if low-risk
> 0.10          Probably noise      No, run longer

That’s the whole framework. Everything else is nuance.
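
If you’re curious where that number comes from, here’s a minimal Python sketch of a two-proportion z-test, which is roughly the calculation many A/B tools run under the hood. The counts are made up for illustration.

```python
# Minimal sketch: two-tailed z-test for the difference between two conversion rates.
# The counts are hypothetical; swap in your own from the test dashboard.
import math
from scipy.stats import norm

control_conversions, control_users = 480, 10_000
variant_conversions, variant_users = 540, 10_000

p_control = control_conversions / control_users
p_variant = variant_conversions / variant_users

# Pool the rates under the assumption that the variant changed nothing
p_pool = (control_conversions + variant_conversions) / (control_users + variant_users)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / control_users + 1 / variant_users))

z = (p_variant - p_control) / se
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-tailed

print(f"lift: {p_variant - p_control:+.2%}, p = {p_value:.3f}")  # ~+0.60%, p ≈ 0.054
```

In this made-up example the lift looks decent but p lands just above 0.05, which puts it in the “Maybe, if low-risk” row of the table above.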

The Other Number: Sample Size

P-values lie when sample sizes are small.

Rule of thumb: Don’t trust results until you have at least 1,000 users per variant. For conversion rates under 5%, you need even more—often 5,000+.

Why? Small samples have wild swings. You might see +30% on Day 1 and -10% by Day 7. The truth is usually somewhere boring in between.
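
Here’s a quick simulation of that swing, assuming both variants convert at exactly 5%, so the true lift is zero. The rates and sample sizes are arbitrary.

```python
# Both variants convert at exactly 5%, so any measured "lift" is pure noise.
# Watch the noise shrink as the sample grows.
import random

random.seed(42)
TRUE_RATE = 0.05

def measured_lift(n_per_variant: int) -> float:
    a = sum(random.random() < TRUE_RATE for _ in range(n_per_variant))
    b = sum(random.random() < TRUE_RATE for _ in range(n_per_variant))
    return (b - a) / max(a, 1)  # relative "lift" of B over A

for n in (100, 1_000, 10_000):
    lifts = ", ".join(f"{measured_lift(n):+.0%}" for _ in range(5))
    print(f"{n:>6} users/variant: {lifts}")
```

With 100 users per variant the “lift” can easily swing by tens of percent in either direction; by 10,000 it settles near zero, which is the boring truth.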

The Number Everyone Ignores: Effect Size

A “statistically significant” result can be practically meaningless.

Example:

  • Variant B beats Control by 0.3%
  • P-value: 0.02 (significant!)
  • Should you ship? Probably not.

Ask: Is this effect worth the engineering cost to maintain? A 0.3% lift might be real but not worth the complexity.

Minimum Detectable Effect (MDE): Before running a test, decide the smallest improvement worth caring about. 1%? 5%? 10%? If your result is smaller than your MDE, the answer is “run a different test.”
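
One way to make that call concrete is to look at the confidence interval around the lift, not just the p-value. Here’s a sketch using numbers like the example above (a 0.3-point lift that is statistically significant) against a 1% MDE; the sample sizes are hypothetical and the interval uses a normal approximation.

```python
# Is the lift real AND big enough? Compare the 95% interval around the lift to the MDE.
# Numbers are hypothetical: a 0.3-point lift measured on a large sample.
import math

mde = 0.01                       # smallest absolute lift worth shipping (1 point)
p_a, n_a = 0.050, 200_000        # control conversion rate, control users
p_b, n_b = 0.053, 200_000        # variant conversion rate, variant users

lift = p_b - p_a
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
low, high = lift - 1.96 * se, lift + 1.96 * se   # 1.96 = two-sided 95% z value

print(f"lift {lift:+.2%}, 95% CI [{low:+.2%}, {high:+.2%}]")
print("clears the MDE" if low >= mde else "probably real, but below the bar")
```

Here the whole interval sits above zero (the lift is almost certainly real) but well below the 1% MDE, so the honest answer is “real, but not worth shipping.”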

The Cheat Sheet

Print this. Tape it to your monitor.

BEFORE THE TEST:
□ What's my hypothesis?
□ What's my primary metric?
□ What's my MDE (minimum I care about)?
□ How many users do I need? (Use a calculator)
□ How long will that take?

DURING THE TEST:
□ DON'T peek daily and make decisions
□ DO check for bugs and data issues
□ DON'T stop early because it "looks good"
□ DO stop if there's a serious negative

AFTER THE TEST:
□ Is p-value < 0.05?
□ Is effect size > my MDE?
□ Does the result make sense?
□ Did I check secondary metrics?

Common Mistakes (And Fixes)

1. Peeking and Stopping Early

The mistake: Checking results daily. Seeing p=0.03 on Day 3. Shipping the winner.

Why it’s wrong: P-values fluctuate. Check 20 times during a test and there’s a good chance you’ll see p<0.05 at some point by pure luck, even if there’s no real effect. This is called the “peeking problem.”

The fix: Decide your sample size upfront. Don’t look until you hit it. Or use sequential testing methods that account for multiple looks.
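
If the peeking problem feels abstract, a simulation makes it painfully concrete. Both variants below are identical, yet stopping at the first p < 0.05 across 20 looks declares a “winner” far more often than 5% of the time. All parameters are arbitrary.

```python
# A/A test: both variants convert at 5%, so every "winner" is a false positive.
# Peeking after every batch and stopping at the first p < 0.05 inflates that rate badly.
import math
import random
from scipy.stats import norm

random.seed(7)

def p_value(conv_a, conv_b, n):
    # Two-proportion z-test with n users in each arm
    pool = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pool * (1 - pool) * 2 / n) or 1e-9
    z = (conv_b - conv_a) / (n * se)
    return 2 * (1 - norm.cdf(abs(z)))

def run_with_peeking(rate=0.05, batch=200, looks=20) -> bool:
    conv_a = conv_b = n = 0
    for _ in range(looks):
        conv_a += sum(random.random() < rate for _ in range(batch))
        conv_b += sum(random.random() < rate for _ in range(batch))
        n += batch
        if p_value(conv_a, conv_b, n) < 0.05:
            return True          # we'd ship a "winner" that isn't real
    return False

runs = 500
false_positive_rate = sum(run_with_peeking() for _ in range(runs)) / runs
print(f"false positive rate with 20 peeks: {false_positive_rate:.0%}")  # well above 5%
```

On typical runs the stop-early rate lands somewhere around 20-30%, several times the 5% error you thought you were accepting.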

2. Ignoring Practical Significance

The mistake: Shipping every “statistically significant” result.

Why it’s wrong: With enough users, you can detect trivially small effects. With a million users per variant, even a 0.1% improvement can reach statistical significance. But who cares?

The fix: Set your MDE before the test. Ignore results below it.

3. Running Too Many Variants

The mistake: Testing 5 variants at once because “more is better.”

Why it’s wrong: Each additional variant increases your chance of false positives. With 5 variants, you have ~20% chance of seeing a “significant” winner by pure luck.

The fix: Test 2-3 variants max. Run sequential tests instead of parallel.
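
The “~20%” figure is just compounding error rates: with four treatment variants each compared against control at a 0.05 threshold, and assuming independent comparisons, the chance that at least one comes up “significant” by luck is a one-liner.

```python
# Chance of at least one false positive across 4 independent comparisons at alpha = 0.05
alpha, comparisons = 0.05, 4
print(f"{1 - (1 - alpha) ** comparisons:.1%}")  # 18.5%
```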

4. Testing Low-Traffic Areas

The mistake: A/B testing your settings page that gets 50 visitors/day.

Why it’s wrong: You’ll never reach significance. You’ll wait months, then get an inconclusive result.

The fix: Only A/B test pages with enough traffic. For low-traffic areas, use qualitative research or just make a decision.

5. Changing the Test Midway

The mistake: Adding a new metric after seeing results. Extending the test because you didn’t like the outcome.

Why it’s wrong: This is p-hacking. You’re searching for significance instead of testing a hypothesis.

The fix: Document everything before the test. Stick to it.

When You Don’t Need Statistical Significance

Not every decision needs a p-value. Save your rigor for high-stakes choices.

Need stat sig:

  • Pricing changes
  • Core flow changes (checkout, signup)
  • Anything that affects revenue directly
  • Decisions that are hard to reverse

Don’t need stat sig:

  • Copy tweaks with low risk
  • Already-decided changes (just validate)
  • Qualitative improvements (faster, prettier)
  • Experiments with obvious winners (+50% is real, trust me)

The heuristic: If you’d be embarrassed to ship a regression, test rigorously. If it’s low-risk and reversible, move faster.

FAQ

What confidence level should I use?

95% (p<0.05) is the standard, but it’s arbitrary. For low-risk tests, 90% is fine. For pricing or core flows, consider 99%. Match rigor to risk.

How do I calculate sample size?

Use a calculator. Evan Miller’s is simple and free. Input your baseline conversion rate, your MDE, and desired power (80% is standard). It tells you users per variant.
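
If you want to sanity-check the calculator, or script this for your own dashboards, the standard normal-approximation formula is short. The inputs below are placeholders; the result should land close to what online calculators report for the same numbers.

```python
# Rough sample size per variant for comparing two conversion rates.
# baseline: current rate, mde: absolute lift you care about, 80% power, 95% confidence.
import math
from scipy.stats import norm

baseline, mde = 0.05, 0.01          # e.g. 5% -> 6%
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # two-tailed
z_power = norm.ppf(power)
p1, p2 = baseline, baseline + mde

n = ((z_alpha + z_power) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))) / (p2 - p1) ** 2
print(f"~{math.ceil(n):,} users per variant")  # a bit over 8,000 for these inputs
```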

My test has been running for 3 weeks and it’s still not significant. What do I do?

Three options:

  1. Run longer — If you’re close to significance and the effect is meaningful
  2. Call it inconclusive — If the effect is tiny and you have enough data
  3. Kill it — If there’s no effect after sufficient sample, there’s probably no effect

An inconclusive result is still a result. It tells you this lever doesn’t move the needle.

What’s the difference between one-tailed and two-tailed tests?

Two-tailed: Tests whether B is different from A (better OR worse)
One-tailed: Tests whether B is specifically better than A

Use two-tailed. Always. A one-tailed test makes borderline results look significant (it effectively halves the p-value), but it can never flag a regression. Not worth it.
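
To see why the one-tailed version is tempting, run the same hypothetical z statistic through both readings:

```python
# Same test statistic, two readings. For an effect in the hypothesized direction,
# the one-tailed p-value is half the two-tailed one, which is why borderline
# results suddenly look "significant".
from scipy.stats import norm

z = 1.8  # hypothetical z statistic from an A/B test
two_tailed = 2 * (1 - norm.cdf(abs(z)))
one_tailed = 1 - norm.cdf(z)       # only asks "is B better than A?"
print(f"two-tailed p = {two_tailed:.3f}, one-tailed p = {one_tailed:.3f}")
```

Same data, but only the one-tailed reading sneaks under 0.05, and it does so by refusing to even ask whether B might be worse.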

My tool says “95% confidence” but also shows p=0.08. What?

Some tools confuse confidence intervals with p-values. They’re related but different. Look for the p-value specifically. If the tool doesn’t show it, get a better tool.

The 5-Minute Stats Check

Before making any A/B test decision, ask:

  1. Is my sample size big enough? (1,000+ per variant minimum)
  2. Is my p-value low enough? (<0.05 for important decisions)
  3. Is my effect size big enough? (Above my MDE)
  4. Does it make sense? (Can I explain why this worked?)
  5. What do secondary metrics show? (Any red flags?)

If all five check out, ship with confidence. If any fail, dig deeper.
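
If you’d rather have this check living next to your results than taped to your monitor, it fits in a few lines. The thresholds below are just this article’s rules of thumb; tune them to your own risk tolerance.

```python
# The five-question check as code. Thresholds are rules of thumb, not laws.
def ready_to_ship(users_per_variant: int, p_value: float, effect: float, mde: float,
                  makes_sense: bool, secondary_metrics_ok: bool) -> bool:
    return (users_per_variant >= 1_000     # enough sample
            and p_value < 0.05             # low enough p-value
            and effect >= mde              # big enough to matter
            and makes_sense                # you can explain why it worked
            and secondary_metrics_ok)      # no red flags elsewhere

print(ready_to_ship(12_000, 0.02, 0.015, 0.01, True, True))  # True -> ship it
```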

One More Thing

The biggest stats mistake isn’t technical. It’s political.

Teams run tests, get inconclusive results, then ship the variant anyway because someone important wanted it. That’s not experimentation. That’s theater.

If you’re going to ignore results, stop wasting time testing. Make decisions. Move fast. Test things that actually matter.

Statistics can’t tell you what to build. It can only tell you if what you built is working. The hard part is asking the right questions.


Want to run more experiments without the stats headache? Clayva handles the math automatically—you focus on the decisions.

Ready to close the loop?

Ship with Claude Code. Understand with Clayva. Iterate forever.