pyexpstats — A/B Testing Done Right, in Python
Every product team runs A/B tests. Very few do the statistics properly.
I’ve seen it happen over and over. Someone changes a button colour, splits traffic 50/50, waits a few days, and then declares victory because the conversion rate “went up.” No power analysis. No check for sample ratio mismatch. No idea if they had enough traffic to detect the effect in the first place. And definitely no one thinking about whether that initial lift is just a novelty effect that’ll fade in two weeks.
The tools don’t help either. Enterprise platforms like Optimizely and VWO cost a fortune. Open-source alternatives are fragmented — one library for sample size calculation, another for significance testing, a third for Bayesian analysis, and none of them talk to each other. And if your PM asks “so what does this mean in revenue?”, good luck translating a p-value into rupees.
I got tired of stitching together scripts every time I needed to analyze an experiment. So I built pyexpstats — a comprehensive Python library (and web calculator) that handles the entire A/B testing lifecycle.
Planning. Analysis. Diagnostics. Business impact. All in one place.
pip install pyexpstats
What Does It Actually Do?
Let me walk you through the full picture, because this isn’t just another significance calculator.
1. Plan Your Test Properly
Before you run anything, you need to know: how many visitors do I actually need?
Most teams skip this step. They just start the test and “wait until it looks significant.” That’s how you end up with underpowered tests that can’t detect real effects, or overpowered tests that waste weeks of traffic.
from pyexpstats.effects.outcome import conversion
plan = conversion.sample_size(
current_rate=0.05, # 5% baseline conversion
lift_percent=10, # want to detect a 10% relative lift
confidence=95, # 95% confidence level
power=80 # 80% statistical power
)
print(f"You need {plan['visitors_per_variant']:,} visitors per variant")
print(f"Total visitors: {plan['total_visitors']:,}")
This tells you upfront — “you need 30,000 visitors per variant to detect a 10% lift.” If your site only gets 2,000 visitors a day, now you know this test will take at least a month. No more guessing.
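Under the hood, numbers like this come from the standard two-proportion sample size formula. Here is a self-contained sketch in plain Python, using only the standard library; pyexpstats's exact internals may differ:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(p1: float, lift_percent: float,
                            confidence: float = 95, power: float = 80) -> int:
    """Classic two-proportion sample size for a two-sided test."""
    p2 = p1 * (1 + lift_percent / 100)          # rate we hope to detect
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence / 100) / 2)
    z_beta = NormalDist().inv_cdf(power / 100)
    p_bar = (p1 + p2) / 2                        # pooled rate under H0
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

print(sample_size_per_variant(0.05, 10))  # roughly 31,000 per variant
```

Running it for a 5% baseline and a 10% relative lift lands right around the "30,000 per variant" ballpark quoted above.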
It works for continuous metrics too — revenue, average order value, time on site:
from pyexpstats.effects.outcome import magnitude
plan = magnitude.sample_size(
current_mean=75.0, # Rs 75 average order value
current_std=25.0, # standard deviation
lift_percent=5, # detect 5% revenue lift
confidence=95,
power=80
)
And even for time-to-event analysis — like testing whether a new onboarding flow gets users to their first purchase faster:
from pyexpstats.effects.outcome import timing
plan = timing.sample_size(
control_median=14.0, # control: 14 days to first purchase
treatment_median=12.0, # hoping for 12 days
confidence=95,
power=80
)
2. Analyze Results With the Right Test
This is where most people start — and where most mistakes happen.
For conversion rates (did they click? did they buy?), pyexpstats uses a Z-test:
result = conversion.analyze(
control_visitors=10000,
control_conversions=500, # 5.0% conversion rate
variant_visitors=10000,
variant_conversions=550, # 5.5% conversion rate
confidence=95
)
print(f"Lift: {result['relative_lift']:.1f}%")
print(f"P-value: {result['p_value']:.4f}")
print(f"Significant: {result['significant']}")
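For intuition, here is the textbook pooled two-proportion z-test in plain Python: a standalone sketch using only the standard library, not pyexpstats's internals:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(n1: int, x1: int, n2: int, x2: int):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)               # rate under H0: no difference
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = two_proportion_z_test(10000, 500, 10000, 550)
print(f"z = {z:.3f}, p = {p:.4f}")
```

With these counts the two-sided p-value lands near 0.11, a useful reminder of the planning section's point: confirming a 0.5-point lift takes a lot more than 10,000 visitors per arm.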
For revenue and continuous metrics, it uses Welch’s t-test (which correctly handles unequal variances — something a lot of tools get wrong):
result = magnitude.analyze(
control_visitors=5000,
control_mean=75.0,
control_std=25.0,
variant_visitors=5000,
variant_mean=79.50,
variant_std=26.0,
confidence=95
)
Running more than two variants? A/B/n testing with proper Bonferroni correction:
variants = [
{"name": "Control", "visitors": 10000, "conversions": 500},
{"name": "Variant A", "visitors": 10000, "conversions": 540},
{"name": "Variant B", "visitors": 10000, "conversions": 580},
]
result = conversion.analyze_multi(variants, confidence=95)
It automatically adjusts for multiple comparisons. No inflated false positives.
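The Bonferroni correction itself is simple to state: with m comparisons against control, each one is tested at α/m. A minimal standalone illustration:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Bonferroni: test each comparison at alpha / number of comparisons."""
    adjusted_alpha = alpha / len(p_values)
    return [p < adjusted_alpha for p in p_values]

# Two variants vs control -> two comparisons, each tested at 0.025
print(bonferroni_significant([0.030, 0.004]))  # [False, True]
```

A raw p of 0.030 would have looked significant in a single test, but with two comparisons it no longer clears the adjusted bar.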
3. Don’t Just Test — Diagnose
This is the part I’m most proud of. Because getting a significant result doesn’t mean your test is valid.
Sample Ratio Mismatch (SRM) is one of the most common experiment bugs, and almost nobody checks for it. If you intended a 50/50 split but got 52/48, something is wrong — maybe a redirect is failing, maybe bots are hitting one variant more, maybe there’s a caching issue. Pyexpstats catches it:
from pyexpstats.diagnostics import srm
check = srm.check_sample_ratio(
control_visitors=10200,
variant_visitors=9800
)
print(f"Status: {check['status']}") # "warning" or "ok"
print(f"P-value: {check['p_value']:.4f}")
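An SRM check is, at its core, a chi-square goodness-of-fit test against the intended split. A standard-library sketch (the `expected_ratio` parameter name is my own, not pyexpstats's):

```python
from math import erfc, sqrt

def srm_p_value(control: int, variant: int, expected_ratio: float = 0.5) -> float:
    """Chi-square goodness-of-fit test (1 df) for sample ratio mismatch."""
    total = control + variant
    expected_control = total * expected_ratio
    expected_variant = total * (1 - expected_ratio)
    chi2 = ((control - expected_control) ** 2 / expected_control
            + (variant - expected_variant) ** 2 / expected_variant)
    return erfc(sqrt(chi2 / 2))  # survival function of chi-square with 1 df

p = srm_p_value(10200, 9800)
print(f"SRM p-value: {p:.5f}")  # well under 0.01: this split is suspicious
```

A 10200/9800 split sounds close to 50/50, but at this volume the p-value comes out below 0.005, which is exactly why the check matters.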
Novelty Effect Detection tells you if your treatment effect is fading over time. Maybe users liked the new UI initially, but the lift is shrinking every week. That’s a novelty effect — and if you don’t catch it, you’ll ship something that won’t deliver long-term:
from pyexpstats.diagnostics import novelty
daily_results = [
{"day": 1, "control_rate": 0.050, "variant_rate": 0.060},
{"day": 2, "control_rate": 0.050, "variant_rate": 0.058},
# ... more daily data
{"day": 14, "control_rate": 0.050, "variant_rate": 0.052},
]
check = novelty.detect_novelty_effect(daily_results)
print(f"Effect type: {check['effect_type']}") # "novelty", "primacy", or "stable"
The Health Dashboard wraps all diagnostics into one comprehensive check:
from pyexpstats.diagnostics import health
report = health.check_health(
control_visitors=10000,
control_conversions=500,
variant_visitors=10000,
variant_conversions=550,
test_duration_days=14,
daily_visitors=2000
)
print(f"Health Score: {report['health_score']}/100")
print(f"Can Trust Results: {report['can_trust_results']}")
print(f"Issues: {report['primary_issues']}")
It checks sample ratio, minimum sample size, test duration (did you run for at least one full business cycle?), statistical power, peeking risk, and effect stability. All in one call.
Beyond P-Values: Bayesian and Sequential Testing
Here’s where pyexpstats really separates itself from the basic calculators.
Bayesian A/B Testing
Traditional p-values are confusing. “The probability of observing data at least this extreme, assuming the null hypothesis is true” — try explaining that to your product manager at the Monday standup.
Bayesian analysis gives you something much more intuitive: “there is a 94% probability that the variant is better than control.”
from pyexpstats.methods import bayesian
result = bayesian.analyze(
control_visitors=10000,
control_conversions=500,
variant_visitors=10000,
variant_conversions=550
)
print(f"P(Variant > Control): {result['prob_variant_better']:.1%}")
print(f"Expected Loss (choosing variant): {result['expected_loss_variant']:.4f}")
No fixed sample size requirement. You can check results at any time without invalidating the posterior, though a decision rule built on constant peeking still deserves some care. And the “expected loss” metric tells you the expected cost of making the wrong decision.
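The classic formulation behind this puts a Beta(1, 1) prior on each conversion rate and compares draws from the two posteriors. A Monte Carlo sketch with the standard library (pyexpstats may well compute this in closed form instead):

```python
import random

def prob_variant_better(c_n, c_conv, v_n, v_conv, draws=100_000, seed=42):
    """Monte Carlo estimate of P(variant rate > control rate) under
    independent Beta(1, 1) priors on each conversion rate."""
    rng = random.Random(seed)  # seeded for reproducibility
    wins = 0
    for _ in range(draws):
        control = rng.betavariate(1 + c_conv, 1 + c_n - c_conv)
        variant = rng.betavariate(1 + v_conv, 1 + v_n - v_conv)
        if variant > control:
            wins += 1
    return wins / draws

print(f"{prob_variant_better(10000, 500, 10000, 550):.1%}")  # about 94%
```

With 500/10,000 vs 550/10,000, the estimate comes out near 94%, the same kind of statement quoted above and far easier to act on than a p-value.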
Sequential Testing (Early Stopping)
Sometimes the effect is so large that you don’t need to wait for the full planned sample size. Sequential testing lets you stop early for clear winners — without inflating your false positive rate:
from pyexpstats.methods import sequential
result = sequential.analyze(
control_visitors=5000,
control_conversions=250,
variant_visitors=5000,
variant_conversions=300,
expected_visitors_per_variant=15000
)
print(f"Decision: {result['decision']}")
# "variant_wins", "control_wins", "no_difference", or "keep_running"
It uses O’Brien-Fleming spending boundaries by default — the gold standard for group sequential methods. Conservative early on, more liberal as you approach your planned sample size.
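For intuition, a widely used approximation of the O'Brien-Fleming boundary sets the critical z at information fraction t (the share of the planned sample seen so far) to the final critical value divided by √t. This sketch is purely illustrative; pyexpstats's alpha-spending implementation may differ:

```python
from statistics import NormalDist

def obrien_fleming_boundary(information_fraction: float, alpha: float = 0.05) -> float:
    """Approximate O'Brien-Fleming critical |z| at a given information fraction."""
    z_final = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 at alpha = 0.05
    return z_final / information_fraction ** 0.5

for t in (0.25, 0.5, 1.0):
    print(f"{t:.0%} of planned sample: stop only if |z| > "
          f"{obrien_fleming_boundary(t):.2f}")
```

At a quarter of the planned sample you need roughly |z| > 3.9 to stop, which is exactly the "conservative early, liberal late" behavior described above.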
The Part Your PM Actually Cares About: Business Impact
Your test showed a 10% lift. Great. But what does that mean in revenue?
from pyexpstats.business import impact
projection = impact.project_impact(
control_rate=0.05,
variant_rate=0.055,
lift_percent=10.0,
lift_ci_lower=-1.0,
lift_ci_upper=21.0,
monthly_visitors=100000,
revenue_per_conversion=500
)
print(f"Monthly revenue lift: Rs {projection['monthly_revenue_lift']:,.0f}")
print(f"Annual revenue lift: Rs {projection['annual_revenue_lift']:,.0f}")
print(f"Revenue range: Rs {projection['annual_revenue_range'][0]:,.0f} to Rs {projection['annual_revenue_range'][1]:,.0f}")
This takes your test results — including the confidence interval — and projects them into actual money. Monthly lift, annual lift, with uncertainty ranges. Something you can put in a slide deck and present to leadership.
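The core arithmetic is transparent: extra conversions per month times revenue per conversion. A minimal sketch of the point estimate (the uncertainty range works the same way, applied to the confidence interval endpoints):

```python
def revenue_lift(control_rate, variant_rate, monthly_visitors, revenue_per_conversion):
    """Point estimate of monthly and annual revenue lift."""
    extra_conversions = monthly_visitors * (variant_rate - control_rate)
    monthly = extra_conversions * revenue_per_conversion
    return monthly, monthly * 12

monthly, annual = revenue_lift(0.05, 0.055, 100_000, 500)
print(f"Rs {monthly:,.0f}/month, Rs {annual:,.0f}/year")  # Rs 250,000 and Rs 3,000,000
```

Half a percentage point of conversion at 100,000 visitors is 500 extra orders a month, which is the kind of concrete number a slide deck needs.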
Guardrail Metrics
You’re testing a new checkout flow and conversions are up. But what if page load time increased? Or bounce rate spiked? Guardrail metrics let you monitor the things you don’t want to break:
from pyexpstats.business import guardrails
result = guardrails.check_guardrails([
{
"name": "Page Load Time",
"metric_type": "continuous",
"direction": "increase", # increase is bad
"threshold": 10, # alert if >10% worse
"control_mean": 2.5,
"variant_mean": 2.6,
"control_std": 0.5,
"variant_std": 0.5,
"control_n": 10000,
"variant_n": 10000
},
{
"name": "Bounce Rate",
"metric_type": "proportion",
"direction": "increase",
"threshold": 5,
"control_rate": 0.30,
"variant_rate": 0.31,
"control_n": 10000,
"variant_n": 10000
}
])
print(f"Can ship: {result['can_ship']}")
One call. All your guardrails checked. Clear ship/no-ship recommendation.
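At its simplest, a guardrail is a thresholded relative change in the bad direction. The real check also tests statistical significance; this sketch, with hypothetical names of my own, shows only the threshold logic:

```python
def guardrail_breached(control_value: float, variant_value: float,
                       threshold_percent: float, bad_direction: str = "increase") -> bool:
    """Flag a guardrail if the metric moved past the threshold in the bad direction."""
    change_percent = (variant_value - control_value) / control_value * 100
    if bad_direction == "increase":
        return change_percent > threshold_percent
    return change_percent < -threshold_percent

print(guardrail_breached(2.5, 2.6, 10))   # load time 4% worse, under 10%: False
print(guardrail_breached(0.30, 0.33, 5))  # bounce rate 10% worse, over 5%: True
```

The second guardrail trips even though conversions went up, which is precisely the trade-off you want surfaced before shipping.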
Segment Analysis That Doesn’t Lie to You
“The overall test isn’t significant, but if you look at mobile users in the 25-34 age group, it’s a 30% lift!”
We’ve all heard this. And it’s almost always wrong. When you slice data into enough segments, you’ll find “significant” results by pure chance. It’s the multiple comparisons problem.
Pyexpstats handles this properly:
from pyexpstats.segments import analysis
segments_data = [
{
"name": "Mobile",
"control_visitors": 6000, "control_conversions": 300,
"variant_visitors": 6000, "variant_conversions": 360,
},
{
"name": "Desktop",
"control_visitors": 4000, "control_conversions": 200,
"variant_visitors": 4000, "variant_conversions": 190,
},
]
result = analysis.analyze_segments(
segments_data,
correction_method="holm" # Holm-Bonferroni correction
)
It applies Bonferroni or Holm correction to control the family-wise error rate across segments. And it even detects Simpson’s Paradox — the bizarre situation where the overall result says A wins, but every individual segment says B wins. If that happens, your segmentation is confounded and the overall result is misleading.
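Holm's step-down procedure is easy to state: sort the p-values, compare the smallest to α/m, the next to α/(m−1), and so on, stopping at the first failure. A standalone sketch:

```python
def holm_significant(p_values, alpha=0.05):
    """Holm-Bonferroni step-down, controlling the family-wise error rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    significant = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            significant[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return significant

# Mobile looks strong, Desktop does not
print(holm_significant([0.004, 0.60]))  # [True, False]
```

Holm is uniformly at least as powerful as plain Bonferroni while giving the same family-wise guarantee, which is why it is a sensible default for segment slicing.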
You Don’t Even Need to Write Code
Not everyone on the team writes Python. That’s why pyexpstats also ships with a web calculator — a clean React frontend that covers all the major features.
It’s live right now at pyexpstats.vercel.app. No installation, no signup. Just open the link and start analyzing.
The web interface covers:
Sample size planning
Significance testing (Classic, Bayesian, and Sequential)
Confidence intervals
Diagnostics dashboard
Segment analysis
Revenue impact projections
Difference-in-Differences analysis
Time-to-event analysis
For teams where the data scientist sets up the test but the PM needs to check results — this is perfect. Share the link, walk them through the inputs, and they can run analyses themselves.
Auto-Generated Reports
One more thing that I find incredibly useful in practice. Pyexpstats can generate plain-language markdown reports from any analysis:
report = conversion.summarize(result, test_name="Checkout Flow Redesign")
print(report)
Output:
## A/B Test Report: Checkout Flow Redesign
**Result: Variant wins** (statistically significant)
- Control conversion rate: 5.00%
- Variant conversion rate: 5.50%
- Relative lift: +10.0%
- P-value: 0.0234
- Confidence: 95%
No more manually writing up results. Run the analysis, generate the report, paste it into Slack or Notion or your experiment log. Done.
The Design Philosophy
A few things I was intentional about:
Cover the full lifecycle. Most tools do one thing — either planning or analysis or impact projection. Pyexpstats does all of it. Plan your test, run the analysis, check for problems, translate to business impact, generate the report. One library, one import.
Business-first language. The library internally uses proper statistical terminology, but every user-facing output is translated into language that product and marketing teams can understand. “94% chance the variant is better” instead of “p = 0.06 under the null hypothesis.”
Multiple valid methodologies. Frequentist for the traditional hypothesis testing crowd. Bayesian for teams that want probability-based decisions. Sequential for when you need early stopping. All available, all properly implemented, no judgment about which one you choose.
Built-in diagnostic safety nets. SRM detection, novelty effect checks, power analysis, minimum duration validation. These aren’t optional nice-to-haves — they’re the checks that prevent you from shipping a change based on a broken experiment.
Getting Started
Install via pip
pip install pyexpstats
Run your first analysis
from pyexpstats.effects.outcome import conversion
# Did the new landing page convert better?
result = conversion.analyze(
control_visitors=10000,
control_conversions=500,
variant_visitors=10000,
variant_conversions=550,
confidence=95
)
print(f"Significant: {result['significant']}")
print(f"Lift: {result['relative_lift']:.1f}%")
print(f"P-value: {result['p_value']:.4f}")
Or just use the web calculator
Head to pyexpstats.vercel.app — no installation needed.
Explore the examples
The repo includes 6 Jupyter notebooks that walk you through everything:
Quickstart — end-to-end workflow in 5 minutes
Planning — sample size, MDE, duration recommendations
Analysis — multi-variant tests, confidence intervals
Advanced Methods — sequential testing, Bayesian analysis
Diagnostics — SRM, health checks, novelty detection
Business Impact — revenue projections, guardrails
Who Is This For?
Data Scientists / Analysts who run experiments and want one library instead of five
Product Managers who need to understand test results without a statistics degree (use the web calculator)
Growth Teams who run high-velocity experiments and need proper statistical controls
Marketing Teams testing landing pages, email campaigns, ad creatives
E-commerce Teams optimizing checkout flows, pricing, product recommendations
SaaS Teams running experiments on trial conversion, onboarding, churn
If you’re running A/B tests on any kind of digital product, pyexpstats has something for you.
Open Source and Free
pyexpstats is MIT licensed — free for personal and commercial use. The entire codebase is on GitHub with 367+ tests across 13 test files.
GitHub: github.com/pyexpstats/pyexpstats
PyPI: pypi.org/project/pyexpstats
Web Calculator: pyexpstats.vercel.app
We’re also preparing a submission to the Journal of Open Source Software (JOSS) — because we believe open-source experimentation tools deserve the same rigour as the commercial ones.
Star the repo if you find it useful. Open an issue if something breaks. And if you’ve got ideas for new features, PRs are always welcome.
Try it out: pip install pyexpstats
pyexpstats is open source under the MIT license. Built for teams who believe experimentation should be rigorous, accessible, and free.