Title: Best Practices
Locale: en
URL: https://sensorswave.com/en/docs/experiments/best-practices/
Description: Summary of experiment design and execution best practices

This article summarizes best practices for A/B experiments from design through implementation to analysis, helping you avoid common mistakes and achieve reliable experiment conclusions.

## Experiment Design Principles

### 1. Test One Hypothesis at a Time

**Principle**: Single-variable principle — ensure experiment results are attributable.

**Correct approach**:

```
Experiment: Test the impact of button color on click-through rate
Variable: Only change button color (blue vs red)
Keep everything else the same: copy, size, position
```

**Wrong approach**:

```
Experiment: Optimize add-to-cart button
Change simultaneously: color, copy, size, position
Result: Cannot determine which factor caused the change
```

### 2. Ensure Sufficient Sample Size

**Principle**: At least 1,000 users per group, preferably 5,000+.

**Recommended**:

- Calculate sample size in advance
- Set a minimum Duration (at least 1 week)
- Extend the Duration if the sample size is insufficient

**Avoid**:

- Ending experiments too early (sample size < 500)
- Immediately rolling out when the Test Group appears to lead
- Skipping sample size calculation

### 3. Run Complete Experiment Cycles

**Principle**: Run for at least one full week (7 days), covering weekdays and weekends.

**Considerations**:

- Weekday vs weekend: User behavior may differ
- Holiday impact: Avoid major holidays or extend the Duration
- Marketing campaigns: Avoid running experiments during promotions

**Recommended Duration**:

- Standard experiments: 2–4 weeks
- High-traffic products: 1–2 weeks
- Low-traffic products: 4–8 weeks

### 4. Set Guardrail Metrics

**Principle**: Prevent optimizing one Metric at the expense of other important Metrics.
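The guardrail principle can be sketched as a pre-rollout decision check. The sketch below is illustrative only: the function name, metric names, lift values, and the -2% tolerance threshold are all hypothetical, not part of any SDK.

```python
# Sketch of a guardrail check before rolling out a winning variant.
# Metric names, lift values, and thresholds are hypothetical examples.

def can_roll_out(primary_lift: float, primary_p: float,
                 guardrails: dict[str, float],
                 max_guardrail_drop: float = -0.02) -> bool:
    """Roll out only if the primary Metric wins significantly AND
    no Guardrail Metric drops more than the allowed threshold."""
    primary_wins = primary_lift > 0 and primary_p < 0.05
    guardrails_ok = all(lift >= max_guardrail_drop
                        for lift in guardrails.values())
    return primary_wins and guardrails_ok

# CTR is up 15% and significant, but total revenue dropped 10%:
decision = can_roll_out(
    primary_lift=0.15, primary_p=0.001,
    guardrails={"total_revenue": -0.10, "page_load_time": 0.0},
)
print(decision)  # False: the revenue guardrail blocks the rollout
```

Encoding the rule as a function makes the rollout criterion explicit and reviewable, rather than leaving it to case-by-case judgment.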
**Guardrail Metric examples**:

| Experiment Type | Guardrail Metrics |
|---------|---------|
| UI redesign | Page load time, error rate, user satisfaction |
| Recommendation algorithm | User retention, total revenue, content diversity |
| Pricing strategy | Customer lifetime value (LTV), churn rate, brand perception |
| Checkout flow | Payment success rate, total revenue, user complaint rate |

---

## Naming Conventions

### Experiment Naming

**Format**: `module_feature_purpose`

**Recommended**:

- `cart_button_color_test`
- `checkout_flow_optimization`
- `recommendation_algorithm_comparison`
- `pricing_strategy_test`

**Avoid**:

- `test1`, `experiment_new` (not semantic enough)
- Non-English naming
- `cart-button-test` (uses hyphens; use underscores instead)

### Variant Naming

**Recommended**:

- Control: `control`
- Single Test Group: `treatment`
- Multiple Test Groups: `treatment_a`, `treatment_b`

**Avoid**:

- `v1`, `v2` (not semantic enough)
- `old`, `new` (easily confused)

### Variable Naming

**Use snake_case**:

- `button_color`
- `checkout_steps`
- `recommendation_count`

**Avoid**:

- `color` (too generic)
- `buttonColor` (camelCase)

---

## Traffic Management

### 1. Prioritize Traffic for Important Experiments

**Allocation example**:

```
Total traffic: 100%
├─ P0 experiment (Checkout flow optimization): 40%
├─ P1 experiment (Recommendation algorithm test): 30%
├─ P2 experiment (Copy optimization): 10%
└─ Reserved traffic (future experiments): 20%
```

**Principles**:

- Higher-priority experiments get more traffic
- Reserve 10–20% of traffic for urgent experiments
- Core flow experiments take priority

### 2. Avoid Running Conflicting Experiments Simultaneously

**Conflict types**:

**Feature conflicts**:

- ❌ Experiment A: Optimize checkout flow
- ❌ Experiment B: Optimize checkout page layout
- ✅ Solution: Run Experiment A first, then Experiment B after it ends

**Metric conflicts**:

- ❌ Experiment A: Improve add-to-cart rate (primary Metric)
- ❌ Experiment B: Improve detail page dwell time (secondary Metric: add-to-cart rate)
- ✅ Solution: Run separately or use different user groups

### 3. Reserve Traffic for Future Experiments

**Recommendations**:

- High-traffic products: Reserve 15–20%
- Medium-traffic products: Reserve 10–15%
- Low-traffic products: Reserve 5–10%

---

## Data Quality

### 1. Verify Exposure Logs Are Being Reported

**Verification steps**: Immediately after release, check exposure logs:

```
Event: $ABImpress
Filter: experiment_key = 'your_experiment_key'
Time range: Last 1 hour
```

**Verification checklist**:

- ✅ Exposure Events are being reported normally
- ✅ Each user has only one exposure log
- ✅ Traffic distribution is even (deviation < 5%)

### 2. Monitor Data Anomalies

**Anomaly types**:

| Anomaly | Possible Cause | Resolution |
|------|---------|---------|
| Exposure count drops suddenly | Code error, experiment paused | Check code and experiment status |
| Uneven traffic distribution | Targeting Rules too restrictive | Check targeting conditions |
| Abnormally high conversion rate | Data error, fraud | Check data quality |
| Abnormally low conversion rate | Bug, UX issue | Pause experiment, investigate |

### 3. Regularly Check Split Uniformity

**Frequency**: Check daily for the first 3 days of the experiment.
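This daily check can be automated with a chi-square goodness-of-fit test on the per-variant user counts (a sample-ratio mismatch, or SRM, check). The sketch below assumes a two-variant 50/50 Allocation; the function name and the counts are hypothetical, and only the Python standard library is used.

```python
# Sample-ratio mismatch (SRM) check for a two-variant 50/50 split.
# Counts are hypothetical; in practice they come from your assignment data.
import math

def srm_p_value(control_users: int, treatment_users: int) -> float:
    """Chi-square goodness-of-fit test (1 degree of freedom) against
    an expected 50/50 split. Returns the p-value."""
    total = control_users + treatment_users
    expected = total / 2
    chi2 = ((control_users - expected) ** 2 / expected
            + (treatment_users - expected) ** 2 / expected)
    # For 1 degree of freedom, the chi-square survival function
    # (p-value) equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# A 5,050 vs 4,950 split is normal variation (p ≈ 0.32):
print(srm_p_value(5_050, 4_950) < 0.001)  # False
# A 5,500 vs 4,500 split signals a real assignment problem:
print(srm_p_value(5_500, 4_500) < 0.001)  # True
```

A common practice is to alert only when p < 0.001, so that routine daily checks do not produce false alarms on ordinary random fluctuation.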
**Method**:

```sql
SELECT
  variant,
  COUNT(DISTINCT user_id) AS user_count,
  COUNT(DISTINCT user_id) * 1.0
    / SUM(COUNT(DISTINCT user_id)) OVER () AS percentage
FROM experiment_assignments
WHERE experiment_key = 'your_experiment_key'
GROUP BY variant
```

**Expected result** (50/50 Allocation):

| Variant | Users | Percentage |
|------|--------|------|
| control | ~5,000 | ~50% |
| treatment | ~5,000 | ~50% |

---

## Decision Standards

### 1. Wait for Sufficient Sample Size

**Do not draw conclusions too early**:

```
❌ Wrong:
Day 1: Test Group leads by 10%, immediately roll out
Result: Long-term effect is poor, resources wasted

✅ Correct:
Wait 2 weeks, reach sample size (5,000 users per group)
Confirm statistical significance (p < 0.05)
Observe all Metrics holistically
Make decision
```

### 2. Rely on Statistical Significance

**Criteria**:

```
p < 0.05: Result is Significant, can roll out
p ≥ 0.05: Result is not Significant, need more data or abandon Hypothesis
```

**Avoid misinterpretation**:

```
❌ Wrong:
Conversion rate lift 5%, p = 0.12
Conclusion: Experiment succeeded, roll out

✅ Correct:
Conversion rate lift 5%, p = 0.12 (not Significant)
Conclusion: Lift may be due to chance, need to extend experiment or abandon Hypothesis
```

### 3. Consider the Full Picture

**Observe all Metrics**:

```
✅ Recommended:
Primary Metric: CTR +15% (p = 0.001)
Secondary Metric: Conversion rate +8% (p = 0.02)
Guardrail Metric: Total revenue +12% (p = 0.005)
Conclusion: Comprehensive Success, roll out

❌ Avoid:
Primary Metric: CTR +15% (p = 0.001)
Secondary Metric: Conversion rate -5% (p = 0.03)
Guardrail Metric: Total revenue -10% (p = 0.01)
Wrong conclusion: Experiment succeeded (only looking at primary Metric)
Correct conclusion: Experiment Failed (CTR improved but conversion and revenue declined)
```

### 4. Document the Decision Process

**Build experiment archives**: Every experiment should have complete records:

- Experiment Hypothesis
- Experiment design
- Key data
- Decision rationale
- Follow-up actions

**Purposes**:

- Team review and learning
- Avoid repeating experiments
- Accumulate experiment experience

---

## Combining with Feature Gates

### Feature Gate First, Then A/B Experiment

**Recommended workflow**:

```
Phase 1: Feature Gate for stability validation (1–2 weeks)
- Validate technical Metrics: error rate, response time, crash rate
- Gradual rollout: 1% → 10% → 50% → 100%

Phase 2: A/B Experiment for effectiveness validation (2–4 weeks)
- Validate business Metrics: conversion rate, retention rate, revenue
- Random grouping: 50% vs 50%
- Statistical significance testing

Phase 3: Full release
- Apply the winning solution
- Clean up code
```

**Advantages**:

- Technical risk is controlled
- Data-driven decisions
- Quick rollback capability

See [Feature Gates vs A/B Testing](../feature-gates/gates-vs-experiments.mdx) for details.

---

## Common Mistakes and Solutions

### Mistake 1: Insufficient Sample Size

**Problem**: The experiment runs for 3 days with 200 users per group, and conclusions are drawn.

**Consequence**: Unreliable results, easily influenced by random factors.

**Solution**:

- Calculate sample size in advance
- Set a minimum Duration (at least 1 week)
- Analyze only after reaching the sample size

### Mistake 2: Ignoring Statistical Significance

**Problem**: Only looking at lift, ignoring the P-Value.

**Example**:

```
Conversion rate lift 5% (p = 0.15)
Wrong conclusion: Experiment succeeded
Correct conclusion: Result is not Significant, may be random fluctuation
```

**Solution**:

- Consider both lift and the P-Value
- Only consider a result Significant when p < 0.05

### Mistake 3: Changing Multiple Variables at Once

**Problem**: Simultaneously changing button color, copy, and size.

**Consequence**: Cannot determine which factor caused the change.
**Solution**:

- Test one Hypothesis at a time
- If multiple variables must change, design a multi-arm experiment

### Mistake 4: Stopping Experiments Too Early

**Problem**: Seeing the Test Group lead and immediately stopping to roll out.

**Consequences**:

- Short-term fluctuations may not represent long-term effects
- The novelty effect fades and Metrics may regress

**Solution**:

- Wait until the predetermined sample size is reached
- Run for a complete cycle
- Monitor long-term Metrics (retention, LTV)

### Mistake 5: Ignoring Guardrail Metrics

**Problem**: Only focusing on the primary Metric, ignoring others.

**Example**:

```
Primary Metric: CTR +20% (Success)
Guardrail Metric: Total revenue -15% (severe decline)
Wrong decision: Roll out
Correct decision: Abandon (CTR improved but revenue declined)
```

**Solution**:

- Set Guardrail Metrics
- Observe all Metrics holistically
- Avoid "vanity Metrics"

---

## Experiment Checklist

Use this checklist before releasing an experiment:

### Experiment Design

- [ ] Hypothesis is clear and testable
- [ ] Only one variable being tested
- [ ] Sample size calculated correctly
- [ ] Duration is sufficient (≥ 1 week)
- [ ] Allocation is reasonable (total = 100%)

### Metric Selection

- [ ] Primary Metric aligns with business goals
- [ ] Secondary Metrics are set
- [ ] Guardrail Metrics are set
- [ ] All Metrics can be accurately measured

### Configuration Check

- [ ] Experiment Key is correct and unique
- [ ] Variant naming follows conventions
- [ ] Dynamic variable types are consistent
- [ ] Targeting Rules are configured correctly

### Code Integration

- [ ] SDK has A/B testing enabled
- [ ] Code integration is correct
- [ ] Verified in test environment
- [ ] Error handling is thorough

### Monitoring Preparation

- [ ] Data monitoring is set up
- [ ] Rollback plan is prepared
- [ ] Team members understand the experiment

---

## Related Documentation

- [Experiment Design](experiment-design.mdx): Learn scientific experiment design methods
- [Use Cases](use-cases.mdx): Learn best practices through real-world cases
- [FAQ](faq.mdx): See answers to common questions

---

**Last updated**: January 29, 2026