Title: Experiment Design
Locale: en
URL: https://sensorswave.com/en/docs/experiments/experiment-design/
Description: Learn to design scientifically rigorous A/B experiments

Scientific experiment design is the foundation for reliable conclusions. This article introduces how to design a rigorous A/B experiment starting from problem definition, covering Hypothesis formulation, Metric selection, sample size calculation, Duration determination, and other key steps.

## Five-Step Experiment Design

### Step 1: Define the Problem and Hypothesis

**Problem Identification**

Clearly define what problem you want to solve:

- Low conversion rate: Is the checkout flow too complex?
- Low click-through rate: Is the button design not prominent enough?
- Low retention rate: Is the new user onboarding unclear?
- Slow revenue growth: Does the pricing strategy need adjustment?

**Formulate a Hypothesis**

Based on the problem, propose a testable Hypothesis.

**Good Hypotheses are**:

- Specific: Clearly state what to change
- Measurable: Have clear Metrics
- Expectation-bearing: Specify the expected improvement

**Examples**:

| Problem | Hypothesis |
|------|------|
| Checkout conversion rate is only 20% | Simplifying the checkout flow from 5 steps to 3 steps can increase conversion rate to 25% |
| Low add-to-cart button click-through rate | Changing the button color from blue to red can increase click-through rate by 10% |
| Unsatisfactory recommendation click-through rate | Using a deep learning algorithm can increase recommendation click-through rate by 15% |
| Low VIP conversion rate | Reducing the annual fee from ¥299 to ¥249 can increase purchase conversion rate by 30% |

### Step 2: Select Experiment Metrics

Experiment Metrics fall into three categories:

#### Primary Metric

The core Metric of focus, used to determine experiment Success or Failure:

- **Conversion metrics**: Click-through rate, registration rate, payment conversion rate
- **Revenue metrics**: Average revenue per user, total revenue, ROI
- **Engagement metrics**: Active days, session duration, content consumption

**Selection principles**:

- Directly related to business goals
- Can be accurately measured
- Sensitive to changes in user behavior

**Examples**:

| Experiment Type | Primary Metric |
|---------|---------|
| Checkout flow optimization | Payment conversion rate |
| Button color test | Button click-through rate |
| Recommendation algorithm comparison | Recommendation click-through rate |
| Pricing strategy test | Purchase conversion rate, total revenue |

#### Secondary Metrics

Supporting Metrics that help fully understand experiment impact:

- **User experience metrics**: Page dwell time, bounce rate
- **Downstream metrics**: Add-to-cart rate, bookmark rate, share rate
- **Long-term metrics**: Next-day retention, 7-day retention

**Example**: Checkout flow optimization experiment:

- Primary Metric: Payment conversion rate
- Secondary Metrics: Average checkout duration, abandonment rate, order value

#### Guardrail Metrics

Ensure the experiment does not negatively affect critical Metrics:

- **Technical metrics**: Page load time, error rate, crash rate
- **User satisfaction**: NPS, complaint rate, uninstall rate
- **Revenue protection**: ARPU (Average Revenue Per User), total revenue should not decline

**Importance**:

- Prevent optimizing one Metric at the expense of others
- Example: While improving click-through rate, ensure conversion rate does not decline

**Examples**:

| Experiment Type | Guardrail Metrics |
|---------|---------|
| Recommendation algorithm comparison | Add-to-cart conversion rate, total revenue, page load time |
| UI redesign | Page load time, error rate, user complaint rate |
| Pricing strategy test | Total revenue, user satisfaction, churn rate |

### Step 3: Determine Experiment Variants

#### Two-Variant Experiment (Recommended)

**Control vs Test Group**:

- Control: Current solution (baseline)
- Test Group: New solution

**Advantages**:

- Simple to interpret results
- Lower sample size requirement
- Shorter Duration

**Applicable scenarios**:

- Validating a single optimization Hypothesis
- Comparing two clearly defined solutions

#### Multi-Variant Experiment

**Control vs Multiple Test Groups**:

- Control: Current solution
- Test Group A: Solution A
- Test Group B: Solution B
- Test Group C: Solution C (optional)

**Advantages**:

- Compare multiple solutions in one experiment
- Saves time and traffic

**Disadvantages**:

- Higher sample size requirement (longer Duration needed)
- Multiple comparison problem (significance level adjustment needed)

**Applicable scenarios**:

- Pricing strategy testing (comparing 3–4 price points)
- Design solution selection (comparing multiple designs)

**Considerations**:

- Keep the number of Variants manageable (recommended no more than 4)
- Each Variant needs a sufficient sample size

### Step 4: Calculate Sample Size

Sample size determines how many users need to participate for reliable conclusions.

#### Sample Size Formula

```
Required sample per group = (Z_α/2 + Z_β)² × 2 × p × (1-p) / (MDE)²
```

**Parameter definitions**:

| Parameter | Meaning | Common Value |
|------|------|--------|
| **α (Alpha)** | Significance level | 0.05 (5%) |
| **β (Beta)** | Complement of statistical power | 0.2 (Power = 80%) |
| **p** | Baseline conversion rate | Actual business data |
| **MDE** | Minimum Detectable Effect | Expected improvement |

**Simplified calculation**:

For conversion rate Metrics, use this approximation:

```
Sample per group ≈ 16 × p × (1-p) / (MDE)²
```

**Example**:

Scenario: Current click-through rate is 20%, and we want to detect a 10% relative lift (from 20% to 22%)

```
p = 0.20
MDE = 0.02 (10% relative lift = 2% absolute lift)
Sample per group ≈ 16 × 0.20 × 0.80 / (0.02)²
                 ≈ 16 × 0.16 / 0.0004
                 ≈ 6,400
```

**Conclusion**: At least 6,400 users per group, 12,800 total for two groups.

#### MDE (Minimum Detectable Effect) Selection

**Definition**: The smallest improvement the experiment can detect.
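As a sanity check on the sample-size formula and the ×16 rule of thumb above, here is a minimal Python sketch (not part of the SensorsWave product; the function name is illustrative) that computes the exact per-group sample size from α, power, baseline rate, and MDE:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p: float, mde: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size for a two-group test of proportions.

    p     -- baseline conversion rate (e.g. 0.20)
    mde   -- minimum detectable effect, absolute (e.g. 0.02)
    alpha -- two-sided significance level
    power -- desired statistical power (1 - beta)
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # ≈ 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # ≈ 0.84 for power = 0.80
    n = (z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / mde ** 2
    return ceil(n)

# Worked example from the text: 20% baseline, 2% absolute lift
print(sample_size_per_group(0.20, 0.02))  # -> 6280
```

The exact result (6,280 per group) is slightly below the 6,400 given by the simplified ×16 approximation, because (1.96 + 0.84)² × 2 ≈ 15.7 rather than 16.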
**Smaller MDE**:

- Requires a larger sample size
- Longer Duration
- Can detect more subtle differences

**Larger MDE**:

- Requires a smaller sample size
- Shorter Duration
- Can only detect obvious differences

**Recommended MDE**:

| Metric Type | Recommended MDE | Notes |
|---------|---------|------|
| Conversion rate (low baseline) | 10–20% relative lift | When baseline < 10% |
| Revenue metrics | 5–10% relative lift | Core business Metrics |
| Engagement metrics | 10–15% relative lift | Supporting Metrics |

**Practical tips**:

- New products or features: MDE can be larger (15–20%)
- Mature product optimization: MDE should be smaller (5–10%)
- Core Metric optimization: the MDE can be relaxed slightly, or the experiment Duration extended

#### Online Sample Size Calculators

Use online tools for quick sample size calculation:

- Evan Miller's calculator: https://www.evanmiller.org/ab-testing/sample-size.html
- Optimizely's calculator: https://www.optimizely.com/sample-size-calculator/

### Step 5: Determine Experiment Duration

#### Duration Based on Sample Size

```
Duration (days) = Total sample size / (Daily users × Experiment traffic ratio)
```

**Example**:

Assumptions:

- Total sample size: 12,800 (two-variant, 6,400 per group)
- Daily users: 5,000
- Experiment traffic ratio: 50% (Control 25% + Test Group 25%)

```
Duration = 12,800 / (5,000 × 0.5) = 5.12 days
```

**Recommendation**: Round up to a complete cycle, at least 7 days (covering a full weekday + weekend cycle).
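The duration calculation above can be sketched in Python. This is a minimal illustration (the function name is hypothetical, not a SensorsWave API) that applies the formula and enforces the 7-day minimum recommended in the text:

```python
from math import ceil

def experiment_duration_days(total_sample: int, daily_users: int,
                             traffic_ratio: float, min_days: int = 7) -> int:
    """Days needed to collect the required sample.

    total_sample  -- required users across all Variants
    daily_users   -- average daily active users
    traffic_ratio -- share of traffic allocated to the experiment (0-1)
    min_days      -- floor covering a full weekday + weekend cycle
    """
    raw_days = total_sample / (daily_users * traffic_ratio)
    return max(ceil(raw_days), min_days)

# Worked example from the text: 12,800 users, 5,000 daily users, 50% traffic
print(experiment_duration_days(12_800, 5_000, 0.5))  # 5.12 days raw -> 7
```

Note that the 7-day floor applies even when the raw sample-based estimate is shorter, so that the run always covers at least one full weekly cycle.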
#### Consider Cyclical Factors

**Weekday vs Weekend differences**:

- E-commerce: Weekend traffic and conversion rates may be higher
- B2B products: Weekday traffic and activity are higher

**Recommendations**:

- Run for at least one full week (7 days)
- When an experiment spans a holiday, extend the Duration or avoid the holiday period

**Special periods**:

- During major promotions (Singles' Day, Black Friday): Avoid running experiments, or analyze the period separately
- During marketing campaigns: Results may be affected; exclude the period or analyze it separately

#### Duration Reference Table

| Traffic Scale | Recommended Duration | Notes |
|---------|---------|------|
| High traffic (> 100K daily) | 1–2 weeks | Sufficient sample size, quick conclusions |
| Medium traffic (10K–100K daily) | 2–4 weeks | Needs enough time to collect samples |
| Low traffic (< 10K daily) | 4+ weeks | Longer runs needed to collect samples; consider a larger MDE |

**Success/Failure criteria (example)**:

- **Success**: Click-through rate lift > 10%, statistically Significant (p < 0.05), no negative impact on guardrail Metrics
- **Failure**: Click-through rate lift < 5%, or guardrail Metrics decline

---

## Related Documentation

- [Core Concepts](core-concepts.mdx): Understand how A/B experiments work
- [Create and Configure](create-and-configure.mdx): Create experiments in the console
- [Targeting and Allocation](targeting-and-allocation.mdx): Deep dive into the split mechanism
- [Metrics and Analysis](metrics-and-analysis.mdx): Analyze experiment results

---

**Last updated**: January 29, 2026