Title: Best Practices
Locale: en
URL: https://sensorswave.com/en/docs/experiments/best-practices/
Description: Summary of experiment design and execution best practices

This article summarizes best practices for A/B experiments from design through implementation to analysis, helping you avoid common mistakes and achieve reliable experiment conclusions.

## Experiment Design Principles

### 1. Test One Hypothesis at a Time

**Principle**: Single-variable principle — ensure experiment results are attributable.

**Correct approach**:

```
Experiment: Test the impact of button color on click-through rate
Variable: Only change button color (blue vs red)
Keep everything else the same: copy, size, position
```

**Wrong approach**:

```
Experiment: Optimize add-to-cart button
Change simultaneously: color, copy, size, position
Result: Cannot determine which factor caused the change
```

### 2. Ensure Sufficient Sample Size

**Principle**: At least 1,000 users per group, preferably 5,000+.

**Recommended**:

- Calculate sample size in advance
- Set a minimum Duration (at least 1 week)
- Extend the Duration if the sample size is insufficient

**Avoid**:

- Ending experiments too early (sample size < 500)
- Immediately rolling out when the Test Group appears to lead
- Skipping sample size calculation

### 3. Run Complete Experiment Cycles

**Principle**: Run for at least one full week (7 days), covering weekdays and weekends.

**Considerations**:

- Weekday vs weekend: User behavior may differ
- Holiday impact: Avoid major holidays or extend the Duration
- Marketing campaigns: Avoid running experiments during promotions

**Recommended Duration**:

- Standard experiments: 2–4 weeks
- High-traffic products: 1–2 weeks
- Low-traffic products: 4–8 weeks

### 4. Set Guardrail Metrics

**Principle**: Prevent optimizing one Metric at the expense of other important Metrics.
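The guardrail principle can be sketched as a pre-rollout decision check. The sketch below is illustrative only: the function name, metric names, lift values, and the -2% tolerance threshold are all hypothetical, not part of any SDK.

```python
# Sketch of a guardrail check before rolling out a winning variant.
# Metric names, lift values, and thresholds are hypothetical examples.

def can_roll_out(primary_lift: float, primary_p: float,
                 guardrails: dict[str, float],
                 max_guardrail_drop: float = -0.02) -> bool:
    """Roll out only if the primary Metric wins significantly AND
    no Guardrail Metric drops more than the allowed threshold."""
    primary_wins = primary_lift > 0 and primary_p < 0.05
    guardrails_ok = all(lift >= max_guardrail_drop
                        for lift in guardrails.values())
    return primary_wins and guardrails_ok

# CTR is up 15% and significant, but total revenue dropped 10%:
decision = can_roll_out(
    primary_lift=0.15, primary_p=0.001,
    guardrails={"total_revenue": -0.10, "page_load_time": 0.0},
)
print(decision)  # False: the revenue guardrail blocks the rollout
```

Encoding the rule as a function makes the rollout criterion explicit and reviewable, rather than leaving it to case-by-case judgment.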
**Guardrail Metric examples**:

| Experiment Type | Guardrail Metrics |
|---------|---------|
| UI redesign | Page load time, error rate, user satisfaction |
| Recommendation algorithm | User retention, total revenue, content diversity |
| Pricing strategy | Customer lifetime value (LTV), churn rate, brand perception |
| Checkout flow | Payment success rate, total revenue, user complaint rate |

---

## Naming Conventions

### Experiment Naming

**Format**: `module_feature_purpose`

**Recommended**:

- `cart_button_color_test`
- `checkout_flow_optimization`
- `recommendation_algorithm_comparison`
- `pricing_strategy_test`

**Avoid**:

- `test1`, `experiment_new` (not semantic enough)
- Non-English naming
- `cart-button-test` (uses hyphens; use underscores instead)

### Variant Naming

**Recommended**:

- Control: `control`
- Single Test Group: `treatment`
- Multiple Test Groups: `treatment_a`, `treatment_b`

**Avoid**:

- `v1`, `v2` (not semantic enough)
- `old`, `new` (easily confused)

### Variable Naming

**Use snake_case**:

- `button_color`
- `checkout_steps`
- `recommendation_count`

**Avoid**:

- `color` (too generic)
- `buttonColor` (camelCase)

---

## Traffic Management

### 1. Prioritize Traffic for Important Experiments

**Allocation example**:

```
Total traffic: 100%
├─ P0 experiment (Checkout flow optimization): 40%
├─ P1 experiment (Recommendation algorithm test): 30%
├─ P2 experiment (Copy optimization): 10%
└─ Reserved traffic (future experiments): 20%
```

**Principles**:

- Higher-priority experiments get more traffic
- Reserve 10–20% of traffic for urgent experiments
- Core flow experiments take priority

### 2. Avoid Running Conflicting Experiments Simultaneously

**Conflict types**:

**Feature conflicts**:

- ❌ Experiment A: Optimize checkout flow
- ❌ Experiment B: Optimize checkout page layout
- ✅ Solution: Run Experiment A first, then Experiment B after it ends

**Metric conflicts**:

- ❌ Experiment A: Improve add-to-cart rate (primary Metric)
- ❌ Experiment B: Improve detail page dwell time (secondary Metric: add-to-cart rate)
- ✅ Solution: Run separately or use different user groups

### 3. Reserve Traffic for Future Experiments

**Recommendations**:

- High-traffic products: Reserve 15–20%
- Medium-traffic products: Reserve 10–15%
- Low-traffic products: Reserve 5–10%

---

## Data Quality

### 1. Verify Exposure Logs Are Being Reported

**Verification steps**: Immediately after release, check exposure logs:

```
Event: $ABImpress
Filter: experiment_key = 'your_experiment_key'
Time range: Last 1 hour
```

**Verification checklist**:

- ✅ Exposure Events are being reported normally
- ✅ Each user has only one exposure log
- ✅ Traffic distribution is even (deviation < 5%)

### 2. Monitor Data Anomalies

**Anomaly types**:

| Anomaly | Possible Cause | Resolution |
|------|---------|---------|
| Exposure count drops suddenly | Code error, experiment paused | Check code and experiment status |
| Uneven traffic distribution | Targeting Rules too restrictive | Check targeting conditions |
| Abnormally high conversion rate | Data error, fraud | Check data quality |
| Abnormally low conversion rate | Bug, UX issue | Pause experiment, investigate |

### 3. Regularly Check Split Uniformity

**Frequency**: Check daily for the first 3 days of the experiment.
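This daily check can be automated with a chi-square goodness-of-fit test on the per-variant user counts (a sample-ratio mismatch, or SRM, check). The sketch below assumes a two-variant 50/50 Allocation; the function name and the counts are hypothetical, and only the Python standard library is used.

```python
# Sample-ratio mismatch (SRM) check for a two-variant 50/50 split.
# Counts are hypothetical; in practice they come from your assignment data.
import math

def srm_p_value(control_users: int, treatment_users: int) -> float:
    """Chi-square goodness-of-fit test (1 degree of freedom) against
    an expected 50/50 split. Returns the p-value."""
    total = control_users + treatment_users
    expected = total / 2
    chi2 = ((control_users - expected) ** 2 / expected
            + (treatment_users - expected) ** 2 / expected)
    # For 1 degree of freedom, the chi-square survival function
    # (p-value) equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# A 5,050 vs 4,950 split is normal variation (p ≈ 0.32):
print(srm_p_value(5_050, 4_950) < 0.001)  # False
# A 5,500 vs 4,500 split signals a real assignment problem:
print(srm_p_value(5_500, 4_500) < 0.001)  # True
```

A common practice is to alert only when p < 0.001, so that routine daily checks do not produce false alarms on ordinary random fluctuation.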
**Method**:

```sql
SELECT
  variant,
  COUNT(DISTINCT user_id) AS user_count,
  COUNT(DISTINCT user_id) * 1.0
    / SUM(COUNT(DISTINCT user_id)) OVER () AS percentage
FROM experiment_assignments
WHERE experiment_key = 'your_experiment_key'
GROUP BY variant
```

**Expected result** (50/50 Allocation):

| Variant | Users | Percentage |
|------|--------|------|
| control | ~5,000 | ~50% |
| treatment | ~5,000 | ~50% |

---

## Decision Standards

### 1. Wait for Sufficient Sample Size

**Do not draw conclusions too early**:

```
❌ Wrong:
Day 1: Test Group leads by 10%, immediately roll out
Result: Long-term effect is poor, resources wasted

✅ Correct:
Wait 2 weeks, reach sample size (5,000 users per group)
Confirm statistical significance (p < 0.05)
Observe all Metrics holistically
Make decision
```

### 2. Rely on Statistical Significance

**Criteria**:

```
p < 0.05: Result is Significant, can roll out
p ≥ 0.05: Result is not Significant, need more data or abandon Hypothesis
```

**Avoid misinterpretation**:

```
❌ Wrong:
Conversion rate lift 5%, p = 0.12
Conclusion: Experiment succeeded, roll out

✅ Correct:
Conversion rate lift 5%, p = 0.12 (not Significant)
Conclusion: Lift may be due to chance, need to extend experiment or abandon Hypothesis
```

### 3. Consider the Full Picture

**Observe all Metrics**:

```
✅ Recommended:
Primary Metric: CTR +15% (p = 0.001)
Secondary Metric: Conversion rate +8% (p = 0.02)
Guardrail Metric: Total revenue +12% (p = 0.005)
Conclusion: Comprehensive Success, roll out

❌ Avoid:
Primary Metric: CTR +15% (p = 0.001)
Secondary Metric: Conversion rate -5% (p = 0.03)
Guardrail Metric: Total revenue -10% (p = 0.01)
Wrong conclusion: Experiment succeeded (only looking at primary Metric)
Correct conclusion: Experiment Failed (CTR improved but conversion and revenue declined)
```

### 4. Document the Decision Process

**Build experiment archives**: Every experiment should have complete records:

- Experiment Hypothesis
- Experiment design
- Key data
- Decision rationale
- Follow-up actions

**Purposes**:

- Team review and learning
- Avoid repeating experiments
- Accumulate experiment experience

---

## Combining with Feature Gates

### Feature Gate First, Then A/B Experiment

**Recommended workflow**:

```
Phase 1: Feature Gate for stability validation (1–2 weeks)
- Validate technical Metrics: error rate, response time, crash rate
- Gradual rollout: 1% → 10% → 50% → 100%

Phase 2: A/B Experiment for effectiveness validation (2–4 weeks)
- Validate business Metrics: conversion rate, retention rate, revenue
- Random grouping: 50% vs 50%
- Statistical significance testing

Phase 3: Full release
- Apply the winning solution
- Clean up code
```

**Advantages**:

- Technical risk is controlled
- Data-driven decisions
- Quick rollback capability

See [Feature Gates vs A/B Testing](../feature-gates/gates-vs-experiments.mdx) for details.

---

## Common Mistakes and Solutions

### Mistake 1: Insufficient Sample Size

**Problem**: The experiment runs for 3 days with 200 users per group, and conclusions are drawn.

**Consequence**: Unreliable results, easily influenced by random factors.

**Solution**:

- Calculate sample size in advance
- Set a minimum Duration (at least 1 week)
- Analyze only after reaching the sample size

### Mistake 2: Ignoring Statistical Significance

**Problem**: Only looking at lift, ignoring the P-Value.

**Example**:

```
Conversion rate lift 5% (p = 0.15)
Wrong conclusion: Experiment succeeded
Correct conclusion: Result is not Significant, may be random fluctuation
```

**Solution**:

- Consider both lift and the P-Value
- Only consider a result Significant when p < 0.05

### Mistake 3: Changing Multiple Variables at Once

**Problem**: Simultaneously changing button color, copy, and size.

**Consequence**: Cannot determine which factor caused the change.
**Solution**:

- Test one Hypothesis at a time
- If multiple variables must change, design a multi-arm experiment

### Mistake 4: Stopping Experiments Too Early

**Problem**: Seeing the Test Group lead and immediately stopping to roll out.

**Consequences**:

- Short-term fluctuations may not represent long-term effects
- The novelty effect fades and Metrics may regress

**Solution**:

- Wait until the predetermined sample size is reached
- Run for a complete cycle
- Monitor long-term Metrics (retention, LTV)

### Mistake 5: Ignoring Guardrail Metrics

**Problem**: Only focusing on the primary Metric, ignoring others.

**Example**:

```
Primary Metric: CTR +20% (Success)
Guardrail Metric: Total revenue -15% (severe decline)
Wrong decision: Roll out
Correct decision: Abandon (CTR improved but revenue declined)
```

**Solution**:

- Set Guardrail Metrics
- Observe all Metrics holistically
- Avoid "vanity Metrics"

---

## Experiment Checklist

Use this checklist before releasing an experiment:

### Experiment Design

- [ ] Hypothesis is clear and testable
- [ ] Only one variable being tested
- [ ] Sample size calculated correctly
- [ ] Duration is sufficient (≥ 1 week)
- [ ] Allocation is reasonable (total = 100%)

### Metric Selection

- [ ] Primary Metric aligns with business goals
- [ ] Secondary Metrics are set
- [ ] Guardrail Metrics are set
- [ ] All Metrics can be accurately measured

### Configuration Check

- [ ] Experiment Key is correct and unique
- [ ] Variant naming follows conventions
- [ ] Dynamic variable types are consistent
- [ ] Targeting Rules are configured correctly

### Code Integration

- [ ] SDK has A/B testing enabled
- [ ] Code integration is correct
- [ ] Verified in test environment
- [ ] Error handling is thorough

### Monitoring Preparation

- [ ] Data monitoring is set up
- [ ] Rollback plan is prepared
- [ ] Team members understand the experiment

---

## Related Documentation

- [Experiment Design](experiment-design.mdx): Learn scientific experiment design methods
- [Use Cases](use-cases.mdx): Learn best practices through real-world cases
- [FAQ](faq.mdx): See answers to common questions

---

**Last updated**: January 29, 2026