Thompson Sampling

Definition

Thompson Sampling, also known as the Bayesian bandit algorithm or posterior sampling, is a probabilistic algorithm for balancing exploration and exploitation when making decisions under uncertainty. It is widely used in multi-armed bandit problems, online experimentation, and adaptive testing to determine which action (e.g., ad, product recommendation, or website variant) is most likely to yield the best outcome, such as clicks, conversions, or purchases.

Thompson Sampling works by maintaining a probability distribution over the expected reward of each option (or “arm”). At each step, it draws a random sample from each distribution and chooses the action whose sample is highest. Over time, it favors actions that have shown better results while still occasionally exploring alternatives.


How Thompson Sampling Works

  1. Initialize Prior Distributions
    • Begin with a prior belief about the probability of success for each option (e.g., a Beta distribution for binary outcomes like click/no-click).
  2. Sample from Posteriors
    • For each option, sample a value from its probability distribution (posterior).
  3. Select the Best Sample
    • Choose the action (or arm) with the highest sampled value to execute.
  4. Observe Outcome
    • After executing the action, observe the result (e.g., a conversion or no conversion).
  5. Update Beliefs
    • Use Bayes’ Theorem to update the probability distribution (posterior) based on the observed data.

Repeat the process for each round. As more data is collected, the algorithm increasingly favors better-performing options while still allowing for exploration; the full loop is sketched in code below.
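
A minimal sketch of this loop for binary (click/no-click) rewards, using only Python’s standard library. The arm count and the “true” success rates are hypothetical, supplied only to simulate feedback:

```python
import random

def thompson_sampling(true_rates, n_rounds=10_000):
    """Beta-Bernoulli Thompson Sampling over len(true_rates) arms."""
    n_arms = len(true_rates)
    alpha = [1] * n_arms  # Beta(1, 1) priors: 1 + observed successes ...
    beta = [1] * n_arms   # ... and 1 + observed failures, per arm

    for _ in range(n_rounds):
        # Step 2: draw one sample from each arm's posterior.
        samples = [random.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        # Step 3: play the arm with the highest sampled value.
        arm = max(range(n_arms), key=lambda i: samples[i])
        # Step 4: observe a Bernoulli reward (simulated from the true rate).
        reward = 1 if random.random() < true_rates[arm] else 0
        # Step 5: conjugate Bayesian update of the chosen arm's posterior.
        alpha[arm] += reward
        beta[arm] += 1 - reward

    return alpha, beta

alpha, beta = thompson_sampling([0.04, 0.05, 0.07])
for i in range(len(alpha)):
    pulls = alpha[i] + beta[i] - 2
    print(f"arm {i}: pulls={pulls}, posterior mean={alpha[i] / (alpha[i] + beta[i]):.3f}")
```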


Use Cases for Thompson Sampling

  • Adaptive Bandit Testing
    Allocate traffic to the best-performing web page or ad variation while still testing alternatives.
  • Recommendation Engines
    Present content or product suggestions to users based on prior engagement performance.
  • Online Advertising
    Optimize display ads, creatives, and targeting strategies in real-time bidding environments.
  • Clinical Trials
    Allocate patients to treatment groups more likely to succeed, while still gathering data on other options.

Thompson Sampling vs. Other Bandit Algorithms

Algorithm | Exploration Strategy | Strengths
Thompson Sampling | Probabilistic sampling from belief distributions | Balances exploration and exploitation efficiently
Epsilon-Greedy | Mostly exploits; explores randomly with probability ε | Simple to implement, may underexplore
Upper Confidence Bound (UCB) | Explores based on upper bounds of confidence intervals | More deterministic; prioritizes uncertainty

Thompson Sampling is often preferred for its Bayesian approach, which lets it incorporate uncertainty naturally and perform well with limited data. The sketch below contrasts the three selection rules.
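
To make the table concrete, here is a hedged sketch of each selection rule in Python. The function signatures are illustrative rather than taken from any particular library, and `ucb1_choice` assumes every arm has already been played at least once (so `counts[i] > 0` and `t >= 1`):

```python
import math
import random

def ts_choice(alpha, beta):
    # Thompson Sampling: sample each Beta posterior, play the argmax.
    samples = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(samples)), key=lambda i: samples[i])

def epsilon_greedy_choice(means, epsilon=0.1):
    # Epsilon-greedy: explore a uniformly random arm with probability
    # epsilon, otherwise exploit the arm with the best observed mean.
    if random.random() < epsilon:
        return random.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])

def ucb1_choice(means, counts, t):
    # UCB1: play the arm with the highest optimistic upper bound;
    # rarely played arms get a large bonus, so uncertainty is prioritized.
    return max(range(len(means)),
               key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
```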


Benefits of Thompson Sampling

  • Efficient Balance of Learning and Earning
    Finds the best option quickly while continuing to explore alternatives just enough to avoid missing better opportunities.
  • Strong Theoretical Guarantees
    Has proven regret bounds and often matches or outperforms other bandit algorithms in both simulations and real-world applications.
  • Adaptability
    Variants such as discounted Thompson Sampling can handle non-stationary environments where reward probabilities drift over time (see the sketch after this list).
  • Natural Handling of Uncertainty
    Bayesian updating allows for intuitive modeling of confidence and uncertainty.
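
One common way to get that adaptability is discounted Thompson Sampling, sketched below for Bernoulli rewards. The forgetting factor `gamma` is a hypothetical tuning parameter that decays old evidence toward the Beta(1, 1) prior each round:

```python
def discounted_update(alpha, beta, arm, reward, gamma=0.99):
    """One bookkeeping round of discounted Thompson Sampling.

    Before the new observation is added, every arm's pseudo-counts are
    decayed toward the Beta(1, 1) prior, so stale evidence fades and the
    algorithm keeps exploring if reward rates drift over time.
    """
    for i in range(len(alpha)):
        alpha[i] = 1 + gamma * (alpha[i] - 1)
        beta[i] = 1 + gamma * (beta[i] - 1)
    alpha[arm] += reward       # reward is 1 (success) or 0 (failure)
    beta[arm] += 1 - reward
```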

Limitations

  • Computational Complexity
    For problems with many options or continuous action spaces, sampling and updating can become computationally intensive.
  • Requires Prior Knowledge
    Performance may depend on how well prior distributions are chosen, especially in the early stages.
  • Difficult to Implement for Non-Binary Rewards
    Extensions exist for non-Bernoulli reward distributions, but they require different conjugate models or approximate posteriors, which adds complexity (see the Gaussian sketch after this list).
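
As one example of such an extension, the sketch below assumes real-valued rewards with (for simplicity) known noise variance, so each arm can keep a conjugate Normal posterior over its expected reward. The function names and the `noise_precision` parameter are illustrative assumptions:

```python
import random

def gaussian_ts_choice(means, precisions):
    # Sample each arm's Normal posterior over its expected reward,
    # then play the arm with the highest sample.
    samples = [random.gauss(m, (1.0 / p) ** 0.5)
               for m, p in zip(means, precisions)]
    return max(range(len(samples)), key=lambda i: samples[i])

def gaussian_ts_update(means, precisions, arm, reward, noise_precision=1.0):
    # Conjugate Normal-Normal update for the chosen arm, assuming the
    # reward noise variance (1 / noise_precision) is known.
    new_precision = precisions[arm] + noise_precision
    means[arm] = (precisions[arm] * means[arm]
                  + noise_precision * reward) / new_precision
    precisions[arm] = new_precision
```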

Example: Website Button Testing

Imagine a company testing three versions of a call-to-action (CTA) button. Using Thompson Sampling (simulated in the sketch after this list):

  • Each version starts with a Beta(1,1) prior (uniform distribution).
  • As users click (or don’t), the algorithm updates the Beta distribution for each version.
  • At each new session, Thompson Sampling draws a sample from each version’s distribution.
  • The version with the highest sample is shown.
  • Over time, the best-performing button is shown more often, while still occasionally showing others for continued learning.
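
A runnable simulation of this scenario, with made-up click-through rates of 5%, 6%, and 8% standing in for real user behavior:

```python
import random

true_ctr = {"A": 0.05, "B": 0.06, "C": 0.08}   # hypothetical click rates
posterior = {v: [1, 1] for v in true_ctr}       # Beta(1, 1) prior per version
shown = {v: 0 for v in true_ctr}
n_sessions = 20_000

for _ in range(n_sessions):
    # Draw one sample per version and show the highest-sampling button.
    samples = {v: random.betavariate(a, b) for v, (a, b) in posterior.items()}
    choice = max(samples, key=samples.get)
    shown[choice] += 1
    # Simulate the user's click and update that version's posterior.
    if random.random() < true_ctr[choice]:
        posterior[choice][0] += 1   # click: alpha += 1
    else:
        posterior[choice][1] += 1   # no click: beta += 1

for v, count in shown.items():
    print(f"version {v}: shown {count} times ({100 * count / n_sessions:.1f}%)")
```

Running this typically ends with version C receiving the large majority of traffic, while A and B still get a small share for continued learning.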

Conclusion

Thompson Sampling is a powerful and elegant method for sequential decision-making under uncertainty. Its ability to blend exploitation of known winners with exploration of potential alternatives makes it ideal for real-time optimization in areas like digital marketing, personalization, product testing, and operations research. As AI-driven systems become more autonomous, algorithms like Thompson Sampling will play a critical role in continuously improving decision outcomes in complex, data-rich environments.
