Multi-Armed Bandit Setting
Environment
A bandit problem is a sequential game between a learner and an environment. The game is played over n rounds, where n is called the horizon. At round t ∈ [1, n], the learner selects an action At from a set of possible actions A; the actions are often referred to as arms. The environment then provides a reward Xt ∈ R. If the bandit environment is stochastic, the reward associated with each action is not deterministic; instead, it follows a probability distribution specific to that action.
Under such a framework, the goal of the learner is to maximize the cumulative reward it receives over the n rounds, which requires identifying and playing the actions with the highest expected rewards.
To achieve this goal, the learner follows a policy π, an algorithm that maps its observed history Ht−1 = (A1, X1, A2, X2, . . . , At−1, Xt−1) to its next action.
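To make this interaction concrete, below is a minimal sketch of the learner-environment loop under the assumption of Gaussian rewards; the class name GaussianBandit, the uniform-random policy, and the chosen arm means are illustrative, not part of the project's actual code.

```python
import numpy as np

class GaussianBandit:
    """Stochastic environment: each arm's reward is drawn from N(mu_a, 1)."""
    def __init__(self, means, rng):
        self.means = np.asarray(means, dtype=float)
        self.rng = rng

    def pull(self, arm):
        return self.rng.normal(self.means[arm], 1.0)

def run(policy, env, n):
    """Play n rounds; the policy maps the observed history to the next action."""
    history = []                      # H_{t-1} = (A_1, X_1, ..., A_{t-1}, X_{t-1})
    for t in range(n):
        arm = policy(history)         # A_t
        reward = env.pull(arm)        # X_t
        history.append((arm, reward))
    return history

rng = np.random.default_rng(0)
env = GaussianBandit([0.2, 0.5, 0.9], rng)
uniform_policy = lambda history: rng.integers(len(env.means))  # explore at random
history = run(uniform_policy, env, n=1000)
```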
In order to evaluate the performance of a learner, one can look at the regret Rn of the learner's policy:

Rn = n·μ* − E[ ∑t=1..n Xt ],

where μ* = maxa∈A μa is the largest mean reward among the arms.
In plain words, the regret is the difference between the cumulative reward that would be obtained by always playing the optimal action and the expected cumulative reward obtained under the learner's policy. A smaller regret corresponds to a better-performing policy.
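As a quick illustration, the sketch below continues the interaction-loop example above and computes the regret of the uniform-random policy, reusing the hypothetical env and history objects created there.

```python
# Continues the earlier sketch: env and history come from the interaction-loop example.
import numpy as np

n = len(history)
mu_star = env.means.max()                    # mean reward of the optimal arm
pulls = np.array([arm for arm, _ in history])
expected_reward = env.means[pulls].sum()     # expected cumulative reward of the chosen arms
regret = n * mu_star - expected_reward       # Rn = n·μ* − E[∑ Xt]
print(f"Regret over {n} rounds: {regret:.1f}")
```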
One-Armed Bandit Setting
The one-armed bandit setting is a two-armed bandit setting consisting of an unknown, stochastic arm whose reward is Bernoulli distributed with parameter p, and a known, deterministic arm that always pays some constant reward c.
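A minimal sketch of this environment, with the values of p and c and the class name chosen purely for illustration, might look as follows.

```python
import numpy as np

class OneArmedBandit:
    """Arm 0 is Bernoulli(p) with p unknown to the learner; arm 1 always pays c."""
    def __init__(self, p, c, rng):
        self.p, self.c, self.rng = p, c, rng

    def pull(self, arm):
        if arm == 0:                              # unknown, stochastic arm
            return float(self.rng.random() < self.p)
        return self.c                             # known, deterministic arm

env = OneArmedBandit(p=0.6, c=0.5, rng=np.random.default_rng(1))
```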
Policies
Upper Confidence Bound (UCB)
UCB is often referred to as an "optimistic" policy because, at each round t, it evaluates every arm while accounting for the potential of arms that have been explored little. The policy assigns an upper confidence bound to each arm it has already pulled and chooses the arm with the highest bound. The bound deliberately leans toward an overestimate, so that repeated pulls of an arm gradually bring its bound down toward the arm's true mean. Over time, each arm's confidence interval shrinks and its estimate moves closer to the true mean.
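As a concrete illustration, below is a minimal UCB1-style sketch. The exploration constant of 2, the function name ucb, and the assumption that rewards lie in [0, 1] (under which the classical bound is derived) are illustrative choices rather than the project's exact implementation. It can be run against either environment sketched above, e.g. ucb(env, n_arms=2, n=1000).

```python
import math
import numpy as np

def ucb(env, n_arms, n):
    counts = np.zeros(n_arms)          # number of pulls of each arm so far
    sums = np.zeros(n_arms)            # running sum of rewards per arm
    total = 0.0
    for t in range(1, n + 1):
        if t <= n_arms:
            arm = t - 1                                  # pull each arm once to initialize
        else:
            means = sums / counts                        # empirical mean of each arm
            bonus = np.sqrt(2 * math.log(t) / counts)    # optimism bonus, shrinks with pulls
            arm = int(np.argmax(means + bonus))          # arm with the highest upper bound
        reward = env.pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total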
Thompson Sampling
Thompson sampling utilizes a preconceived assumption, or prior, regarding each arm to assess the current optimal action at round t. The policy samples from each arm's current distribution, plays the arm with the highest sample, and then uses the observed reward to update that arm's distribution, or posterior, which is used for that arm in subsequent rounds. The initial priors encourage the policy to explore the arms so that the posteriors are guided toward each arm's true behavior.
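Below is a minimal Thompson-sampling sketch for Bernoulli rewards with Beta(1, 1) priors; the prior choice and the function name are assumptions for illustration. It expects 0/1 rewards, so it pairs naturally with the OneArmedBandit sketch above.

```python
import numpy as np

def thompson(env, n_arms, n, rng):
    alpha = np.ones(n_arms)            # Beta posterior parameter: successes + 1
    beta = np.ones(n_arms)             # Beta posterior parameter: failures + 1
    total = 0.0
    for t in range(n):
        samples = rng.beta(alpha, beta)        # one draw per arm from its posterior
        arm = int(np.argmax(samples))          # act greedily on the sampled means
        reward = env.pull(arm)                 # assumed to be 0 or 1
        alpha[arm] += reward                   # posterior update for the pulled arm
        beta[arm] += 1 - reward
        total += reward
    return total

total = thompson(env, n_arms=2, n=1000, rng=np.random.default_rng(2))
```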
Bayesian Optimal Policy (BOP)
For this project, we will use the BOP in the one-armed bandit setting. In this case, there are two arms and there still exists a horizon with n rounds; however, the mean of the second arm is known in advance, while the first arm follows an unknown behavior. The setting therefore admits a "retirement" structure: the policy keeps pulling the unknown arm until its estimated value falls below that of the known arm, and once this criterion is met it switches to the known arm for the rest of the rounds. The policy places a prior on the unknown arm and, at each round, computes a statistic that determines whether to continue pulling the first arm or to retire to the second. To do so, it works backwards from the horizon, building a table of the optimal value at every round for every possible outcome of the pulls so far. The policy is then able to iterate forwards, pulling the arm and looking up the pre-computed decision that corresponds to the observed rewards.
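The sketch below illustrates this backward-induction idea for the one-armed Bernoulli bandit under a Beta(1, 1) prior. It uses memoized recursion in place of an explicit table of values, which is equivalent; the function names, the prior, and the example parameters are assumptions for illustration rather than the project's exact implementation.

```python
from functools import lru_cache

def bayes_optimal_policy(n, c, prior=(1.0, 1.0)):
    a0, b0 = prior                                # assumed Beta prior on the unknown arm

    @lru_cache(maxsize=None)
    def value(remaining, s, f):
        """Optimal expected reward over `remaining` rounds, given s successes
        and f failures observed so far from the unknown arm."""
        if remaining == 0:
            return 0.0
        p = (a0 + s) / (a0 + b0 + s + f)          # posterior mean of the unknown arm
        retire = c * remaining                    # pull the known arm until the horizon
        explore = (p * (1.0 + value(remaining - 1, s + 1, f))
                   + (1.0 - p) * value(remaining - 1, s, f + 1))
        return max(retire, explore)

    def act(remaining, s, f):
        """Return 0 to pull the unknown arm, 1 to retire to the known arm."""
        p = (a0 + s) / (a0 + b0 + s + f)
        explore = (p * (1.0 + value(remaining - 1, s + 1, f))
                   + (1.0 - p) * value(remaining - 1, s, f + 1))
        return 0 if explore >= c * remaining else 1

    return act

act = bayes_optimal_policy(n=50, c=0.5)
print(act(remaining=50, s=0, f=0))   # decision at the first round
```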