 
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. AC-32, NO. 11, NOVEMBER 1987
Asymptotically Efficient Allocation Rules for the
Multiarmed Bandit Problem with Multiple
Plays - Part I: I.I.D. Rewards
Abstract: At each instant of time we are required to sample a fixed
number m ≥ 1 out of N i.i.d. processes whose distributions belong to a
family suitably parameterized by a real number θ. The objective is to
maximize the long run total expected value of the samples. Following Lai
and Robbins, the learning loss of a sampling scheme corresponding to a
configuration of parameters C = (θ_1, ..., θ_N) is quantified by the regret
R_n(C). This is the difference between the maximum expected reward at
time n that could be achieved if C were known and the expected reward
actually obtained by the sampling scheme. We provide a lower bound for
the regret associated with any uniformly good scheme, and construct a
scheme which attains the lower bound for every configuration C. The
lower bound is given explicitly in terms of the Kullback-Leibler number
between pairs of distributions. Part II of this paper considers the same
problem when the reward processes are Markovian.
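The two quantities named in the abstract can be sketched concretely. The following is a minimal illustration, assuming Bernoulli reward distributions (the paper allows any suitably parameterized family); the function names are illustrative, not from the paper.

```python
import math

def kl_bernoulli(p: float, q: float) -> float:
    """Kullback-Leibler number I(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # clip endpoints so the logarithms stay finite
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def regret(theta: list, m: int, counts: list) -> float:
    """Regret R_n(C): the oracle reward of always sampling the m best arms,
    minus the expected reward actually collected, where counts[j] is the
    number of times arm j was sampled over n instants of m plays each."""
    n = sum(counts) // m                               # number of time instants
    oracle = n * sum(sorted(theta, reverse=True)[:m])  # best m means, played always
    collected = sum(t * c for t, c in zip(theta, counts))
    return oracle - collected
```

For example, with C = (0.9, 0.5, 0.2) and m = 2 plays, a scheme that samples the two best arms at every instant incurs zero regret, while each sample diverted to the inferior arm costs the gap between that arm's mean and the m-th best mean. The lower bound described in the abstract scales like log n weighted by the reciprocals of these Kullback-Leibler numbers.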
I. INTRODUCTION
In this paper we study a version of the multiarmed bandit
