 
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. AC-32, NO. 11, NOVEMBER 1987, p. 977
Asymptotically Efficient Allocation Rules for the
Multiarmed Bandit Problem with Multiple
Plays, Part II: Markovian Rewards
Abstract: At each instant of time we are required to sample a fixed
number m ≥ 1 out of N Markov chains whose stationary transition
probability matrices belong to a family suitably parameterized by a real
number θ. The objective is to maximize the long run expected value of the
samples. The learning loss of a sampling scheme corresponding to a
parameter configuration C = (θ₁, ..., θ_N) is quantified by the regret
R_n(C). This is the difference between the maximum expected reward that
could be achieved if C were known and the expected reward actually
achieved. We provide a lower bound for the regret associated with any
uniformly good scheme, and construct a sampling scheme which attains
the lower bound for every C. The lower bound is given explicitly in terms
of the Kullback-Leibler number between pairs of transition probabilities.
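The Kullback-Leibler number between two transition probability matrices, which the abstract says governs the regret lower bound, is the stationary-averaged relative entropy of the corresponding rows. A minimal numerical sketch (not from the paper; the function names and the two example chains are illustrative assumptions, and all matrix entries are assumed strictly positive):

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution pi of an ergodic chain: solves pi P = pi with sum(pi) = 1."""
    n = P.shape[0]
    # Stack the balance equations with the normalization constraint and solve by least squares.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def kl_number(P, Q):
    """KL number I(P, Q) = sum_x pi_P(x) sum_y P[x, y] log(P[x, y] / Q[x, y]),
    assuming every entry of P and Q is positive."""
    pi = stationary_distribution(P)
    return float(np.sum(pi[:, None] * P * np.log(P / Q)))
```

As with the ordinary KL divergence, I(P, Q) is nonnegative and vanishes when the two chains coincide.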
I. INTRODUCTION
WE study the problem of Part I of this paper [1] when the
reward statistics are Markovian and given by a
one-parameter family of stochastic transition matrices P(θ) =
