| | |
Summary: IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. AC-32, NO. 11, NOVEMBER 1987 977
Asymptotically Efficient Allocation Rules for the
Multiarmed Bandit Problem with Multiple
Plays-Part II: Markovian Rewards
Abstract-At each instant of time we are required to sample a fixed
number $m \ge 1$ out of $N$ Markov chains whose stationary transition
probability matrices belong to a family suitably parameterized by a real
number $\theta$. The objective is to maximize the long run expected value of the
samples. The learning loss of a sampling scheme corresponding to a
parameter configuration $C = (\theta_1, \ldots, \theta_N)$ is quantified by the regret
$R_n(C)$. This is the difference between the maximum expected reward that
could be achieved if $C$ were known and the expected reward actually
achieved. We provide a lower bound for the regret associated with any
uniformly good scheme, and construct a sampling scheme which attains
the lower bound for every $C$. The lower bound is given explicitly in terms
of the Kullback-Leibler number between pairs of transition probabilities.
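
As an illustration of the quantity the lower bound is stated in, the following is a minimal sketch of a Kullback-Leibler number between two transition matrices. It assumes the usual stationary-weighted form $I(P, Q) = \sum_x \pi_P(x) \sum_y P(x,y) \log\bigl(P(x,y)/Q(x,y)\bigr)$, where $\pi_P$ is the stationary distribution of $P$; the excerpt does not give the paper's own definition, so this form, the function names `stationary_distribution` and `kl_number`, and the two example matrices are assumptions made for illustration only.

```python
import math


def stationary_distribution(P, iterations=10_000, tol=1e-13):
    """Approximate the stationary distribution of P by power iteration.

    Assumes P is a row-stochastic matrix (list of lists) of an
    irreducible, aperiodic chain, so that the iteration converges.
    """
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        nxt = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def kl_number(P, Q):
    """Stationary-weighted Kullback-Leibler number between P and Q.

    Assumed form (not taken from the excerpt):
        I(P, Q) = sum_x pi_P(x) * sum_y P(x, y) * log(P(x, y) / Q(x, y))
    Returns infinity if P allows a transition that Q forbids.
    """
    pi = stationary_distribution(P)
    total = 0.0
    for x, row in enumerate(P):
        for y, p in enumerate(row):
            if p > 0.0:
                if Q[x][y] == 0.0:
                    return math.inf
                total += pi[x] * p * math.log(p / Q[x][y])
    return total


# Hypothetical two-state chains standing in for P(theta) and P(lambda).
P_theta = [[0.9, 0.1], [0.2, 0.8]]
P_lambda = [[0.7, 0.3], [0.4, 0.6]]

if __name__ == "__main__":
    print(stationary_distribution(P_theta))
    print(kl_number(P_theta, P_lambda))
```

Under this assumed form the number is zero when the two matrices coincide and positive otherwise, which is why it can measure how hard two parameter values are to tell apart from samples.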
I. INTRODUCTION
We study the problem of Part I of this paper [1] when the
reward statistics are Markovian and given by a
one-parameter family of stochastic transition matrices $P(\theta) =$
|