Summary: IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. AC-32, NO. 11, NOVEMBER 1987, p. 977

Asymptotically Efficient Allocation Rules for the Multiarmed Bandit Problem with Multiple Plays - Part II: Markovian Rewards
Abstract- At each instant of time we are required to sample a fixed number m >= 1 out of N Markov chains whose stationary transition probability matrices belong to a family suitably parameterized by a real number θ. The objective is to maximize the long-run expected value of the samples. The learning loss of a sampling scheme corresponding to a parameter configuration C = (θ_1, ..., θ_N) is quantified by the regret R_n(C). This is the difference between the maximum expected reward that could be achieved if C were known and the expected reward actually achieved. We provide a lower bound for the regret associated with any uniformly good scheme, and construct a sampling scheme which attains the lower bound for every C. The lower bound is given explicitly in terms of the Kullback-Leibler number between pairs of transition probabilities.
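
Spelled out, the regret and the lower bound announced above take roughly the following form (a sketch in notation carried over from Part I; the exact symbols are an assumption, since only the abstract is reproduced here):

```latex
% Regret through time n under configuration C: the reward of always
% playing the m chains with the largest stationary mean rewards,
% minus the expected reward the scheme actually collects.
R_n(C) = n \sum_{j=1}^{m} \mu(\theta_{(j)}) - E_C\Big[ \sum_{t=1}^{n} X_t \Big]

% Lower bound for any uniformly good scheme, where I(\theta, \lambda)
% is the Kullback--Leibler number between the transition laws
% P(\theta) and P(\lambda):
\liminf_{n \to \infty} \frac{R_n(C)}{\log n}
  \ge \sum_{j \,:\, \mu(\theta_j) < \mu(\theta_{(m)})}
      \frac{\mu(\theta_{(m)}) - \mu(\theta_j)}{I(\theta_j, \theta_{(m)})}
```

Here \mu(\theta_{(1)}) \ge ... \ge \mu(\theta_{(N)}) orders the stationary mean rewards of the N chains and X_t is the total reward collected at instant t.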
I. INTRODUCTION
WE study the problem of Part I of this paper [1] when the reward statistics are Markovian and given by a one-parameter family of stochastic transition matrices P(θ) =
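
The excerpt is cut off above, but the central quantity is concrete enough to sketch. Below is a minimal Python illustration of the Kullback-Leibler number between two stationary transition probability matrices, under the standard definition (the helper names and the two-state example family P(theta) are hypothetical, not the paper's):

```python
import numpy as np

def kl_number(P_theta: np.ndarray, P_lam: np.ndarray) -> float:
    """Kullback-Leibler number I(theta, lambda): the expected per-step
    log-likelihood ratio of the chain P_theta against P_lam,

        I = sum_x pi(x) sum_y P_theta[x, y] * log(P_theta[x, y] / P_lam[x, y]),

    where pi is the stationary distribution of P_theta. Assumes
    P_lam[x, y] > 0 wherever P_theta[x, y] > 0, so I is finite.
    """
    # Stationary distribution: left eigenvector of P_theta for eigenvalue 1.
    evals, evecs = np.linalg.eig(P_theta.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()

    # Expected log-likelihood ratio of a single transition under P_theta.
    log_ratio = np.zeros_like(P_theta)
    mask = P_theta > 0
    log_ratio[mask] = np.log(P_theta[mask] / P_lam[mask])
    return float(np.sum(pi[:, None] * P_theta * log_ratio))

# Illustrative two-state one-parameter family (not the paper's):
def P(theta: float) -> np.ndarray:
    return np.array([[1.0 - theta, theta],
                     [theta, 1.0 - theta]])

print(kl_number(P(0.3), P(0.6)))  # strictly positive since 0.3 != 0.6
```

This is the quantity the lower bound divides by: the harder two parameter values are to distinguish (small I), the more samples any uniformly good scheme must spend on the inferior chain.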

  

Source: Anantharam, Venkat - Department of Electrical Engineering and Computer Sciences, University of California at Berkeley

 

Collections: Engineering