| | |
Summary: Bandit game
Parameters available to the forecaster:
the number of arms (or actions) K and the number of rounds n
Unknown to the forecaster: the way the gain vectors
$g_t = (g_{1,t}, \dots, g_{K,t}) \in [0, 1]^K$ are generated
For each round $t = 1, 2, \dots, n$:
1. the forecaster chooses an arm $I_t \in \{1, \dots, K\}$
2. the forecaster receives the gain $g_{I_t,t}$
3. only $g_{I_t,t}$ is revealed to the forecaster
Cumulative regret goal: maximize the cumulative gains obtained.
More precisely, minimize
$$R_n = \max_{i=1,\dots,K} \mathbb{E} \sum_{t=1}^{n} g_{i,t} - \mathbb{E} \sum_{t=1}^{n} g_{I_t,t}$$
|
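The protocol and the regret summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gain vectors are fixed in advance (so the only randomness left in the expectations is the forecaster's own), and the names `play_bandit_game`, `uniform_forecaster`, `always_arm` and `estimate_regret` are hypothetical helpers introduced here, not part of the original material.

```python
import random


def play_bandit_game(gains, choose_arm, rng):
    """Run one game; gains[t][i] is the gain of arm i in round t (in [0, 1])."""
    history = []  # (arm, observed gain) pairs: all the forecaster ever sees
    total = 0.0
    for g_t in gains:
        arm = choose_arm(len(g_t), history, rng)  # step 1: choose I_t
        gain = g_t[arm]                           # step 2: receive g_{I_t,t}
        history.append((arm, gain))               # step 3: only that gain is revealed
        total += gain
    return total


def uniform_forecaster(K, history, rng):
    """Baseline forecaster that ignores the history and picks an arm uniformly."""
    return rng.randrange(K)


def always_arm(i):
    """Forecaster that plays the fixed arm i in every round."""
    return lambda K, history, rng: i


def estimate_regret(gains, choose_arm, runs=2000, seed=0):
    """Monte Carlo estimate of R_n, assuming the gain sequence is fixed."""
    rng = random.Random(seed)
    K = len(gains[0])
    # First term of R_n: cumulative gain of the best single arm in hindsight.
    best_fixed = max(sum(g_t[i] for g_t in gains) for i in range(K))
    # Second term of R_n: forecaster's expected cumulative gain, averaged over runs.
    average = sum(play_bandit_game(gains, choose_arm, rng) for _ in range(runs)) / runs
    return best_fixed - average


if __name__ == "__main__":
    gains = [[0.0, 1.0]] * 100  # K = 2 arms, n = 100 rounds, arm 1 always pays 1
    print(estimate_regret(gains, uniform_forecaster))
    print(estimate_regret(gains, always_arm(1)))
```

On this toy sequence the uniform forecaster earns 50 in expectation against 100 for the best arm, so its estimated regret should come out close to 50, while always playing arm 1 has zero regret. Note that the forecaster functions only receive `history`, never the full gain vector, which mirrors step 3 of the game.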