## Abstract

In a multi-armed bandit problem, a gambler needs to choose at each round one of K arms, each characterized by an unknown reward distribution. The objective is to maximize cumulative expected earnings over a planning horizon of length T, and performance is measured in terms of regret relative to a (static) oracle that knows the identity of the best arm a priori. This problem has been studied extensively when the reward distributions do not change over time, and uncertainty essentially amounts to identifying the optimal arm. We complement this literature by developing a flexible non-parametric model for temporal uncertainty in the rewards. The extent of temporal uncertainty is measured via the cumulative mean change in the rewards over the horizon, a metric we refer to as temporal variation, and regret is measured relative to a (dynamic) oracle that plays the point-wise optimal action at each period. Assuming that nature can choose any sequence of mean rewards such that their temporal variation does not exceed V (a temporal uncertainty budget), we characterize the complexity of this problem via the minimax regret, which depends on V (the hardness of the problem), the horizon length T, and the number of arms K.
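The two quantities the abstract relies on, temporal variation and regret against a dynamic oracle, can be illustrated with a minimal Python sketch. It assumes the mean rewards are available as a T-by-K table `mu` (hypothetical name), and it takes "cumulative mean change" to be the sum over consecutive rounds of the largest per-arm change in mean reward. That reading is an assumption, since the abstract does not state the exact formula. The function names are illustrative and not taken from the paper.

```python
from typing import Sequence

Means = Sequence[Sequence[float]]  # mu[t][k]: mean reward of arm k at round t


def temporal_variation(mu: Means) -> float:
    """Sum over consecutive rounds of the largest per-arm change in mean reward.

    Assumed form of the variation metric; the budget V would upper-bound it.
    """
    return sum(
        max(abs(nxt - cur) for cur, nxt in zip(mu[t], mu[t + 1]))
        for t in range(len(mu) - 1)
    )


def dynamic_regret(mu: Means, actions: Sequence[int]) -> float:
    """Expected loss relative to an oracle that plays the best arm at each round."""
    return sum(max(row) - row[a] for row, a in zip(mu, actions))


def static_regret(mu: Means, actions: Sequence[int]) -> float:
    """Expected loss relative to the single best fixed arm over the horizon."""
    num_arms = len(mu[0])
    best_fixed = max(sum(row[k] for row in mu) for k in range(num_arms))
    return best_fixed - sum(row[a] for row, a in zip(mu, actions))


# Two arms whose means swap halfway through a horizon of T = 4 rounds.
mu_example = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
always_arm_0 = [0, 0, 0, 0]

variation = temporal_variation(mu_example)
regret_dynamic = dynamic_regret(mu_example, always_arm_0)
regret_static = static_regret(mu_example, always_arm_0)
```

In this toy instance, always playing arm 0 matches the best fixed arm, so its static regret is essentially zero, yet it loses 0.8 per round to the dynamic oracle after the swap. This gap is why the dynamic oracle is the more demanding benchmark when rewards change over time.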

Original language | English |
---|---|
Pages (from-to) | 319-337 |
Number of pages | 19 |
Journal | Stochastic Systems |
Volume | 9 |
Issue number | 4 |
State | Published - Dec 2019 |
Externally published | Yes |

## Keywords

- dynamic oracle
- dynamic regret
- exploration/exploitation
- minimax regret
- multi-armed bandit
- nonstationary