Stochastic multi-armed-bandit problem with non-stationary rewards

Omar Besbes, Yonatan Gur, Assaf Zeevi

Research output: Contribution to journal › Conference article › peer-review

249 Scopus citations

Abstract

In a multi-armed bandit (MAB) problem, a gambler needs to choose one of K arms at each round of play, each arm characterized by an unknown reward distribution. Reward realizations are only observed when an arm is selected, and the gambler's objective is to maximize his cumulative expected earnings over some given horizon of play T. To do this, the gambler needs to acquire information about the arms (exploration) while simultaneously optimizing immediate rewards (exploitation); the price paid due to this trade-off is often referred to as the regret, and the main question is how small this price can be as a function of the horizon length T. This problem has been studied extensively when the reward distributions do not change over time, an assumption that supports a sharp characterization of the regret yet is often violated in practical settings. In this paper, we focus on an MAB formulation which allows for a broad range of temporal uncertainties in the rewards, while still maintaining mathematical tractability. We fully characterize the (regret) complexity of this class of MAB problems by establishing a direct link between the extent of allowable reward "variation" and the minimal achievable regret, and by establishing a connection between the adversarial and the stochastic MAB frameworks.
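As an illustration of the regret notion discussed in the abstract, performance in this non-stationary setting is naturally measured against a dynamic oracle that plays the arm with the highest mean reward in every round; the notation below is illustrative and not taken verbatim from the paper:

R^\pi(T) \;=\; \sum_{t=1}^{T} \max_{1 \le k \le K} \mu_t(k) \;-\; \mathbb{E}^\pi\!\left[\sum_{t=1}^{T} \mu_t(\pi_t)\right],

where \mu_t(k) denotes the (possibly time-varying) mean reward of arm k at round t and \pi_t is the arm selected by the policy at round t. The extent of allowable reward "variation" can then be captured by a budget of the form \sum_{t=1}^{T-1} \max_{k} |\mu_{t+1}(k) - \mu_t(k)| \le V_T, and the paper's characterization links the minimal achievable regret to this variation budget.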

Original language: English
Pages (from-to): 199-207
Number of pages: 9
Journal: Advances in Neural Information Processing Systems
Volume: 1
Issue number: January
State: Published - 2014
Externally published: Yes
Event: 28th Annual Conference on Neural Information Processing Systems 2014, NIPS 2014 - Montreal, Canada
Duration: 8 Dec 2014 - 13 Dec 2014

Funding

Funder: Directorate for Computer and Information Science and Engineering
Funder number: 0964170
