The blinded bandit: Learning with adaptive feedback

Ofer Dekel, Elad Hazan, Tomer Koren

Research output: Contribution to journalConference articlepeer-review

3 Scopus citations

Abstract

We study an online learning setting where the player is temporarily deprived of feedback each time it switches to a different action. Such a model of adaptive feedback naturally occurs in scenarios where the environment reacts to the player's actions and requires some time to recover and stabilize after the algorithm switches actions. This motivates a variant of the multi-armed bandit problem, which we call the blinded multi-armed bandit, in which no feedback is given to the algorithm whenever it switches arms. We develop efficient online learning algorithms for this problem and prove that they guarantee the same asymptotic regret as the optimal algorithms for the standard multi-armed bandit problem. This result stands in stark contrast to another recent result, which states that adding a switching cost to the standard multi-armed bandit makes it substantially harder to learn, and provides a direct comparison of how feedback and loss contribute to the difficulty of an online learning problem. We also extend our results to the general prediction framework of bandit linear optimization, again attaining near-optimal regret bounds.

Original languageEnglish
Pages (from-to)1610-1618
Number of pages9
JournalAdvances in Neural Information Processing Systems
Volume2
Issue numberJanuary
StatePublished - 2014
Externally publishedYes
Event28th Annual Conference on Neural Information Processing Systems 2014, NIPS 2014 - Montreal, Canada
Duration: 8 Dec 201413 Dec 2014

Fingerprint

Dive into the research topics of 'The blinded bandit: Learning with adaptive feedback'. Together they form a unique fingerprint.

Cite this