Bandits with switching costs: T^{2/3} regret

Ofer Dekel, Jian Ding, Tomer Koren, Yuval Peres

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

67 Scopus citations

Abstract

We study the adversarial multi-armed bandit problem in a setting where the player incurs a unit cost each time he switches actions. We prove that the player's T-round minimax regret in this setting is Θ̃(T^{2/3}), thereby closing a fundamental gap in our understanding of learning with bandit feedback. In the corresponding full-information version of the problem, the minimax regret is known to grow at a much slower rate of Θ(√T). The difference between these two rates provides the first indication that learning with bandit feedback can be significantly harder than learning with full-information feedback (previous results only showed a different dependence on the number of actions, but not on T). In addition to characterizing the inherent difficulty of the multi-armed bandit problem with switching costs, our results also resolve several other open problems in online learning. One direct implication is that learning with bandit feedback against bounded-memory adaptive adversaries has a minimax regret of Θ̃(T^{2/3}). Another implication is that the minimax regret of online learning in adversarial Markov decision processes (MDPs) is Θ̃(T^{2/3}). The key to all of our results is a new randomized construction of a multi-scale random walk, which is of independent interest and likely to prove useful in additional settings.
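To make the abstract's central construction concrete, here is a minimal, illustrative sketch of one multi-scale random walk: each value W[t] equals the value at a parent index rho(t) plus fresh Gaussian noise, with rho(t) = t - 2^{delta(t)} and delta(t) the number of trailing zeros of t. This is a simplified stand-in, not necessarily the paper's exact randomized construction; the function names and the noise scale sigma are assumptions for the example. The key structural property is that every index's ancestor chain has length O(log T), so the walk carries correlations at many time scales.

```python
import random


def trailing_zeros(t: int) -> int:
    # Index of the lowest set bit of t, i.e. the largest i with 2**i | t.
    return (t & -t).bit_length() - 1


def multi_scale_walk(T: int, sigma: float, seed: int = 0) -> list:
    """Illustrative multi-scale random walk on rounds 0..T.

    W[0] = 0, and for t >= 1:
        W[t] = W[rho(t)] + xi_t,   xi_t ~ N(0, sigma^2) i.i.d.,
    where the parent is rho(t) = t - 2**trailing_zeros(t). Stripping
    the lowest set bit at each step means the ancestor chain of t has
    length popcount(t) <= log2(T) + 1.
    """
    rng = random.Random(seed)
    W = [0.0] * (T + 1)
    for t in range(1, T + 1):
        parent = t - (1 << trailing_zeros(t))  # equals t & (t - 1)
        W[t] = W[parent] + rng.gauss(0.0, sigma)
    return W


walk = multi_scale_walk(1024, sigma=0.1)
```

In the paper's lower-bound argument, a walk of this flavor is used to define slowly drifting loss sequences that a switching-limited bandit player cannot track without paying Θ̃(T^{2/3}) regret.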

Original language: English
Title of host publication: STOC 2014 - Proceedings of the 2014 ACM Symposium on Theory of Computing
Publisher: Association for Computing Machinery
Pages: 459-467
Number of pages: 9
ISBN (Print): 9781450327107
DOIs
State: Published - 2014
Externally published: Yes
Event: 46th Annual ACM Symposium on Theory of Computing, STOC 2014 - New York, NY, United States
Duration: 31 May 2014 – 3 Jun 2014

Publication series

Name: Proceedings of the Annual ACM Symposium on Theory of Computing
ISSN (Print): 0737-8017

Conference

Conference: 46th Annual ACM Symposium on Theory of Computing, STOC 2014
Country/Territory: United States
City: New York, NY
Period: 31/05/14 – 3/06/14

Funding

Funders: Microsoft Research

Keywords

• Lower bounds
• Multi-armed bandit
• Online learning
• Switching costs
