A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs

Dirk van der Hoeven, Lukas Zierahn, Tal Lancewicki, Aviv Rosenberg, Nicoló Cesa-Bianchi

Research output: Contribution to journalConference articlepeer-review

Abstract

We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback. By separating the cost of delayed feedback from that of bandit feedback, our analysis allows us to obtain new results in three important settings. On the one hand, we derive the first optimal (up to logarithmic factors) regret bounds for combinatorial semi-bandits with delay and adversarial Markov decision processes with delay (and known transition functions). On the other hand, we use our analysis to derive an efficient algorithm for linear bandits with delay achieving near-optimal regret bounds. Our novel regret decomposition shows that FTRL remains stable across multiple rounds under mild assumptions on the Hessian of the regularizer.

Original languageEnglish
Pages (from-to)1285-1321
Number of pages37
JournalProceedings of Machine Learning Research
Volume195
StatePublished - 2023
Event36th Annual Conference on Learning Theory, COLT 2023 - Bangalore, India
Duration: 12 Jul 202315 Jul 2023

Keywords

  • Online learning
  • bandit feedback
  • delayed feedback

Fingerprint

Dive into the research topics of 'A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs'. Together they form a unique fingerprint.

Cite this