Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation

Uri Sherman*, Tomer Koren, Yishay Mansour

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review


Abstract

We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full-information feedback or exploratory conditions. We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror descent and least-squares policy evaluation in an auxiliary MDP used to compute exploration bonuses. Our algorithm obtains an $\widetilde{O}(K^{6/7})$ regret bound, improving significantly over the previous state-of-the-art of $\widetilde{O}(K^{14/15})$ in this setting. In addition, we present a version of the same algorithm under the assumption that a simulator of the environment is available to the learner (but otherwise no exploratory assumptions are made), and prove that it obtains a state-of-the-art regret of $\widetilde{O}(K^{2/3})$.
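The abstract only names the algorithm's ingredients. As a rough illustration of how they fit together, below is a minimal Python/NumPy sketch of one bandit-feedback round combining ridge-based least-squares value estimation, an elliptical exploration bonus, and a KL mirror-descent (exponentiated-gradient) policy update. This is not the authors' algorithm; the single-state toy setup and all names (phi, lam, beta, eta) are assumptions made purely for illustration.

```python
# Illustrative sketch only (not the paper's algorithm): one round of
# mirror-descent policy optimization with least-squares cost estimation
# and an elliptical exploration bonus under linear function approximation.
import numpy as np

def least_squares_q(phi, costs, lam=1.0):
    """Ridge regression of observed costs onto features:
    w = (Phi^T Phi + lam*I)^{-1} Phi^T c."""
    d = phi.shape[1]
    cov = phi.T @ phi + lam * np.eye(d)   # regularized empirical covariance
    w = np.linalg.solve(cov, phi.T @ costs)
    return w, cov

def exploration_bonus(phi, cov, beta=1.0):
    """Elliptical bonus beta * sqrt(phi^T cov^{-1} phi), per feature row;
    it is large in directions where little data has been observed."""
    cov_inv = np.linalg.inv(cov)
    return beta * np.sqrt(np.einsum('id,de,ie->i', phi, cov_inv, phi))

def mirror_descent_update(pi, q_hat, eta=0.1):
    """KL mirror-descent (exponentiated-gradient) step: with costs,
    the policy shifts probability away from high-estimated-cost actions."""
    logits = np.log(pi) - eta * q_hat
    logits -= logits.max()                # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()

# Toy usage: 3 actions in a single state, d = 4 features per action.
rng = np.random.default_rng(0)
phi = rng.normal(size=(3, 4))             # one feature vector per action
costs = rng.uniform(size=3)               # bandit cost observations
w, cov = least_squares_q(phi, costs)
q_hat = phi @ w - exploration_bonus(phi, cov, beta=0.5)  # optimism: lower cost where uncertain
pi = np.full(3, 1.0 / 3)                  # start from the uniform policy
pi = mirror_descent_update(pi, q_hat)
print(pi)
```

In the paper's setting the bonuses are computed via policy evaluation in an auxiliary MDP rather than the direct per-round quadratic form shown here; the sketch is only meant to convey the interplay of the three components.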

Original language: English
Pages (from-to): 31117-31150
Number of pages: 34
Journal: Proceedings of Machine Learning Research
Volume: 202
State: Published - 2023
Event: 40th International Conference on Machine Learning, ICML 2023 - Honolulu, United States
Duration: 23 Jul 2023 - 29 Jul 2023

Funding

Funders (funder number where applicable):
Yandex Initiative in Machine Learning
Horizon 2020 Framework Programme
Blavatnik Family Foundation
European Research Council
Israel Science Foundation: 2549/19, 993/17
Tel Aviv University
Horizon 2020: 882396
