Chasing ghosts: Competing with stateful policies

Uriel Feige, Tomer Koren, Moshe Tennenholtz

Research output: Contribution to journal › Article › peer-review


Abstract

We consider sequential decision making in a setting where regret is measured with respect to a set of stateful reference policies, and feedback is limited to observing the rewards of the actions performed (the so-called bandit setting). If either the reference policies are stateless rather than stateful or the feedback includes the rewards of all actions (the so-called experts setting), previous work shows that the optimal regret grows like Θ(√T) in terms of the number of decision rounds T. The difficulty in our setting is that the decision maker unavoidably loses track of the internal states of the reference policies and thus cannot reliably attribute rewards observed in a certain round to any of the reference policies. In fact, in this setting it is impossible for the algorithm to estimate which policy gives the highest (or even approximately highest) total reward. Nevertheless, we design an algorithm that achieves expected regret that is sublinear in T, of the form O(T/log^{1/4} T). Our algorithm is based on a certain local repetition lemma that may be of independent interest. We also show that no algorithm can guarantee expected regret better than O(T/log^{3/2} T).
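For readers unfamiliar with the setting, the following minimal Python sketch (class and variable names are illustrative and not taken from the paper) simulates the interaction protocol described above: a stateful reference policy's state transition is driven by the reward of its own action, but under bandit feedback the learner observes only the reward of the action it actually played, so in general it cannot perform that update and loses track of the policy's internal state.

```python
import random


class StatefulReferencePolicy:
    """Toy stateful reference policy: its action is a function of an internal
    state, and the state transition depends on the reward of the action the
    policy itself plays. Purely illustrative; not the paper's construction."""

    def __init__(self, num_actions):
        self.num_actions = num_actions
        self.state = 0

    def act(self):
        # The policy's action is determined by its current internal state.
        return self.state % self.num_actions

    def update(self, own_reward):
        # The next state depends on the reward of the policy's OWN action.
        self.state = (self.state + (1 if own_reward > 0.5 else 2)) % 17


def run_protocol(T=5, num_actions=3, seed=0):
    rng = random.Random(seed)
    policy = StatefulReferencePolicy(num_actions)
    for t in range(T):
        rewards = [rng.random() for _ in range(num_actions)]  # adversary's hidden reward vector
        ghost_action = policy.act()
        learner_action = rng.randrange(num_actions)           # placeholder learner strategy
        observed = rewards[learner_action]                    # bandit feedback: only this reward is seen

        # The reference policy's true state evolves with the reward of ITS action.
        policy.update(rewards[ghost_action])

        # The learner can mirror that update only in rounds where it happened to
        # play the same action; otherwise rewards[ghost_action] is unobserved and
        # the learner loses track of the policy's internal state (the "ghost").
        can_track = (learner_action == ghost_action)
        print(f"round {t}: played {learner_action}, observed {observed:.2f}, "
              f"can track reference state: {can_track}")


if __name__ == "__main__":
    run_protocol()
```

In the experts setting the full reward vector would be revealed each round, so the learner could always apply the policy's state update itself; it is the bandit restriction that makes the reference policies' states unrecoverable.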

Original language: English
Pages (from-to): 190-223
Number of pages: 34
Journal: SIAM Journal on Computing
Volume: 46
Issue number: 1
DOIs
State: Published - 2017
Externally published: Yes

Funding

Funder | Funder number
Microsoft Research |
Israel Science Foundation | 621/12
Planning and Budgeting Committee of the Council for Higher Education of Israel | 4/11

Keywords

• Bandit feedback
• Multiarmed bandit
• Online learning
• Sequential decision making
• Stateful policies
