TY - JOUR

T1 - A Last Switch Dependent Analysis of Satiation and Seasonality in Bandits

AU - Laforgue, Pierre

AU - Clerici, Giulia

AU - Cesa-Bianchi, Nicolò

AU - Gilad-Bachrach, Ran

N1 - Publisher Copyright: Copyright © 2022 by the author(s)

PY - 2022

Y1 - 2022

N2 - Motivated by the fact that humans like some level of unpredictability or novelty, and might therefore get quickly bored when interacting with a stationary policy, we introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions. Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function. This enables the modeling of phenomena such as progressive satiation and periodic behaviours. Building upon the Combinatorial Semi-Bandits (CSB) framework, we design an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy (which is NP-hard to compute). Similarly to previous works, our regret analysis is based on defining and solving an appropriate trade-off between approximation and estimation. Preliminary experiments confirm the superiority of our algorithm over both the oracle greedy approach and a vanilla CSB solver.

UR - http://www.scopus.com/inward/record.url?scp=85131659005&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85131659005

SN - 2640-3498

VL - 151

SP - 971

EP - 990

JO - Proceedings of Machine Learning Research

JF - Proceedings of Machine Learning Research

T2 - 25th International Conference on Artificial Intelligence and Statistics, AISTATS 2022

Y2 - 28 March 2022 through 30 March 2022

ER -