TY - JOUR

T1 - No Weighted-Regret Learning in Adversarial Bandits with Delays

AU - Bistritz, Ilai

AU - Zhou, Zhengyuan

AU - Chen, Xi

AU - Bambos, Nicholas

AU - Blanchet, Jose

N1 - Publisher Copyright:
© 2022 Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, Jose Blanchet.

PY - 2022/3/1

Y1 - 2022/3/1

N2 - Consider a scenario where a player chooses an action in each round t out of T rounds and observes the incurred cost after a delay of d_t rounds. The cost functions and the delay sequence are chosen by an adversary. We show that in a non-cooperative game, the expected weighted ergodic distribution of play converges to the set of coarse correlated equilibria if players use algorithms that have "no weighted-regret" in the above scenario, even if they have linear regret due to too large delays. For a two-player zero-sum game, we show that no weighted-regret is sufficient for the weighted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm with n dimensions achieves an expected regret of O(nT^{3/4} + √n T^{1/3} D^{1/3}) and the EXP3 algorithm with K arms achieves an expected regret of O(√(log K (KT + D))) even when D = Σ_{t=1}^{T} d_t and T are unknown. These bounds use a novel doubling trick that, under mild assumptions, provably retains the regret bound for the case where D and T are known. Using these bounds, we show that FKM and EXP3 have no weighted-regret even for d_t = O(t log t). Therefore, algorithms with no weighted-regret can be used to approximate a CCE of a finite or convex unknown game that can only be simulated with bandit feedback, even if the simulation involves significant delays.

AB - Consider a scenario where a player chooses an action in each round t out of T rounds and observes the incurred cost after a delay of d_t rounds. The cost functions and the delay sequence are chosen by an adversary. We show that in a non-cooperative game, the expected weighted ergodic distribution of play converges to the set of coarse correlated equilibria if players use algorithms that have "no weighted-regret" in the above scenario, even if they have linear regret due to too large delays. For a two-player zero-sum game, we show that no weighted-regret is sufficient for the weighted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm with n dimensions achieves an expected regret of O(nT^{3/4} + √n T^{1/3} D^{1/3}) and the EXP3 algorithm with K arms achieves an expected regret of O(√(log K (KT + D))) even when D = Σ_{t=1}^{T} d_t and T are unknown. These bounds use a novel doubling trick that, under mild assumptions, provably retains the regret bound for the case where D and T are known. Using these bounds, we show that FKM and EXP3 have no weighted-regret even for d_t = O(t log t). Therefore, algorithms with no weighted-regret can be used to approximate a CCE of a finite or convex unknown game that can only be simulated with bandit feedback, even if the simulation involves significant delays.

KW - Adversarial Bandits

KW - Delays

KW - Non-cooperative Games

KW - Online Learning

UR - http://www.scopus.com/inward/record.url?scp=85131435554&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:85131435554

SN - 1532-4435

VL - 23

JO - Journal of Machine Learning Research

JF - Journal of Machine Learning Research

ER -