TY - GEN

T1 - Adversarial Dueling Bandits

AU - Saha, Aadirupa

AU - Koren, Tomer

AU - Mansour, Yishay

N1 - Publisher Copyright: Copyright © 2021 by the author(s)

PY - 2021

Y1 - 2021

AB - We introduce the problem of regret minimization in Adversarial Dueling Bandits. As in classic Dueling Bandits, the learner has to repeatedly choose a pair of items and observe only a relative binary 'win-loss' feedback for this pair, but here this feedback is generated from an arbitrary preference matrix, possibly chosen adversarially. Our main result is an algorithm whose T-round regret compared to the Borda-winner from a set of K items is Õ(K^{1/3} T^{2/3}), as well as a matching Ω(K^{1/3} T^{2/3}) lower bound. We also prove a similar high probability regret bound. We further consider a simpler fixed-gap adversarial setup, which bridges between two extreme preference feedback models for dueling bandits: stationary preferences and an arbitrary sequence of preferences. For the fixed-gap adversarial setup we give an Õ((K/∆^2) log T) regret algorithm, where ∆ is the gap in Borda scores between the best item and all other items, and show a lower bound of Ω(K/∆^2) indicating that our dependence on the main problem parameters K and ∆ is tight (up to logarithmic factors). Finally, we corroborate the theoretical results with empirical evaluations.

UR - http://www.scopus.com/inward/record.url?scp=85161268333&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85161268333

T3 - Proceedings of Machine Learning Research

SP - 9235

EP - 9244

BT - Proceedings of the 38th International Conference on Machine Learning, ICML 2021

PB - ML Research Press

T2 - 38th International Conference on Machine Learning, ICML 2021

Y2 - 18 July 2021 through 24 July 2021

ER -