TY - JOUR

T1 - Online EXP3 learning in adversarial bandits with delayed feedback

AU - Bistritz, Ilai

AU - Zhou, Zhengyuan

AU - Chen, Xi

AU - Bambos, Nicholas

AU - Blanchet, Jose

N1 - Publisher Copyright:
© 2019 Neural information processing systems foundation. All rights reserved.

PY - 2019

Y1 - 2019

N2 - Consider a player that in each of T rounds chooses one of K arms. An adversary chooses the cost of each arm in a bounded interval, and a sequence of feedback delays {d_t} that are unknown to the player. After picking arm a_t at round t, the player receives the cost of playing this arm d_t rounds later. In cases where t + d_t > T, this feedback is simply missing. We prove that the EXP3 algorithm (that uses the delayed feedback upon its arrival) achieves a regret of O (equation presented). For the case where the total delay ∑_{t=1}^{T} d_t and the horizon T are unknown, we propose a novel doubling trick for online learning with delays and prove that this adaptive EXP3 achieves a regret of O (equation presented). We then consider a two-player zero-sum game where players experience asynchronous delays. We show that even when the delays are large enough such that players no longer enjoy the “no-regret property” (e.g., where d_t = O(t log t)), the ergodic average of the strategy profile still converges to the set of Nash equilibria of the game. The result is made possible by choosing an adaptive step size η_t that is not summable but is square summable, and by proving a “weighted regret bound” for this general case.

AB - Consider a player that in each of T rounds chooses one of K arms. An adversary chooses the cost of each arm in a bounded interval, and a sequence of feedback delays {d_t} that are unknown to the player. After picking arm a_t at round t, the player receives the cost of playing this arm d_t rounds later. In cases where t + d_t > T, this feedback is simply missing. We prove that the EXP3 algorithm (that uses the delayed feedback upon its arrival) achieves a regret of O (equation presented). For the case where the total delay ∑_{t=1}^{T} d_t and the horizon T are unknown, we propose a novel doubling trick for online learning with delays and prove that this adaptive EXP3 achieves a regret of O (equation presented). We then consider a two-player zero-sum game where players experience asynchronous delays. We show that even when the delays are large enough such that players no longer enjoy the “no-regret property” (e.g., where d_t = O(t log t)), the ergodic average of the strategy profile still converges to the set of Nash equilibria of the game. The result is made possible by choosing an adaptive step size η_t that is not summable but is square summable, and by proving a “weighted regret bound” for this general case.

UR - http://www.scopus.com/inward/record.url?scp=85090174942&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85090174942

SN - 1049-5258

VL - 32

JO - Advances in Neural Information Processing Systems

JF - Advances in Neural Information Processing Systems

T2 - 33rd Annual Conference on Neural Information Processing Systems, NeurIPS 2019

Y2 - 8 December 2019 through 14 December 2019

ER -