TY - JOUR
T1 - Delay and cooperation in nonstochastic bandits
AU - Cesa-Bianchi, Nicolò
AU - Gentile, Claudio
AU - Mansour, Yishay
N1 - Publisher Copyright:
© 2019 Nicolò Cesa-Bianchi, Claudio Gentile, and Yishay Mansour.
PY - 2019/2/1
Y1 - 2019/2/1
N2 - We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm, and prove that with K actions and N agents the average per-agent regret after T rounds is at most of order √((d + 1 + (K/N)α_{≤d})(T ln K)), where α_{≤d} is the independence number of the d-th power of the communication graph G. We then show that for any connected graph, for d = √K the regret bound is K^{1/4}√T, strictly better than the minimax regret √(KT) for noncooperating agents. More informed choices of d lead to bounds which are arbitrarily close to the full information minimax regret √(T ln K) when G is dense. When G has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
AB - We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm, and prove that with K actions and N agents the average per-agent regret after T rounds is at most of order √((d + 1 + (K/N)α_{≤d})(T ln K)), where α_{≤d} is the independence number of the d-th power of the communication graph G. We then show that for any connected graph, for d = √K the regret bound is K^{1/4}√T, strictly better than the minimax regret √(KT) for noncooperating agents. More informed choices of d lead to bounds which are arbitrarily close to the full information minimax regret √(T ln K) when G is dense. When G has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
KW - Cooperative multi-agent systems
KW - Distributed learning
KW - LOCAL communication
KW - Multi-armed bandits
KW - Regret minimization
UR - http://www.scopus.com/inward/record.url?scp=85072649063&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:85072649063
SN - 1532-4435
VL - 20
JO - Journal of Machine Learning Research
JF - Journal of Machine Learning Research
ER -