## Abstract

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than $d$ hops to arrive, where $d$ is a delay parameter. We introduce EXP3-COOP, a cooperative version of the EXP3 algorithm, and prove that with $K$ actions and $N$ agents the average per-agent regret after $T$ rounds is at most of order $\sqrt{\left(d + 1 + \frac{K}{N}\,\alpha_{\le d}\right)(T \ln K)}$, where $\alpha_{\le d}$ is the independence number of the $d$-th power of the communication graph $G$. We then show that for any connected graph, for $d = \sqrt{K}$ the regret bound is $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of $d$ lead to bounds which are arbitrarily close to the full information minimax regret $\sqrt{T \ln K}$ when $G$ is dense. When $G$ has sparse components, we show that a variant of EXP3-COOP, allowing agents to choose their parameters according to their centrality in $G$, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
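To make the quantities in the regret bound concrete, the following is a minimal Python sketch, not the authors' code. It computes the independence number of the $d$-th power of a small graph by brute force and then evaluates the expression inside the bound with all constants dropped. The function names (`hop_distances`, `independence_number_of_power`, `regret_bound_order`) and the adjacency-dictionary representation are assumptions made for illustration, and the exhaustive search is only practical for very small graphs.

```python
import math
from collections import deque
from itertools import combinations


def hop_distances(adj, source):
    """Breadth-first search: hop count from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def independence_number_of_power(adj, d):
    """Brute-force independence number of the d-th power of an undirected graph.

    Two distinct nodes are adjacent in the d-th power when they are at most
    d hops apart, so an independent set needs pairwise hop distance > d.
    Exponential time: intended for tiny illustrative graphs only.
    """
    nodes = list(adj)
    dist = {u: hop_distances(adj, u) for u in nodes}

    def is_independent(subset):
        return all(
            dist[u].get(v, math.inf) > d for u, v in combinations(subset, 2)
        )

    # Try the largest candidate sets first and stop at the first success.
    for size in range(len(nodes), 0, -1):
        if any(is_independent(s) for s in combinations(nodes, size)):
            return size
    return 0


def regret_bound_order(K, N, T, d, alpha):
    """Order of the bound from the abstract, constants omitted (assumption)."""
    return math.sqrt((d + 1 + (K / N) * alpha) * T * math.log(K))


# Hypothetical example: six agents on a ring, each linked to two neighbours.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
for d in range(4):
    alpha = independence_number_of_power(ring, d)
    print(d, alpha, regret_bound_order(K=16, N=6, T=10_000, d=d, alpha=alpha))
```

With $d = 0$ no messages are exchanged, the power graph has no edges, and $\alpha_{\le 0} = N$, so the expression reduces to $\sqrt{(1 + K)\,T \ln K}$, which matches the noncooperative regime up to the logarithmic factor. Increasing $d$ shrinks $\alpha_{\le d}$ but grows the additive delay term, which is the trade-off the abstract describes.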

| Original language | English |
|---|---|
| Pages (from-to) | 605-622 |
| Number of pages | 18 |
| Journal | Journal of Machine Learning Research |
| Volume | 49 |
| Issue number | June |
| State | Published - 6 Jun 2016 |
| Event | 29th Conference on Learning Theory, COLT 2016 - New York, United States. Duration: 23 Jun 2016 → 26 Jun 2016 |