Consider N cooperative agents such that, for T turns, each agent n takes an action a_n and receives a stochastic reward r_n(a_1, ..., a_N). Agents cannot observe the actions of other agents and do not even know their own reward functions. The agents can communicate with their neighbors on a connected graph G with diameter d(G). We want each agent n to achieve an expected average reward of at least λ_n over time, for a given quality of service (QoS) vector λ. A QoS vector λ is not necessarily achievable. By giving up on immediate reward, knowing that the other agents will compensate later, agents can improve their achievable capacity region. Our main observation is that the gap between λ_n t and the accumulated reward of agent n, which we call the QoS regret, behaves like a queue. Inspired by this observation, we propose a distributed algorithm that aims to learn a max-weight matching of agents to actions. In each epoch, the algorithm employs a consensus phase in which the agents agree on a certain weighted sum of rewards by communicating only O(d(G)) numbers every turn. Then, the algorithm uses distributed successive elimination on a random subset of action profiles to approximately maximize this weighted sum of rewards. We prove a bound on the accumulated sum of expected QoS regrets of all agents that holds if λ is a safety margin ε_T away from the boundary of the capacity region, where ε_T → 0 as T → ∞. This bound implies that, for large T, our algorithm can achieve any λ in the interior of the dynamic capacity region, while all agents are guaranteed an empirical average expected QoS regret of Õ(1) over t = 1, ..., T, which never exceeds (Equation presented) for any t. We then extend our result to time-varying i.i.d. communication graphs.
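
The queue-like behavior of the QoS regret can be illustrated with a small simulation. The sketch below is a centralized toy under strong simplifying assumptions, not the distributed algorithm of the paper: the mean rewards are assumed known, so the consensus and successive-elimination phases are omitted entirely. All names (`qos_regret_update`, `max_weight_profile`, `simulate`, `LAM`, `MEAN_REWARD`) and the two-agent reward table are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical two-agent example: each agent picks action 0 or 1.
# Entry (a_1, a_2) -> mean Bernoulli reward of (agent 1, agent 2).
# No single profile gives both agents 0.4, but alternating between
# (0, 1) and (1, 0) gives each agent 0.45 on average.
MEAN_REWARD = {
    (0, 0): (0.1, 0.1),
    (0, 1): (0.9, 0.0),
    (1, 0): (0.0, 0.9),
    (1, 1): (0.1, 0.1),
}
LAM = (0.4, 0.4)  # assumed QoS vector, one target per agent


def qos_regret_update(regret, lam, rewards):
    """One turn of the queue-like recursion: lam_n arrives, r_n is served."""
    return [q + l - r for q, l, r in zip(regret, lam, rewards)]


def max_weight_profile(weights, mean_reward, profiles):
    """Return the profile maximizing sum_n weights[n] * mean reward of agent n."""
    def weighted_sum(profile):
        return sum(w * m for w, m in zip(weights, mean_reward[profile]))
    return max(profiles, key=weighted_sum)


def simulate(lam, mean_reward, turns, seed=0):
    """Play the max-weight profile every turn and return the final QoS regrets."""
    rng = random.Random(seed)
    profiles = sorted(mean_reward)
    regret = [0.0] * len(lam)
    for _ in range(turns):
        # Agents with a larger positive backlog get a larger weight.
        weights = [max(q, 0.0) for q in regret]
        profile = max_weight_profile(weights, mean_reward, profiles)
        # Draw one Bernoulli reward per agent for the chosen profile.
        rewards = [1.0 if rng.random() < m else 0.0 for m in mean_reward[profile]]
        regret = qos_regret_update(regret, lam, rewards)
    return regret


if __name__ == "__main__":
    turns = 5000
    final = simulate(LAM, MEAN_REWARD, turns, seed=1)
    print([q / turns for q in final])  # per-turn QoS regret of each agent
```

In this toy setting, always playing (0, 1) would leave the second agent with a QoS regret that grows by 0.4 every turn, whereas weighting agents by their backlog makes them take turns, which is the sense in which giving up immediate reward enlarges the achievable region.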