TY - JOUR
T1 - Distributed distillation for on-device learning
AU - Bistritz, Ilai
AU - Mann, Ariana J.
AU - Bambos, Nicholas
N1 - Publisher Copyright:
© 2020 Neural information processing systems foundation. All rights reserved.
PY - 2020
Y1 - 2020
N2 - On-device learning promises collaborative training of machine learning models across edge devices without the sharing of user data. In state-of-the-art on-device learning algorithms, devices communicate their model weights over a decentralized communication network. Transmitting model weights requires huge communication overhead and means only devices with identical model architectures can be included. To overcome these limitations, we introduce a distributed distillation algorithm where devices communicate and learn from soft-decision (softmax) outputs, which are inherently architecture-agnostic and scale only with the number of classes. The communicated soft-decisions are each model’s outputs on a public, unlabeled reference dataset, which serves as a common vocabulary between devices. We prove that the gradients of the distillation regularized loss functions of all devices converge to zero with probability 1. Hence, all devices distill the entire knowledge of all other devices on the reference data, regardless of their local connections. Our analysis assumes smooth loss functions, which can be non-convex. Simulations support our theoretical findings and show that even a naive implementation of our algorithm significantly reduces the communication overhead while achieving an overall comparable accuracy to the state-of-the-art. By requiring little communication overhead and allowing for cross-architecture training, we remove two main obstacles to scaling on-device learning.
AB - On-device learning promises collaborative training of machine learning models across edge devices without the sharing of user data. In state-of-the-art on-device learning algorithms, devices communicate their model weights over a decentralized communication network. Transmitting model weights requires huge communication overhead and means only devices with identical model architectures can be included. To overcome these limitations, we introduce a distributed distillation algorithm where devices communicate and learn from soft-decision (softmax) outputs, which are inherently architecture-agnostic and scale only with the number of classes. The communicated soft-decisions are each model’s outputs on a public, unlabeled reference dataset, which serves as a common vocabulary between devices. We prove that the gradients of the distillation regularized loss functions of all devices converge to zero with probability 1. Hence, all devices distill the entire knowledge of all other devices on the reference data, regardless of their local connections. Our analysis assumes smooth loss functions, which can be non-convex. Simulations support our theoretical findings and show that even a naive implementation of our algorithm significantly reduces the communication overhead while achieving an overall comparable accuracy to the state-of-the-art. By requiring little communication overhead and allowing for cross-architecture training, we remove two main obstacles to scaling on-device learning.
UR - http://www.scopus.com/inward/record.url?scp=85108443023&partnerID=8YFLogxK
M3 - ???researchoutput.researchoutputtypes.contributiontojournal.conferencearticle???
AN - SCOPUS:85108443023
SN - 1049-5258
VL - 2020-December
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 34th Conference on Neural Information Processing Systems, NeurIPS 2020
Y2 - 6 December 2020 through 12 December 2020
ER -