TY - JOUR
T1 - Revisiting the Noise Model of Stochastic Gradient Descent
AU - Battash, Barak
AU - Wolf, Lior
AU - Lindenbaum, Ofir
N1 - Publisher Copyright:
Copyright 2024 by the author(s).
PY - 2024
Y1 - 2024
AB - The effectiveness of stochastic gradient descent (SGD) in neural network optimization is significantly influenced by stochastic gradient noise (SGN). Following the central limit theorem, SGN was initially described as Gaussian, but Simsekli et al. (2019) demonstrated that the symmetric α-stable (SαS) Lévy distribution provides a better fit for the SGN. This claim was subsequently contested, with later work reverting to the previously proposed Gaussian noise model. This study provides robust, comprehensive empirical evidence that SGN is heavy-tailed and is better represented by the SαS distribution. Our experiments span several datasets and multiple models, both discriminative and generative. Furthermore, we argue that different network parameters exhibit distinct SGN properties. We develop a novel framework based on a Lévy-driven stochastic differential equation (SDE), in which a one-dimensional Lévy process describes each parameter. This yields a more accurate characterization of the dynamics of SGD around local minima. We use our framework to study SGD properties near local minima, including the mean escape time and preferred exit directions.
UR - http://www.scopus.com/inward/record.url?scp=85194186807&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85194186807
SN - 2640-3498
VL - 238
SP - 4780
EP - 4788
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
T2 - 27th International Conference on Artificial Intelligence and Statistics, AISTATS 2024
Y2 - 2 May 2024 through 4 May 2024
ER -