Why do Larger Models Generalize Better? A Theoretical Perspective via the XOR Problem

Abstract
Empirical evidence suggests that neural networks with ReLU activations generalize better with overparameterization. However, there is currently no theoretical analysis that explains this observation. In this work, we provide theoretical and empirical evidence that, in certain cases, overparameterized convolutional networks generalize better than small networks because of an interplay between weight clustering and feature exploration at initialization. We demonstrate this theoretically for a 3-layer convolutional neural network with max-pooling, in a novel setting that extends the XOR problem. We show that this interplay implies that with overparameterization, gradient descent converges to global minima with better generalization performance compared to global minima of small networks. Empirically, we demonstrate these phenomena for a 3-layer convolutional neural network on the MNIST task.
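To make the architecture described above concrete, the following is a minimal PyTorch sketch, not the authors' code, of a three-layer network of this kind: a convolutional layer with ReLU activations, max-pooling over patch positions, and a linear readout. The class name, channel counts, patch size, and the ±1 input encoding are illustrative assumptions rather than details taken from the paper.

```python
# A minimal sketch (illustrative only) of a 3-layer ReLU convolutional network
# with max-pooling of the kind the abstract describes. All names and sizes here
# are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    def __init__(self, channels: int = 4):
        super().__init__()
        # Layer 1: convolution over non-overlapping length-2 patches of a 1-D input.
        self.conv = nn.Conv1d(1, channels, kernel_size=2, stride=2, bias=False)
        # Layer 2: ReLU followed by max-pooling over all patch positions.
        self.act = nn.ReLU()
        self.pool = nn.AdaptiveMaxPool1d(1)
        # Layer 3: linear readout producing a single logit.
        self.readout = nn.Linear(channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, d) with entries in {-1, +1}; returns (batch,) logits.
        h = self.pool(self.act(self.conv(x))).squeeze(-1)
        return self.readout(h).squeeze(-1)

if __name__ == "__main__":
    # Overparameterization in this sketch means raising `channels` well beyond
    # the minimum needed to fit the data, then training with gradient descent.
    x = torch.randint(0, 2, (8, 1, 16)).float() * 2 - 1  # random ±1 inputs
    small, large = SmallConvNet(channels=2), SmallConvNet(channels=64)
    print(small(x).shape, large(x).shape)  # torch.Size([8]) torch.Size([8])
```

In this sketch, comparing the small and large instantiations under identical training mirrors the paper's comparison of small versus overparameterized networks; the specific data distribution extending the XOR problem is not reproduced here.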
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 822-830 |
| Number of pages | 9 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 97 |
| State | Published - 2019 |
| Event | 36th International Conference on Machine Learning, ICML 2019, Long Beach, United States, 9-15 Jun 2019 |
Funding
| Funders |
|---|
| Blavatnik Computer Science Research Fund |
| Yandex Initiative in Machine Learning |