Viewing Transformers Through the Lens of Long Convolutions Layers

Itamar Zimerman*, Lior Wolf*

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

Abstract

Despite their dominance in modern deep learning and, in particular, NLP, transformer architectures exhibit sub-optimal performance on long-range tasks compared to recent layers that are specifically designed for this purpose. In this work, drawing inspiration from key attributes of long-range layers, such as state-space layers, linear RNN layers, and global convolution layers, we demonstrate that minimal modifications to the transformer architecture can significantly enhance performance on the Long Range Arena (LRA) benchmark, thus narrowing the gap with these specialized layers. We identify two key principles for long-range tasks: (i) incorporating an inductive bias towards smoothness, and (ii) locality. As we show, integrating these ideas into the attention mechanism improves results with a negligible amount of additional computation and without any additional trainable parameters. Our theory and experiments also shed light on the reasons for the inferior performance of transformers on long-range tasks and identify critical properties that are essential for successfully capturing long-range dependencies.
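The record does not include code, so the following minimal PyTorch sketch only illustrates how the two stated principles could, in principle, be injected into standard attention without new trainable parameters: smoothness via a fixed average-pooling of the attention scores, and locality via a non-learned distance-decay penalty. The function name `smooth_local_attention` and the `window` and `decay` settings are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch, not the paper's code: one way to add (i) a smoothness
# bias and (ii) a locality bias to vanilla attention with no extra parameters.
import math
import torch
import torch.nn.functional as F


def smooth_local_attention(q, k, v, window=3, decay=0.05):
    """q, k, v: (batch, heads, seq_len, head_dim); window should be odd."""
    b, h, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)            # (b, h, n, n)

    # (ii) Locality: fixed penalty that grows with query-key distance.
    pos = torch.arange(n, device=q.device)
    dist = (pos[:, None] - pos[None, :]).abs().float()          # (n, n)
    scores = scores - decay * dist

    # (i) Smoothness: average each row of scores over a small key neighborhood.
    scores = F.avg_pool1d(
        scores.reshape(b * h * n, 1, n),
        kernel_size=window, stride=1, padding=window // 2,
        count_include_pad=False,
    ).reshape(b, h, n, n)

    attn = scores.softmax(dim=-1)
    return attn @ v


# Example usage with random tensors.
q = torch.randn(2, 4, 128, 32)
k = torch.randn(2, 4, 128, 32)
v = torch.randn(2, 4, 128, 32)
out = smooth_local_attention(q, k, v)   # (2, 4, 128, 32)
```

Both modifications operate directly on the score matrix, so they add only a small constant-factor cost per attention call, matching the abstract's claim of negligible extra computation.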

Original language: English
Pages (from-to): 62815-62831
Number of pages: 17
Journal: Proceedings of Machine Learning Research
Volume: 235
State: Published - 2024
Event: 41st International Conference on Machine Learning, ICML 2024 - Vienna, Austria
Duration: 21 Jul 2024 - 27 Jul 2024

Funding

Funders (funder number):
Tel Aviv University
Ministry of Innovation, Science & Technology, Israel (1001576154)
Michael J. Fox Foundation for Parkinson's Research (MJFF-022407)
