Improving Transformer Models by Reordering their Sublayers

Ofir Press, Noah A. Smith, Omer Levy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern lead to better performance? We generate randomly ordered transformers and train them with the language modeling objective. We observe that some of these models are able to achieve better performance than the interleaved baseline, and that those successful variants tend to have more self-attention at the bottom and more feedforward sublayers at the top. We propose a new transformer pattern that adheres to this property, the sandwich transformer, and show that it improves perplexity on multiple word-level and character-level language modeling benchmarks, at no cost in parameters, memory, or training time. However, the sandwich reordering pattern does not guarantee performance gains across every task, as we demonstrate on machine translation models. Instead, we suggest that further exploration of task-specific sublayer reorderings is needed in order to unlock additional gains.
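To make the orderings described above concrete, here is a minimal sketch (not the authors' released code) using a string notation in which 's' denotes a self-attention sublayer and 'f' a feedforward sublayer; the function names and the choice of Python are illustrative.

# Hypothetical sketch: sublayer ordering strings, 's' = self-attention, 'f' = feedforward.

def interleaved(n: int) -> str:
    # Baseline transformer: n repetitions of (self-attention, feedforward).
    return "sf" * n

def sandwich(n: int, k: int) -> str:
    # Sandwich pattern with sandwich coefficient k: k extra self-attention
    # sublayers at the bottom, k extra feedforward sublayers at the top,
    # and the remaining n - k interleaved pairs in the middle. The counts of
    # 's' and 'f' sublayers match the interleaved baseline exactly.
    assert 0 <= k <= n
    return "s" * k + "sf" * (n - k) + "f" * k

print(interleaved(16))   # 'sf' repeated 16 times
print(sandwich(16, 6))   # 6 's', then 'sf' x 10, then 6 'f' (32 sublayers in total)

Because both orderings contain the same number of each sublayer type, the reordering changes only the arrangement, which is why it comes at no cost in parameters or memory.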
Original language: English
Title of host publication: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Editors: Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Publisher: Association for Computational Linguistics
Pages: 2996-3005
Number of pages: 10
ISBN (Electronic): 978-1-952148-25-5
DOIs
State: Published - 1 Jul 2020
Externally published: Yes
Event: 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 - Virtual
Duration: 5 Jul 2020 → 10 Jul 2020
Conference number: 58

Conference

Conference: 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020
Abbreviated title: ACL 2020
Period: 5/07/20 → 10/07/20
