TY - JOUR
T1 - Efficient Long-Text Understanding with Short-Text Models
AU - Ivgi, Maor
AU - Shaham, Uri
AU - Berant, Jonathan
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - Transformer-based pretrained language models (LMs) are ubiquitous across natural language understanding, but cannot be applied to long sequences such as stories, scientific articles, and long documents due to their quadratic complexity. While a myriad of efficient transformer variants have been proposed, they are typically based on custom implementations that require expensive pretraining from scratch. In this work, we propose SLED: SLiding-Encoder and Decoder, a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs. Specifically, we partition the input into overlapping chunks, encode each with a short-text LM encoder and use the pretrained decoder to fuse information across chunks (fusion-in-decoder). We illustrate through controlled experiments that SLED offers a viable strategy for long text understanding and evaluate our approach on SCROLLS, a benchmark with seven datasets across a wide range of language understanding tasks. We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pretraining step.
AB - Transformer-based pretrained language models (LMs) are ubiquitous across natural language understanding, but cannot be applied to long sequences such as stories, scientific articles, and long documents due to their quadratic complexity. While a myriad of efficient transformer variants have been proposed, they are typically based on custom implementations that require expensive pretraining from scratch. In this work, we propose SLED: SLiding-Encoder and Decoder, a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs. Specifically, we partition the input into overlapping chunks, encode each with a short-text LM encoder and use the pretrained decoder to fuse information across chunks (fusion-in-decoder). We illustrate through controlled experiments that SLED offers a viable strategy for long text understanding and evaluate our approach on SCROLLS, a benchmark with seven datasets across a wide range of language understanding tasks. We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pretraining step.
UR - http://www.scopus.com/inward/record.url?scp=85151404318&partnerID=8YFLogxK
U2 - 10.1162/tacl_a_00547
DO - 10.1162/tacl_a_00547
M3 - Article
AN - SCOPUS:85151404318
SN - 2307-387X
VL - 11
SP - 284
EP - 299
JO - Transactions of the Association for Computational Linguistics
JF - Transactions of the Association for Computational Linguistics
ER -