Abstract
The advantage of Vision Transformers over CNNs is only fully manifested when they are trained on large datasets, mainly due to the reduced inductive bias towards spatial locality within the transformer's self-attention mechanism. In this work, we present a data-efficient vision transformer that does not rely on self-attention. Instead, it employs a novel multi-axis generalization of the recent Hyena layer. We propose several alternative approaches for obtaining this generalization and analyze their distinctions and trade-offs from both empirical and theoretical perspectives. The proposed Hyena N-D layer boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT, across multiple datasets. Furthermore, in the small-dataset regime, our Hyena-based ViT compares favorably to ViT variants from the recent literature that are specifically designed to address the same challenge. Finally, we show that a hybrid approach that uses Hyena N-D in the first layers of ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures. Our code is available at this git https URL.
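The abstract describes Hyena N-D only at a high level. As a rough illustration, below is a minimal, hypothetical sketch of one way a Hyena-style gated long convolution could be extended to two spatial axes (an FFT-based long convolution applied per axis with element-wise gating), used as a token mixer over a ViT patch grid. This is not the authors' implementation; all names (`Hyena2DSketch`, `fft_long_conv`), the direct filter parameterization, and the axis ordering are assumptions, and the paper itself proposes and compares several alternative generalizations.

```python
# Hypothetical sketch of a 2-D Hyena-style token mixer (not the paper's code).
import torch
import torch.nn as nn


def fft_long_conv(x, k):
    """Long convolution along the last dimension via FFT.

    x: (..., L) signal; k: filter broadcastable to x with length L in the last
    dimension. Zero-padding to 2L avoids circular wrap-around.
    """
    L = x.shape[-1]
    xf = torch.fft.rfft(x, n=2 * L)
    kf = torch.fft.rfft(k, n=2 * L)
    return torch.fft.irfft(xf * kf, n=2 * L)[..., :L]


class Hyena2DSketch(nn.Module):
    """Illustrative 2-D Hyena-style mixer for (B, C, H, W) feature maps."""

    def __init__(self, channels, height, width):
        super().__init__()
        self.proj_in = nn.Conv2d(channels, 3 * channels, kernel_size=1)
        self.proj_out = nn.Conv2d(channels, channels, kernel_size=1)
        # Per-channel long filters for each spatial axis, learned directly here
        # (the paper's filters may instead be parameterized implicitly).
        self.k_w = nn.Parameter(0.02 * torch.randn(channels, 1, width))
        self.k_h = nn.Parameter(0.02 * torch.randn(channels, 1, height))

    def forward(self, x):  # x: (B, C, H, W)
        g1, g2, v = self.proj_in(x).chunk(3, dim=1)
        # Long convolution along the width axis, gated element-wise.
        v = g1 * fft_long_conv(v, self.k_w)
        # Long convolution along the height axis: move H to the last dim.
        v = g2 * fft_long_conv(v.transpose(-1, -2), self.k_h).transpose(-1, -2)
        return self.proj_out(v)


# Usage on a ViT-style patch grid (e.g. 14x14 patches, 192 channels).
layer = Hyena2DSketch(channels=192, height=14, width=14)
tokens = torch.randn(2, 192, 14, 14)
out = layer(tokens)  # (2, 192, 14, 14), attention-free token mixing
```

In this sketch each axis is mixed by its own FFT-based long convolution, giving a global receptive field without self-attention; how the axes are combined (sequentially, jointly, or otherwise) is exactly the design space the paper explores.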
| Original language | English |
|---|---|
| Pages (from-to) | 973-981 |
| Number of pages | 9 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 238 |
| State | Published - 2024 |
| Event | 27th International Conference on Artificial Intelligence and Statistics, AISTATS 2024 - Valencia, Spain. Duration: 2 May 2024 → 4 May 2024 |
Funding
| Funders | Funder number |
|---|---|
| Tel Aviv University | |
| Ministry of Innovation, Science & Technology, Israel | 1001576154 |
| Michael J. Fox Foundation for Parkinson's Research | MJFF-022407 |