Multi-Dimensional Hyena for Spatial Inductive Bias

Itamar Zimerman, Lior Wolf

Research output: Contribution to journal › Conference article › peer-review

2 Scopus citations

Abstract

The advantage of Vision Transformers over CNNs is only fully realized when training on a large dataset, mainly due to the reduced inductive bias towards spatial locality in the transformer's self-attention mechanism. In this work, we present a data-efficient vision transformer that does not rely on self-attention. Instead, it employs a novel generalization of the recent Hyena layer to multiple axes. We propose several alternative approaches for obtaining this generalization and analyze their distinctions and trade-offs from both empirical and theoretical perspectives. The proposed Hyena N-D layer boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT, across multiple datasets. Furthermore, in the small-dataset regime, our Hyena-based ViT compares favorably to ViT variants from the recent literature that were specifically designed to address the same challenge. Finally, we show that a hybrid approach that applies Hyena N-D in the first layers of ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures. Our code is available at this git https URL.
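
To illustrate the idea of extending Hyena-style global-convolution mixing to two spatial axes, the sketch below shows a minimal 2-D gated long-convolution token mixer in PyTorch. This is an illustrative assumption only: the class name Hyena2D, the explicit per-channel filter parameterization, and the gating scheme are hypothetical simplifications, not the authors' Hyena N-D implementation (the paper discusses several alternative generalizations).

```python
import torch
import torch.nn as nn
import torch.fft


class Hyena2D(nn.Module):
    """Minimal sketch of a 2-D Hyena-style token mixer (hypothetical, not the authors' code).

    Each channel is mixed over both spatial axes with a learned global filter,
    applied via a 2-D FFT and modulated by a data-dependent gate, in place of
    self-attention.
    """

    def __init__(self, dim: int, height: int, width: int):
        super().__init__()
        # Per-channel global filter spanning the full H x W grid (explicit here;
        # implicit/parametric filter parameterizations are also possible).
        self.filter = nn.Parameter(torch.randn(dim, height, width) * 0.02)
        self.gate_proj = nn.Linear(dim, dim)
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, dim)
        b, h, w, d = x.shape
        v = self.value_proj(x).permute(0, 3, 1, 2)      # (b, d, h, w)
        gate = torch.sigmoid(self.gate_proj(x))         # (b, h, w, d)

        # Circular global convolution over both spatial axes via 2-D FFT.
        v_f = torch.fft.rfft2(v, s=(h, w))
        k_f = torch.fft.rfft2(self.filter, s=(h, w))
        y = torch.fft.irfft2(v_f * k_f, s=(h, w))       # (b, d, h, w)

        y = y.permute(0, 2, 3, 1) * gate                # gated spatial mixing
        return self.out_proj(y)
```

In a hybrid configuration, a mixer like this would replace the self-attention block in the first few transformer layers while the remaining layers keep conventional attention, mirroring the setup described in the abstract.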

Original language: English
Pages (from-to): 973-981
Number of pages: 9
Journal: Proceedings of Machine Learning Research
Volume: 238
State: Published - 2024
Event: 27th International Conference on Artificial Intelligence and Statistics, AISTATS 2024 - Valencia, Spain
Duration: 2 May 2024 - 4 May 2024

Funding

Funders (funder number):
Tel Aviv University
Ministry of Innovation, Science & Technology, Israel (1001576154)
Michael J. Fox Foundation for Parkinson's Research (MJFF-022407)
