Cross-Modal Retrieval Augmentation for Multi-Modal Classification

Shir Gur, Natalia Neverova, Chris Stauffer, Ser Nam Lim, Douwe Kiela, Austin Reiter

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    Abstract

    Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which substantially improves image-caption retrieval performance relative to similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel inference-time applications such as hot-swapping indices.
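    The pipeline the abstract describes — embed images and captions in a shared space, retrieve nearest captions for an image query, feed the retrieved text to the downstream model, and swap the index at inference time — can be sketched roughly as follows. This is a minimal illustration under assumed names (`CaptionIndex`, toy embeddings); it is not the authors' implementation.

    ```python
    import numpy as np

    class CaptionIndex:
        """Toy dense index over caption embeddings (hypothetical, for illustration).

        Embeddings are L2-normalized so a dot product equals cosine similarity.
        """

        def __init__(self, embeddings: np.ndarray, captions: list):
            norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
            self.embeddings = embeddings / norms
            self.captions = captions

        def retrieve(self, query: np.ndarray, k: int = 2) -> list:
            q = query / np.linalg.norm(query)
            scores = self.embeddings @ q          # cosine similarity per caption
            top = np.argsort(-scores)[:k]         # indices of the k best matches
            return [self.captions[i] for i in top]

    # Build an index from (toy) caption embeddings, standing in for the output
    # of a trained image-caption alignment model.
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(4, 8))
    index = CaptionIndex(emb, ["a dog", "a cat", "a car", "a tree"])

    # Embed the image query (here: reuse a caption embedding as a stand-in),
    # retrieve top-k captions, and attach them to the transformer's input.
    query = emb[2]
    retrieved = index.retrieve(query, k=2)
    augmented_input = {"image": "img.jpg",
                       "question": "What is shown?",
                       "retrieved_captions": retrieved}

    # "Hot-swapping" at inference: replace the index with a new knowledge
    # source without retraining the model itself.
    index = CaptionIndex(rng.normal(size=(3, 8)), ["a boat", "a plane", "a bike"])
    ```

    The design point is that retrieval is decoupled from the classifier: only the index changes when the external knowledge source does, which is what makes inference-time hot-swapping cheap.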

    Original language: English
    Title of host publication: Findings of the Association for Computational Linguistics, Findings of ACL
    Subtitle of host publication: EMNLP 2021
    Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-Tau Yih
    Publisher: Association for Computational Linguistics (ACL)
    Pages: 111-123
    Number of pages: 13
    ISBN (Electronic): 9781955917100
    State: Published - 2021
    Event: 2021 Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021 - Punta Cana, Dominican Republic
    Duration: 7 Nov 2021 - 11 Nov 2021

    Publication series

    Name: Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021

    Conference

    Conference: 2021 Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021
    Country/Territory: Dominican Republic
    City: Punta Cana
    Period: 7/11/21 - 11/11/21
