Fine-Tuning CLIP via Explainability Map Propagation for Boosting Image and Video Retrieval

Yoav Shalev*, Lior Wolf

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Recent studies have highlighted the remarkable performance of CLIP for diverse downstream tasks. To understand how CLIP performs these tasks, various explainability methods have been formulated. In this paper, we reveal that the explainability maps associated with CLIP are often focused on a limited portion of the image and overlook objects that are explicitly mentioned in the text. This phenomenon may result in a high similarity score for incongruent image-text pairs, thereby potentially introducing a bias. To address this issue, we introduce a novel fine-tuning technique for CLIP that leverages a transformer explainability method. Unlike traditional approaches that generate a single heatmap using an image-text pair, our method produces multiple heatmaps directly from the image itself. We use these heatmaps both during the fine-tuning process and at inference time to highlight key visual elements, applying them to the features during the image encoding process, steering the visual encoder’s attention toward these key elements. This process guides the image encoder across different spatial regions and generates a set of visual embeddings, thereby allowing the model to consider various aspects of the image, ensuring a detailed and comprehensive understanding that surpasses the limited scope of the original CLIP model. Our method leads to a notable improvement in text, image, and video retrieval across multiple benchmarks. It also results in reduced gender bias, making our model more equitable.
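The abstract describes generating multiple heatmaps from an image and applying them to the visual features so that each heatmap yields its own embedding. The sketch below is a hypothetical simplification of that idea: it weights precomputed patch features by each heatmap and pools them into one embedding per map, then scores an image-text pair by the best-matching embedding. The function names, shapes, and the use of simple weighted pooling are illustrative assumptions; the paper applies the maps inside the encoder rather than after it.

```python
import numpy as np

def multi_heatmap_embeddings(patch_feats, heatmaps):
    """Pool patch features under each heatmap into one embedding per map.

    patch_feats: (N, D) array of image patch features (hypothetical input).
    heatmaps:    (K, N) array of K non-negative spatial maps over the N patches.
    Returns a (K, D) array of L2-normalized embeddings.
    """
    # Normalize each heatmap so it sums to 1, then use it as pooling weights.
    weights = heatmaps / heatmaps.sum(axis=1, keepdims=True)
    embeds = weights @ patch_feats  # (K, D): one pooled embedding per heatmap
    # L2-normalize so dot products behave as cosine similarities, as in CLIP.
    return embeds / np.linalg.norm(embeds, axis=1, keepdims=True)

def retrieval_score(text_embed, image_embeds):
    """Score an image-text pair by its best-matching heatmap embedding."""
    text_embed = text_embed / np.linalg.norm(text_embed)
    return float(np.max(image_embeds @ text_embed))
```

Taking the maximum over the per-heatmap similarities lets a caption match whichever image region it actually describes, rather than a single global embedding that may ignore the mentioned object.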

Original language: English
Title of host publication: Advances in Information Retrieval - 46th European Conference on Information Retrieval, ECIR 2024, Proceedings
Editors: Nazli Goharian, Nicola Tonellotto, Yulan He, Aldo Lipani, Graham McDonald, Craig Macdonald, Iadh Ounis
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 356-370
Number of pages: 15
ISBN (Print): 9783031560262
State: Published - 2024
Event: 46th European Conference on Information Retrieval, ECIR 2024 - Glasgow, United Kingdom
Duration: 24 Mar 2024 – 28 Mar 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14608 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 46th European Conference on Information Retrieval, ECIR 2024
Country/Territory: United Kingdom
City: Glasgow
Period: 24/03/24 – 28/03/24

Funding

Funders: Tel Aviv University

Keywords

• CLIP
• Explainability
• Image and Video Retrieval
