TY - GEN
T1 - Fine-Tuning CLIP via Explainability Map Propagation for Boosting Image and Video Retrieval
AU - Shalev, Yoav
AU - Wolf, Lior
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
PY - 2024
Y1 - 2024
AB - Recent studies have highlighted the remarkable performance of CLIP for diverse downstream tasks. To understand how CLIP performs these tasks, various explainability methods have been formulated. In this paper, we reveal that the explainability maps associated with CLIP are often focused on a limited portion of the image and overlook objects that are explicitly mentioned in the text. This phenomenon may result in a high similarity score for incongruent image-text pairs, thereby potentially introducing a bias. To address this issue, we introduce a novel fine-tuning technique for CLIP that leverages a transformer explainability method. Unlike traditional approaches that generate a single heatmap using an image-text pair, our method produces multiple heatmaps directly from the image itself. We use these heatmaps both during the fine-tuning process and at inference time to highlight key visual elements, applying them to the features during the image encoding process, steering the visual encoder’s attention toward these key elements. This process guides the image encoder across different spatial regions and generates a set of visual embeddings, thereby allowing the model to consider various aspects of the image, ensuring a detailed and comprehensive understanding that surpasses the limited scope of the original CLIP model. Our method leads to a notable improvement in text, image, and video retrieval across multiple benchmarks. It also results in reduced gender bias, making our model more equitable.
KW - CLIP
KW - Explainability
KW - Image and Video Retrieval
UR - http://www.scopus.com/inward/record.url?scp=85189755870&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-56027-9_22
DO - 10.1007/978-3-031-56027-9_22
M3 - Conference contribution
AN - SCOPUS:85189755870
SN - 9783031560262
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 356
EP - 370
BT - Advances in Information Retrieval - 46th European Conference on Information Retrieval, ECIR 2024, Proceedings
A2 - Goharian, Nazli
A2 - Tonellotto, Nicola
A2 - He, Yulan
A2 - Lipani, Aldo
A2 - McDonald, Graham
A2 - Macdonald, Craig
A2 - Ounis, Iadh
PB - Springer Science and Business Media Deutschland GmbH
T2 - 46th European Conference on Information Retrieval, ECIR 2024
Y2 - 24 March 2024 through 28 March 2024
ER -