Altogether: Image Captioning via Re-aligning Alt-text

  • Hu Xu
  • Po Yao Huang
  • Xiaoqing Ellen Tan
  • Ching Feng Yeh
  • Jacob Kahn
  • Christine Jou
  • Gargi Ghosh
  • Omer Levy
  • Luke Zettlemoyer
  • Wen Tau Yih
  • Shang Wen Li
  • Saining Xie
  • Christoph Feichtenhofer

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

2 Scopus citations

Abstract

This paper focuses on creating synthetic data to improve the quality of image captions. Existing works typically have two shortcomings. First, they caption images from scratch, ignoring existing alt-text metadata; second, they lack transparency if the captioners' training data (e.g., GPT) is unknown. In this paper, we study a principled approach, Altogether, based on the key idea of editing and re-aligning existing alt-texts associated with the images. To generate training data, we perform human annotation in which annotators start with the existing alt-text and re-align it to the image content in multiple rounds, consequently constructing captions with rich visual concepts. This differs from prior work that carries out human annotation as a one-time description task based solely on images and annotator knowledge. We train a captioner on this data that generalizes the process of re-aligning alt-texts at scale. Our results show that our Altogether approach leads to richer image captions that also improve text-to-image generation and zero-shot image classification tasks.

Original language: English
Title of host publication: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Publisher: Association for Computational Linguistics (ACL)
Pages: 19302-19318
Number of pages: 17
ISBN (Electronic): 9798891761643
State: Published - 2024
Externally published: Yes
Event: 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024 - Hybrid, Miami, United States
Duration: 12 Nov 2024 to 16 Nov 2024

Publication series

Name: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

Conference

Conference: 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
Country/Territory: United States
City: Hybrid, Miami
Period: 12/11/24 to 16/11/24
