DATACOMP: In search of the next generation of multimodal datasets

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alex Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, Ludwig Schmidt

Research output: Contribution to journal › Conference article › peer-review


Abstract

Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DATACOMP, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DATACOMP workflow leads to better training sets. Our best baseline, DATACOMP-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DATACOMP and all accompanying code at www.datacomp.ai.
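The abstract describes the DATACOMP workflow: participants filter the candidate pool of image-text pairs, then train and evaluate with fixed code. A minimal sketch of one common filtering idea, scoring pairs by the cosine similarity of their image and text embeddings and keeping the best-aligned ones, is shown below. The function name, threshold, and toy embeddings are illustrative assumptions, not the paper's actual baseline implementation.

```python
import math

def clip_score_filter(pairs, threshold=0.3):
    """Keep (image_emb, text_emb) pairs whose cosine similarity exceeds
    `threshold`. A simplified stand-in for CLIP-score-style filtering;
    the values and names here are illustrative only."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)
    # Return the indices of pairs that survive the filter.
    return [i for i, (img, txt) in enumerate(pairs) if cosine(img, txt) > threshold]

# Toy 2-D embeddings: one well-aligned pair, one orthogonal, one anti-aligned.
pairs = [
    ([1.0, 0.0], [0.9, 0.1]),   # high similarity: kept
    ([1.0, 0.0], [0.0, 1.0]),   # orthogonal: dropped
    ([1.0, 0.0], [-1.0, 0.0]),  # anti-aligned: dropped
]
kept = clip_score_filter(pairs)  # → [0]
```

In the actual benchmark, such a filter would run over embeddings of the 12.8 billion candidate pairs before the standardized CLIP training step.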

Original language: English
Journal: Advances in Neural Information Processing Systems
Volume: 36
State: Published - 2023
Event: 37th Conference on Neural Information Processing Systems, NeurIPS 2023 - New Orleans, United States
Duration: 10 Dec 2023 – 16 Dec 2023

Funding

Funders | Funder number
Blavatnik Family Foundation
National Science Foundation
Google
UT Austin Machine Learning Lab
National Institute of Corrections
Gujarat Cancer Society
Allen Institute
Microsoft
John von Neumann Institute for Computing
Helmholtz Data Federation
Jülich Supercomputing Centre, Forschungszentrum Jülich
Alexander S. Onassis Public Benefit Foundation | IFML CCF 2019844, DMS 2134012, F ZS 056-1/2022-2023, AF 1901292, F ZS 012-1/2022-2023, CCF 1934932, CNS 2148141
