TY - GEN
T1 - Towers of Babel
T2 - 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021
AU - Wu, Xiaoshi
AU - Averbuch-Elor, Hadar
AU - Sun, Jin
AU - Snavely, Noah
N1 - Publisher Copyright: © 2021 IEEE
PY - 2021
Y1 - 2021
AB - The abundance and richness of Internet photos of landmarks and cities has led to significant progress in 3D vision over the past two decades, including automated 3D reconstructions of the world's landmarks from tourist photos. However, a major source of information available for these 3D-augmented collections, namely language (e.g., from image captions), has been virtually untapped. In this work, we present WikiScenes, a new, large-scale dataset of landmark photo collections that contains descriptive text in the form of captions and hierarchical category names. WikiScenes forms a new testbed for multimodal reasoning involving images, text, and 3D geometry. We demonstrate the utility of WikiScenes for learning semantic concepts over images and 3D models. Our weakly-supervised framework connects images, 3D structure, and semantics (utilizing the strong constraints provided by 3D geometry) to associate semantic concepts to image pixels and 3D points.
UR - http://www.scopus.com/inward/record.url?scp=85121817859&partnerID=8YFLogxK
U2 - 10.1109/ICCV48922.2021.00048
DO - 10.1109/ICCV48922.2021.00048
M3 - Conference contribution
AN - SCOPUS:85121817859
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 418
EP - 427
BT - Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 11 October 2021 through 17 October 2021
ER -