TY - CONF
T1 - Detector-Free Weakly Supervised Grounding by Separation
AU - Arbelle, Assaf
AU - Doveh, Sivan
AU - Alfassy, Amit
AU - Shtok, Joseph
AU - Lev, Guy
AU - Schwartz, Eli
AU - Kuehne, Hilde
AU - Levi, Hila Barak
AU - Sattigeri, Prasanna
AU - Panda, Rameswar
AU - Chen, Chun-Fu
AU - Bronstein, Alex
AU - Saenko, Kate
AU - Ullman, Shimon
AU - Giryes, Raja
AU - Feris, Rogerio
AU - Karlinsky, Leonid
N1 - Publisher Copyright:
© 2021 IEEE
PY - 2021
Y1 - 2021
N2 - Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object detector, relying on it to produce the ROIs for localization. In this work, we focus on the task of Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector. The key idea behind our proposed Grounding by Separation (GbS) method is synthesizing 'text to image-regions' associations by random alpha-blending of arbitrary image pairs and using the corresponding texts of the pair as conditions to recover the alpha map from the blended image via a segmentation network. At test time, this allows using the query phrase as a condition for a non-blended query image, thus interpreting the test image as a composition of a region corresponding to the phrase and the complement region. Our GbS shows an 8.5% accuracy improvement over the previous DF-WSG SotA, for a range of benchmarks including Flickr30K, Visual Genome, and ReferIt, as well as a complementary improvement (above 7%) over the detector-based approaches for WSG.
AB - Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object detector, relying on it to produce the ROIs for localization. In this work, we focus on the task of Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector. The key idea behind our proposed Grounding by Separation (GbS) method is synthesizing 'text to image-regions' associations by random alpha-blending of arbitrary image pairs and using the corresponding texts of the pair as conditions to recover the alpha map from the blended image via a segmentation network. At test time, this allows using the query phrase as a condition for a non-blended query image, thus interpreting the test image as a composition of a region corresponding to the phrase and the complement region. Our GbS shows an 8.5% accuracy improvement over the previous DF-WSG SotA, for a range of benchmarks including Flickr30K, Visual Genome, and ReferIt, as well as a complementary improvement (above 7%) over the detector-based approaches for WSG.
UR - http://www.scopus.com/inward/record.url?scp=85127787083&partnerID=8YFLogxK
U2 - 10.1109/ICCV48922.2021.00182
DO - 10.1109/ICCV48922.2021.00182
M3 - Conference contribution
AN - SCOPUS:85127787083
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 1781
EP - 1792
BT - Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021
Y2 - 11 October 2021 through 17 October 2021
ER -