TY - GEN
T1 - Object-level Scene Deocclusion
AU - Liu, Zhengzhe
AU - Liu, Qing
AU - Chang, Chirui
AU - Zhang, Jianming
AU - Pakhomov, Daniil
AU - Zheng, Haitian
AU - Lin, Zhe
AU - Cohen-Or, Daniel
AU - Fu, Chi-Wing
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/7/13
Y1 - 2024/7/13
AB - Deoccluding the hidden portions of objects in a scene is a formidable task, particularly when addressing real-world scenes. In this paper, we present a new self-supervised PArallel visible-to-COmplete diffusion framework, named PACO, a foundation model for object-level scene deocclusion. Leveraging the rich prior of pre-trained models, we first design the parallel variational autoencoder, which produces a full-view feature map that simultaneously encodes multiple complete objects, and the visible-to-complete latent generator, which learns to implicitly predict the full-view feature map from the partial-view feature map and text prompts extracted from the incomplete objects in the input image. To train PACO, we create a large-scale dataset with 500k samples to enable self-supervised learning, avoiding tedious annotation of amodal masks and occluded regions. At inference, we devise a layer-wise deocclusion strategy to improve efficiency while maintaining the deocclusion quality. Extensive experiments on COCOA and various real-world scenes demonstrate the superior capability of PACO for scene deocclusion, surpassing the state of the art by a large margin. Our method can also be extended to cross-domain scenes and novel categories that are not covered by the training set. Further, we demonstrate the applicability of PACO's deocclusion to single-view 3D scene reconstruction and object recomposition.
KW - image recomposition
KW - object completion
KW - scene deocclusion
UR - http://www.scopus.com/inward/record.url?scp=85199857624&partnerID=8YFLogxK
U2 - 10.1145/3641519.3657409
DO - 10.1145/3641519.3657409
M3 - Conference contribution
AN - SCOPUS:85199857624
T3 - Proceedings - SIGGRAPH 2024 Conference Papers
BT - Proceedings - SIGGRAPH 2024 Conference Papers
A2 - Spencer, Stephen N.
PB - Association for Computing Machinery, Inc
T2 - SIGGRAPH 2024 Conference Papers
Y2 - 28 July 2024 through 1 August 2024
ER -