TY - JOUR
T1 - JOKR: Joint Keypoint Representation for Unsupervised Video Retargeting
T2 - Computer Graphics Forum
AU - Mokady, R.
AU - Tzaban, R.
AU - Benaim, S.
AU - Bermano, A. H.
AU - Cohen-Or, D.
N1 - Publisher Copyright:
© 2022 The Authors. Computer Graphics Forum published by Eurographics - The European Association for Computer Graphics and John Wiley & Sons Ltd.
PY - 2022/9
Y1 - 2022/9
N2 - In unsupervised video retargeting, content is transferred from one video to another while preserving the original appearance and style, without any additional annotations. While this challenge has seen substantial advancements through the use of deep neural networks, current methods struggle when the source and target videos depict shapes that differ in limb lengths or other body proportions. In this work, we consider this task for objects of different shapes and appearances that share similar skeleton connectivity and depict similar motion. We introduce JOKR, a JOint Keypoint Representation that captures the geometry common to both videos while being disentangled from their unique styles. Our model first extracts unsupervised keypoints from the given videos. From this representation, two decoders, one for each input sequence, reconstruct geometry and appearance. By employing an affine-invariant domain confusion term over the keypoint bottleneck, we enforce that the unsupervised keypoint representations of both videos are indistinguishable. This encourages the aforementioned disentanglement between motion and appearance, mapping similar poses from both domains to the same representation, and allows us to generate a sequence with the appearance and style of one video but the content of the other. The applicability of our method is demonstrated on challenging video pairs, in comparison to state-of-the-art methods. Furthermore, we demonstrate that this geometry-driven representation enables intuitive control, such as temporal coherence and manual pose editing. Videos can be viewed in the supplementary HTML.
KW - video generation
KW - video retargeting
UR - http://www.scopus.com/inward/record.url?scp=85134186927&partnerID=8YFLogxK
U2 - 10.1111/cgf.14616
DO - 10.1111/cgf.14616
M3 - Article
AN - SCOPUS:85134186927
SN - 0167-7055
VL - 41
SP - 245
EP - 257
JO - Computer Graphics Forum
JF - Computer Graphics Forum
IS - 6
ER -