TY - GEN
T1 - Speech resynthesis from discrete disentangled self-supervised representations
AU - Polyak, Adam
AU - Adi, Yossi
AU - Copet, Jade
AU - Kharitonov, Eugene
AU - Lakhotia, Kushal
AU - Hsu, Wei-Ning
AU - Mohamed, Abdelrahman
AU - Dupoux, Emmanuel
N1 - Publisher Copyright:
© 2021 ISCA
PY - 2021
Y1 - 2021
N2 - We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representations, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows us to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we reach a rate of 365 bits per second while providing better speech quality than the baseline methods. Audio samples can be found under the following link: resynthesis-ssl.github.io.
AB - We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representations, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows us to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we reach a rate of 365 bits per second while providing better speech quality than the baseline methods. Audio samples can be found under the following link: resynthesis-ssl.github.io.
KW - Self-supervised learning
KW - Speech codec
KW - Speech generation
KW - Speech resynthesis
UR - http://www.scopus.com/inward/record.url?scp=85115366697&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-475
DO - 10.21437/Interspeech.2021-475
M3 - Conference contribution
AN - SCOPUS:85115366697
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 3531
EP - 3535
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
Y2 - 30 August 2021 through 3 September 2021
ER -