Taco-VC: A single speaker tacotron based voice conversion with limited data

Roee Levy-Leshem, Raja Giryes

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations


This paper introduces Taco-VC, a novel voice conversion architecture based on the Tacotron synthesizer, a sequence-to-sequence model with attention. Training multi-speaker voice conversion systems requires substantial resources, both in training and in corpus size. Taco-VC is implemented using a single-speaker Tacotron synthesizer based on Phonetic PosteriorGrams (PPGs) and a single-speaker WaveNet vocoder conditioned on mel spectrograms. To enhance the quality of the converted speech and to overcome over-smoothing, the outputs of Tacotron are passed through a novel speech-enhancement network composed of a combination of the phoneme recognition and Tacotron networks. Our system is trained with only a single-speaker corpus and adapts to new speakers using only a few minutes of training data. Using mid-size public datasets, our method outperforms the baseline in the VCC 2018 SPOKE non-parallel voice conversion task and achieves competitive results compared to multi-speaker networks trained on large private datasets.
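The data flow described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the `Stage` class, the `identity` placeholder, and the `convert` function are hypothetical names introduced for illustration, and the placement of the enhancement network directly before the vocoder is an assumption read from the abstract's wording. Each stage is a stand-in that passes data through unchanged, so only the ordering of the components is shown.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str                      # component label taken from the abstract
    produces: str                  # representation the component outputs
    run: Callable[[list], list]    # hypothetical placeholder transform


def identity(frames: list) -> list:
    # Stand-in for a trained network; a real stage would transform the data.
    return list(frames)


# Stage order as read from the abstract. Feeding the enhanced mel
# spectrogram to the vocoder is an assumption, not a confirmed detail.
PIPELINE: List[Stage] = [
    Stage("phoneme recognizer", "PPGs", identity),
    Stage("single-speaker Tacotron", "mel spectrogram", identity),
    Stage("speech-enhancement network", "enhanced mel spectrogram", identity),
    Stage("single-speaker WaveNet vocoder", "waveform", identity),
]


def convert(source_frames: list) -> tuple:
    """Pass data through each stage and record what each stage produced."""
    trace = []
    data = source_frames
    for stage in PIPELINE:
        data = stage.run(data)
        trace.append((stage.name, stage.produces))
    return data, trace
```

Calling `convert` on a list of placeholder frames returns the (unchanged) data together with a trace of the intermediate representations, which makes the sequence PPGs, mel spectrogram, enhanced mel spectrogram, waveform explicit.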

Original language: English
Title of host publication: 28th European Signal Processing Conference, EUSIPCO 2020 - Proceedings
Publisher: European Signal Processing Conference, EUSIPCO
Number of pages: 5
ISBN (Electronic): 9789082797053
State: Published - 24 Jan 2021
Event: 28th European Signal Processing Conference, EUSIPCO 2020 - Amsterdam, Netherlands
Duration: 24 Aug 2020 to 28 Aug 2020

Publication series

Name: European Signal Processing Conference
ISSN (Print): 2219-5491


Conference: 28th European Signal Processing Conference, EUSIPCO 2020


  • Adaptation
  • Speech Recognition
  • Speech Synthesis
  • Voice Conversion

