wendison / vqmivc

One-shot (any-to-any) Voice Conversion

Run time and cost

This model runs on CPU hardware. Predictions typically complete within 21 seconds, though run time varies significantly with the inputs.
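
For reference, the model can also be called programmatically. Below is a minimal sketch using the Replicate Python client; the input field names ("source" and "reference") are illustrative assumptions, so check the model's API schema for the actual names.

# Minimal sketch of calling this model with the Replicate Python client.
# The input field names "source" and "reference" are assumptions for illustration;
# consult the model's API schema on Replicate for the actual names.
import replicate

output = replicate.run(
    "wendison/vqmivc",  # append ":<version-id>" to pin a specific version
    input={
        "source": open("source_utterance.wav", "rb"),    # speech whose content is kept
        "reference": open("target_speaker.wav", "rb"),   # single utterance of the target speaker
    },
)
print(output)  # URL (or file handle) of the converted audio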

Readme

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion (Interspeech 2021)

This paper proposes a speech representation disentanglement framework for one-shot (any-to-any) voice conversion, which performs conversion across arbitrary speakers given only a single target-speaker utterance as reference. Vector quantization with contrastive predictive coding (VQCPC) is used for content encoding, and mutual information (MI) is introduced as a correlation metric during training to disentangle the content, speaker, and pitch representations by reducing their inter-dependencies in an unsupervised manner.
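
To make the data flow concrete, here is a highly simplified, hypothetical PyTorch sketch of how the three disentangled representations are combined at decoding time. It is not the authors' implementation; module internals, dimensions, and the CPC/MI losses are placeholders or omitted.

# Conceptual sketch only; the real content encoder is VQ-CPC, the speaker encoder
# follows AdaIN-VC, and the decoder is modified from AutoVC. All sizes are placeholders.
import torch
import torch.nn as nn

class ToyVQMIVC(nn.Module):
    def __init__(self, n_mels=80, content_dim=64, speaker_dim=64):
        super().__init__()
        self.content_encoder = nn.Linear(n_mels, content_dim)   # stands in for VQ-CPC
        self.speaker_encoder = nn.Linear(n_mels, speaker_dim)   # utterance-level embedding
        self.decoder = nn.Linear(content_dim + speaker_dim + 1, n_mels)

    def forward(self, mel, f0):
        # mel: (batch, frames, n_mels); f0: (batch, frames), normalized log-F0
        content = self.content_encoder(mel)                      # frame-level content
        speaker = self.speaker_encoder(mel.mean(dim=1))          # one vector per utterance
        speaker = speaker.unsqueeze(1).expand(-1, mel.size(1), -1)
        z = torch.cat([content, speaker, f0.unsqueeze(-1)], dim=-1)
        return self.decoder(z)                                   # reconstructed mel frames

# Training combines a reconstruction loss with CPC and MI terms so that content,
# speaker, and pitch carry as little information about one another as possible.
# At conversion time, the speaker embedding is taken from the single reference utterance.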

Citation

If you use the code in your research, please star our repo and cite our paper:

@article{wang2021vqmivc,
  title={VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion},
  author={Wang, Disong and Deng, Liqun and Yeung, Yu Ting and Chen, Xiao and Liu, Xunying and Meng, Helen},
  journal={arXiv preprint arXiv:2106.10132},
  year={2021}
}

Acknowledgements:

  • The content encoder is borrowed from VectorQuantizedCPC, which also inspired the within-utterance negative sampling for CPC;
  • The speaker encoder is borrowed from AdaIN-VC;
  • The decoder is modified from AutoVC;
  • Estimation of mutual information is modified from CLUB;
  • Speech feature extraction is based on ESPnet and PyWorld (a minimal F0 extraction sketch follows below).
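
As a small illustration of the PyWorld-based pitch features, the following sketch extracts and normalizes an F0 track; the frame settings and normalization are assumptions, not necessarily this repository's exact configuration.

# Minimal F0 extraction sketch with pyworld; frame_period and normalization are
# illustrative assumptions, not necessarily the values used in this repository.
import numpy as np
import pyworld as pw
import soundfile as sf

wav, sr = sf.read("utterance.wav")                   # mono waveform, float64 by default
wav = np.ascontiguousarray(wav, dtype=np.float64)

f0, timeaxis = pw.dio(wav, sr, frame_period=10.0)    # coarse F0 track, 10 ms hop
f0 = pw.stonemask(wav, f0, timeaxis, sr)             # refine the estimate

voiced = f0 > 0                                      # unvoiced frames have f0 == 0
log_f0 = np.zeros_like(f0)
log_f0[voiced] = np.log(f0[voiced])                  # log-F0, later normalized per utterance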