We developed a three-stage deep learning framework that performs voice cloning in real time. The framework is based on a 2018 paper from Google for which no public implementation existed before ours. From a speech utterance of only 5 seconds, it captures a meaningful digital representation of the voice spoken. Given a text prompt, it then performs text-to-speech using any voice extracted by this process. We reproduced each of the three stages of the framework with our own implementations or open-source ones, implementing efficient deep learning models and appropriate data preprocessing pipelines. We trained these models for weeks or months on large datasets comprising tens of thousands of hours of speech from several thousand speakers, and we analyzed their capabilities and drawbacks. We focused on making the framework operate in real time, that is, on capturing a voice and generating speech in less time than the duration of the generated speech. The framework can clone voices it has never heard during training and generate speech from text it has never seen. We made our code and pretrained models public, and we developed a graphical interface to the framework so that it is accessible even to users unfamiliar with deep learning.
\vspace{2cm}
\noindent\textbf{Author}: Corentin Jemine {\small\textit{(C.S. bachelor 2014-2017, master in Data Science 2017-2019)}}
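\noindent As an illustration of the three stages mentioned above, here is a minimal Python sketch of the inference pipeline. The module and function names are hypothetical stand-ins for the speaker encoder, the synthesizer and the vocoder; they are not the actual API of our implementation.
\begin{verbatim}
# Minimal sketch of the three-stage pipeline. All names are
# hypothetical stand-ins, not the actual API of our implementation.
import numpy as np

def clone_voice(reference_wav: np.ndarray, text: str,
                encoder, synthesizer, vocoder) -> np.ndarray:
    # Stage 1: speaker encoder. A fixed-size embedding is derived
    # from ~5 seconds of reference speech; this is the digital
    # representation of the voice.
    embedding = encoder.embed_utterance(reference_wav)
    # Stage 2: synthesizer. A mel spectrogram is generated from the
    # text, conditioned on the speaker embedding.
    mel_spectrogram = synthesizer.synthesize(text, embedding)
    # Stage 3: vocoder. The mel spectrogram is inverted to an
    # audible waveform.
    return vocoder.infer_waveform(mel_spectrogram)
\end{verbatim}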
Generalized End-To-End Loss For Speaker Verification
------ Todo ------
- Analyze perf of the encoder on the test set (join vc1_test and vc2_test, they're small)
- Behaviour of the encoder outputs on a sequence (when does it converge, what happens if another speaker comes in, etc.) -> try to achieve speaker diarization from this
------ Ideas ------
- Add a term in the loss that forces the centroid of synthesized utterances to be close to that of ground-truth utterances (see the sketch after this list)
- How much is the encoding affected by the quality of the microphone? Record the same sentences on 2 or 3 different microphones and project the embeddings. Look at how far they are from the rest of the dataset.
- How stable is the encoding w.r.t. the length of the utterance? Compare side by side the embeddings of several utterances cut at 1.6s, 2.4s, 3.2s... Show the distribution of the distance to the centroids w.r.t. the length of the utterances, and the EER
- Analyze the components of the embeddings
...
...
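A minimal PyTorch sketch of the centroid idea above (tensor shapes, normalization and the weighting factor are assumptions, not an actual implementation):

import torch

def centroid_matching_loss(synth_embeds, gt_embeds):
    # synth_embeds, gt_embeds: (n_utterances, embed_dim) tensors of
    # L2-normalized speaker embeddings for one speaker.
    synth_centroid = synth_embeds.mean(dim=0)
    gt_centroid = gt_embeds.mean(dim=0)
    # Squared Euclidean distance between the two centroids.
    return torch.sum((synth_centroid - gt_centroid) ** 2)

# Hypothetical usage: total_loss = base_loss + 0.1 * centroid_matching_loss(s, g)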
------ Things not to forget to write about ------
- The contents of problems.txt, improvements.txt, encoder.txt, synthesizer.txt, questions.txt.
- (slides) Interpretation of EER (see the sketch at the end of these notes)
- Thank mr. Louppe, the authors of GE2E, Joeri, Virgine, github users
- LibriTTS, Improving data efficiency in end-to-end,
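A small sketch of how the EER mentioned above can be computed from verification scores: it is the error rate at the threshold where the false acceptance and false rejection rates are equal. This uses scikit-learn's ROC utilities; the arrays are toy examples, not our data.

import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    # labels: 1 for same-speaker pairs, 0 for different-speaker pairs.
    # scores: similarity scores, higher = more likely same speaker.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # The EER lies where the false positive rate crosses the false
    # negative rate.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

# Toy example: well-separated scores give a low EER.
print(compute_eer(np.array([0, 0, 1, 1]), np.array([0.1, 0.2, 0.8, 0.9])))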