Commit 0ff4376e authored by Corentin Jemine

Thesis ready for submission

Parent 8c45ff4b
\documentclass[a4paper, oneside, 12pt, english]{article}
\usepackage{geometry}
\geometry{
a4paper,
total={154mm,232mm},
left=28mm,
top=32mm,
}
\setlength{\parskip}{0.85em}
\setlength{\parindent}{2.6em}
\usepackage{array}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsfonts}
\usepackage{enumerate}
\usepackage{graphicx}
\usepackage{url}
\usepackage{color}
\usepackage{wrapfig}
\usepackage{systeme}
\usepackage{tabularx}
\usepackage{subfig}
\usepackage{hyperref}
\usepackage[authoryear, round]{natbib}
\usepackage[dvipsnames]{xcolor}
\usepackage{booktabs}
\usepackage{multirow}
\begin{document}
\section*{Real-Time Voice Cloning}
We developed a three-stage deep learning framework that performs voice cloning in real time. The framework follows a 2018 paper from Google for which no public implementation existed before ours. From a speech utterance of only 5 seconds, it captures a meaningful digital representation of the voice being spoken. Given a text prompt, it can then perform text-to-speech with any voice extracted by this process. We reproduced each of the three stages of the framework with our own implementations or open-source ones. We implemented efficient deep learning models and adequate data preprocessing pipelines. We trained these models for weeks or months on large datasets comprising tens of thousands of hours of speech from several thousand speakers, and we analyzed their capabilities and drawbacks. We focused on making the framework operate in real time, that is, on capturing a voice and generating speech in less time than the duration of the generated speech. The framework is able to clone voices it has never heard during training and to generate speech from text it has never seen. We made our code and pretrained models public, and developed a graphical interface to the framework so that it is accessible even to users unfamiliar with deep learning.
\vspace{2cm}
\noindent \textbf{Author}: Corentin Jemine {\small \textit{(C.S. bachelor 2014--2017, master in Data Science 2017--2019)}} \\
\noindent \textbf{Supervisor}: Prof. Gilles Louppe \\
\noindent \textbf{Academic year}: 2018--2019
\end{document}
This diff is collapsed.
Generalized End-To-End Loss For Speaker Verification
------ Todo ------
- Analyze the performance of the encoder on the test set (join vc1_test and vc2_test, they're small)
- Behaviour of the encoder outputs on a sequence (when does it converge, what if another speaker comes in, etc.) -> try to achieve speaker diarization from this
------ Ideas ------
- Add a term in the loss that forces centroid of synthesized utterances to be close to that of ground truth utterances
- How much is the encoding affected by the quality of the microphone? Record the same sentences on 2 or 3 different microphones and project the embeddings. Look at how far they are from the rest of the dataset.
- How stable is the encoding w.r.t. the length of the utterance? Compare side-by-side embeddings of several utterances cut at 1.6s, 2.4s, 3.2s... Show the distribution of the distance to the centroids w.r.t. the length of the utterances, and the EER (see the sketch after this list)
- Analyze the components of the embeddings
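A minimal sketch of the utterance-length experiment above, assuming a hypothetical embed(wav) callable that maps a float waveform to a speaker embedding (names and defaults are illustrative, not the project's actual API):

import numpy as np

def length_stability(embed, wav, sample_rate=16000, lengths_s=(1.6, 2.4, 3.2, 4.0)):
    # Embed prefixes of the waveform cut at several lengths and report the cosine
    # distance of each cut to the centroid of all cuts (larger = less stable).
    embeds = []
    for length in lengths_s:
        e = embed(wav[:int(length * sample_rate)])
        embeds.append(e / np.linalg.norm(e))  # keep embeddings on the unit sphere
    embeds = np.stack(embeds)
    centroid = embeds.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    return {l: float(1 - e @ centroid) for l, e in zip(lengths_s, embeds)}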
@@ -16,7 +11,7 @@ Generalized End-To-End Loss For Speaker Verification
------ Things to not forget to write about ------
- The contents of problems.txt, improvements.txt, encoder.txt, synthesizer.txt, questions.txt.
- (slides) Interpretation of EER (see the sketch after this list)
- Thank Mr. Louppe, the authors of GE2E, Joeri, Virgine, github users
- LibriTTS, Improving data efficiency in end-to-end,
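For the EER item above, a rough sketch of how the equal error rate could be computed from verification scores (illustrative only, not the project's actual evaluation code):

import numpy as np

def equal_error_rate(scores_same, scores_diff):
    # EER: the operating point where the false rejection rate (same-speaker pairs
    # scored below the threshold) equals the false acceptance rate (different-speaker
    # pairs scored at or above it). Higher score = more likely the same speaker.
    thresholds = np.sort(np.concatenate([scores_same, scores_diff]))
    frr = np.array([(scores_same < t).mean() for t in thresholds])
    far = np.array([(scores_diff >= t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(frr - far)))
    return (frr[idx] + far[idx]) / 2, thresholds[idx]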
@@ -20,3 +20,5 @@ Notes on alignment:
- Some users report their number of training steps (https://github.com/Rayhane-mamah/Tacotron-2/issues/175)
Using the L1 loss (section 2.2) didn't change anything.
-> Actually made things way worse because I forgot to scale it, like a complete idiot.
\ No newline at end of file
How about a nonlinear encoding? https://i.imgur.com/xRWn6AE.png
-> talk about linear encoding and mulaw
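The mu-law companding referred to above follows the standard formula (mu = 255 gives 8-bit quantization); a quick sketch, not the project's actual audio code:

import numpy as np

def mulaw_encode(x, mu=255):
    # Compress samples in [-1, 1] onto a nonlinear scale, then quantize to mu + 1 levels.
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((y + 1) / 2 * mu + 0.5).astype(np.int32)

def mulaw_decode(q, mu=255):
    # Inverse mapping: integer levels back to samples in [-1, 1].
    y = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu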
gen_s_mel_raw:
- 110k: 4 3 3 2 3 4 4 3 4 5
- 196k: Pretty good overall. A few artifacts (~1 per utterance)
gen_s_mel_raw_no_pad:
- 110k: 2 3 3 3 3 5 4 4 4 3
- 200k: Pretty good as well. I'm not 100% sure, but I feel that gen_s_mel_raw is a bit better.
gt_s_mel:
- 209k: 5 3 3 3 2 1 1 1 1 1
Adding preemphasis now
-> Models trained with preemphasis sound better with deemphasis, as expected
-> Hard to tell whether they sound better than non-preemphasis models, however
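A small sketch of the pre-emphasis / de-emphasis pair mentioned above; coef = 0.97 is a common default and not necessarily the value used in these experiments:

from scipy.signal import lfilter

def preemphasis(wav, coef=0.97):
    # Applied before feature extraction: y[n] = x[n] - coef * x[n-1] (boosts high frequencies).
    return lfilter([1.0, -coef], [1.0], wav)

def deemphasis(wav, coef=0.97):
    # Inverse filter applied to the generated waveform: y[n] = x[n] + coef * y[n-1].
    return lfilter([1.0], [1.0, -coef], wav)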
NOTES:
- tacotron_model.ckpt-486000 was the model used to generate GTA.
- best loss on mu_law: 2.935 (????)
TODO:
- Pruning
- Check if there is any significant speedup with 100% sparsity
- Begin merging the three projects:
- Fix saving wavs in the synthesizer preprocessor
- Single inference demo (without GUI)
- Single big inference tool (with GUI)
- Single config file (or args???? -> could be both: default to config values)
- Three hparams files (inside packages?)
- Proper versions in requirements.txt (do something different for inference and training?)
- Clean up the rest of the code (' to ")
- Check all TODOs
- Move root of repo to SV2TTS (the rest isn't necessary)
- Put on github (RECHECK ALL REQUIREMENTS (+VLIBS))
- Make demo website
- Write something
- Cite all datasets
- Argparse with default values in config
- Namespace for hparams
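A rough sketch of the last two items above: argparse flags whose defaults come from a config dict, collected into a namespace for hparams-style access (names and values are hypothetical):

import argparse
from types import SimpleNamespace

# Hypothetical defaults; in the project these would live in a config/hparams file.
DEFAULTS = {"datasets_root": "datasets/", "models_dir": "saved_models/", "n_steps": 100000}

def parse_args():
    parser = argparse.ArgumentParser(description="Training entry point")
    for name, value in DEFAULTS.items():
        # Every config value becomes a CLI flag defaulting to the config value,
        # so users can override any of them without editing the config file.
        parser.add_argument("--" + name, type=type(value), default=value)
    return parser.parse_args()

if __name__ == "__main__":
    hparams = SimpleNamespace(**vars(parse_args()))  # attribute access: hparams.n_steps
    print(hparams)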
Some noisy speakers: 40,
Full: 3.7
-I 3.9
......