Commit 0ff4376e authored by Corentin Jemine

Thesis ready for submission

Parent 8c45ff4b
\documentclass[a4paper, oneside, 12pt, english]{article}
\usepackage{geometry}
\geometry{
a4paper,
total={154mm,232mm},
left=28mm,
top=32mm,
}
\setlength{\parskip}{0.85em}
\setlength{\parindent}{2.6em}
\usepackage{array}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsfonts}
\usepackage{enumerate}
\usepackage{graphicx}
\usepackage{url}
\usepackage{color}
\usepackage{wrapfig}
\usepackage{systeme}
\usepackage{tabularx}
\usepackage{subfig}
\usepackage{hyperref}
\usepackage[authoryear, round]{natbib}
\usepackage[dvipsnames]{xcolor}
\usepackage{booktabs}
\usepackage{multirow}
\begin{document}
\section*{Real-Time Voice Cloning}
We developed a three-stage deep learning framework that performs voice cloning in real time. This framework originates from a 2018 paper from Google for which there existed no public implementation before ours. From a speech sample of only 5 seconds, the framework captures a meaningful digital representation of the voice spoken. Given a text prompt, it can then perform text-to-speech using any voice extracted by this process. We reproduced each of the three stages of the framework with our own implementations or open-source ones. We implemented efficient deep learning models and adequate data preprocessing pipelines. We trained these models for weeks or months on large datasets of tens of thousands of hours of speech from several thousands of speakers. We analyzed their capabilities and their drawbacks. We focused on making this framework operate in real time, that is, on making it possible to capture a voice and generate speech in less time than the duration of the generated speech. The framework is able to clone voices it has never heard during training, and to generate speech from text it has never seen. We made our code and pretrained models public, in addition to developing a graphical interface to the framework, so that it is accessible even to users unfamiliar with deep learning.
\vspace{2cm}
\noindent \textbf{Author}: Corentin Jemine {\small \textit{(C.S. bachelor 2014-2017, master in Data Science 2017-2019)}}\\
\noindent \textbf{Supervisor}: Prof. Gilles Louppe\\
\noindent \textbf{Academic year}: 2018-2019
\end{document}
@@ -27,8 +27,6 @@
\usepackage{multirow}
\newcommand{\rw}{\color{red}\textbf{?} \color{black}} % When something needs to be reworded
\begin{document}
\begin{titlepage}
\newcommand{\HRule}{\rule{\linewidth}{0.5mm}}
@@ -73,25 +71,25 @@ Recent advances in deep learning have shown impressive results in the domain of
\section{Introduction}
% What is now possible with deep learning in TTS
Deep learning models have become predominant in many fields of applied machine learning. Text-to-speech (TTS), the process of synthesizing artificial speech from a text prompt, is no exception. Deep models that produce more natural-sounding speech than the traditional concatenative approaches began appearing in 2016. Much of the research focus has since gathered around making these deep models more efficient, more natural-sounding, or trainable in an end-to-end fashion. Inference has gone from being hundreds of times slower than real-time on a GPU \citep{WaveNet} to being possible in real-time on a mobile CPU \citep{WaveRNN}. As for the quality of the generated speech, \citet{Tacotron2} demonstrate near-human naturalness. Interestingly, speech naturalness is best rated with subjective metrics, and comparison with actual human speech leads to the conclusion that there might be such a thing as ``speech more natural than human speech''. In fact, some argue that the human naturalness threshold has already been crossed \citep{MOSNaturalness}.
% The state of things in TTS
Datasets of professionally recorded speech are a scarce resource. Synthesizing a natural voice with correct pronunciation, lively intonation and a minimum of background noise requires training data with the same qualities. Furthermore, data efficiency remains a core issue of deep learning. Training a common text-to-speech model such as Tacotron \citep{Tacotron1} typically requires hundreds of hours of speech. Yet the ability to generate speech with any voice is attractive for a range of applications, be they useful or merely a matter of customization. Research has led to frameworks for voice conversion and voice cloning. They differ in that voice conversion is a form of style transfer on a speech segment from one voice to another, whereas voice cloning consists in capturing the voice of a speaker to perform text-to-speech on arbitrary inputs.
% What voice cloning is about
While the complete training of a single-speaker TTS model is technically a form of voice cloning, the interest rather lies in creating a fixed model able to incorporate new voices with little data. The common approach is to condition a TTS model trained to generalize to new speakers on an embedding of the voice to clone~\citep{DeepVoice2, CloningFewSamples, SV2TTS}. The embedding is low-dimensional and derived by a speaker encoder model that takes reference speech as input. This approach is typically more data-efficient than training a separate TTS model for each speaker, in addition to being orders of magnitude faster and less computationally expensive. Interestingly, there is a large discrepancy among methods in the duration of reference speech needed to clone a voice, ranging from half an hour per speaker to only a few seconds. This factor largely determines how similar the generated voice is to the true voice of the speaker.
% What we want to achieve
Our objective is to achieve a powerful form of voice cloning. The resulting framework must be able to operate in a zero-shot setting, that is, for speakers unseen during training. It should incorporate a speaker's voice with only a few seconds of reference speech. These requirements are shown to be fulfilled by \citet{SV2TTS}. Their results are impressive\footnote{\url{https://google.github.io/tacotron/publications/speaker_adaptation/index.html}}, but not backed by any public implementation. We reproduce their framework and make our implementation open-source\footnote{\url{https://github.com/CorentinJ/Real-Time-Voice-Cloning}}. In addition, we integrate a model based on \citet{WaveRNN} in the framework to make it run in real-time, i.e. to generate speech in a time shorter than or equal to the duration of the produced speech.
% Structure of the thesis
This document is structured as follows. We begin with a short introduction to TTS methods based on machine learning, followed by a review of the evolution of the state of the art in TTS, with speech naturalness as the core metric. We then present the work of \citet{SV2TTS} along with our own implementation. We conclude with a presentation of a toolbox we designed to interface with the framework.
\section{A review of text-to-speech methods in machine learning}
\subsection{Statistical parametric speech synthesis} \label{SPSS}
% The SPSS pipeline
Statistical parametric speech synthesis (SPSS) refers to a group of data-driven TTS methods that emerged in the late 90s. In SPSS, the relationship between the features computed on the input text and the output acoustic features is learned by a statistical generative model (called the acoustic model). A complete SPSS framework thus also includes a pipeline to extract features from the text to synthesize, as well as a system able to reconstruct an audio waveform from the acoustic features produced by the acoustic model (such a system is called a vocoder). Unlike the acoustic model, these two parts of the framework may be entirely engineered and make use of no statistical methods. While modern deep TTS models are usually not referred to as SPSS, the SPSS pipeline as depicted in Figure \ref{spss_framework} applies just as well to those newer methods.
\begin{figure}[h]
\centering
@@ -115,7 +113,7 @@ As is often the case with tasks that involve generating perceptual data such as
\subsection{Evolution of the state of the art in text-to-speech}
The state of the art in SPSS has long remained a hidden Markov model (HMM) based framework \citep{Tokuda-2013}. This approach, laid out in Figure \ref{hmm_spss_framework}, consists in clustering the linguistic features extracted from the input text with a decision tree, and training an HMM per cluster \citep{HMMTTS}. The HMMs are tasked with producing a distribution over spectrogram coefficients, their derivative, their second derivative and a binary flag that indicates which parts of the generated audio should contain voice. With the maximum likelihood parameter generation algorithm (MLPG) \citep{Tokuda-2000}, spectrogram coefficients are sampled from this distribution and eventually fed to the MLSA vocoder \citep{MLSA}. It is possible to modify the generated voice by conditioning the HMMs on a speaker or by tuning the generated speech parameters with adaptation or interpolation techniques \citep{HMMSpeakerInterpolation}. Note that, while this framework used to be state of the art for SPSS, it was still inferior in terms of naturalness of the generated speech compared to the well-established concatenative approaches.
\begin{figure}[h]
\centering
@@ -145,7 +143,7 @@ The state of the art in SPSS has for long remained a hidden Markov model (HMM) b
Improvements to this framework were later brought by feed-forward deep neural networks (DNN), as a result of progress in both hardware and software. \citet{SPSSDNN} propose to replace the decision tree-clustered HMMs entirely with a DNN. They argue for better data efficiency, as the training set is no longer fragmented into different clusters of contexts. They demonstrate improvements in speech quality with a number of parameters similar to that of the HMM-based approach. Later research corroborates these findings \citep{OnTheTrainingAspects, Hashimoto-2015}. The MOS of the different model combinations tried by \citet{Hashimoto-2015} are reported in Table \ref{hashimoto_results}.
\citet{BDLSTMTTS} argue that RNNs make natural acoustic models, as they are able to learn a compact representation of complex and long-span functions. Since RNNs are fit to generate temporally consistent series, the static features can directly be determined by the acoustic model, alleviating the need for dynamic features and MLPG. They compare networks of bidirectional LSTMs against the HMM and DNN based approaches described previously. Their A/B testing results are conclusive; we report them in Figure \ref{dblstm_subjective}.
\begin{figure}[h]
\centering
@@ -183,7 +181,7 @@ The authors compare WaveNet to an older parametric approach and to a concatenati
% Near real-time (accuracy/speed tradeoff)
% No diff when using original WaveNet
Next comes Tacotron \citep{Tacotron1}, a sequence-to-sequence model that produces a spectrogram from a sequence of characters alone, further reducing the need for domain expertise. In this framework, the vocoder is the Griffin-Lim algorithm. Tacotron uses an encoder-decoder architecture where, at each step, the decoder operates on a weighted sum of the encoder outputs. This attention mechanism, described in \citep{Attention}, lets the network decide which steps of the input sequence are important with respect to each step of the output sequence. Tacotron achieves a MOS of 3.85 on a US English dataset, which is more than the 3.69 score obtained by the parametric approach of \citep{LSTM-RNN} but less than the 4.09 score obtained by the concatenative approach of \citep{ConcatenativeGoogle}. The authors mention that Tacotron is merely a step towards a better framework. Subsequently, Tacotron 2 is published \citep{Tacotron2}. The architecture of Tacotron 2 remains that of an encoder-decoder with attention, although several changes to the types of layers are made. The main difference with Tacotron is the use of a modified WaveNet as vocoder. On the same dataset, Tacotron 2 achieves a MOS of 4.53, which compares to 4.58 for human speech (the difference is not statistically significant), achieving the all-time highest MOS for TTS. With A/B testing, Tacotron 2 was found to be only slightly less preferred on average than ground truth samples. These ratings are shown in Figure \ref{tacotron2_results}.
\begin{figure}[h]
\centering
@@ -192,7 +190,7 @@ Follows Tacotron \citep{Tacotron1}, a sequence-to-sequence model that produces a
\label{tacotron2_results}
\end{figure}
For a global comparison of the techniques described in this section, we report in Figure \ref{mos_all} the MOS of each state-of-the-art method. Note that we have more than one result for human speech, as at least two studies evaluate it independently. The MOS for human speech is consistent across both, and the difference between them is not statistically significant. Note, however, that the two studies have authors in common and thus might share a bias. Slightly different evaluation methods of the MOS may yield different results, which is one of the reasons why the MOS is subject to criticism.
\begin{figure}[h]
\centering
@@ -209,15 +207,15 @@ For a global comparison of the techniques described in this section, we report i
\section{Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis}
\subsection{Overview}
Our approach to real-time voice cloning is largely based on \citep{SV2TTS} (referred to as SV2TTS throughout this document). It describes a framework for zero-shot voice cloning that only requires 5 seconds of reference speech. This paper is only one of the many publications from the Tacotron series\footnote{\url{https://google.github.io/tacotron/}} authored at Google. Interestingly, the SV2TTS paper does not bring much innovation of its own; rather, it builds on three major earlier works from Google: the GE2E loss \citep{GE2E}, Tacotron \citep{Tacotron1} and WaveNet \citep{WaveNet}. The complete framework is a three-stage pipeline, where the stages correspond to the models listed in order previously. Many of the current TTS tools and functionalities provided by Google, such as the Google Assistant\footnote{\url{https://assistant.google.com/}} or the Google cloud services\footnote{\url{https://cloud.google.com/text-to-speech/}}, make use of these same models. While there are many open-source reimplementations of these models online, there is none of the SV2TTS framework to our knowledge (as of May 2019).
The three stages of the framework are as follows:
\begin{itemize}
\item A speaker encoder that derives an embedding from a short utterance of a single speaker. The embedding is a meaningful representation of the voice of the speaker, such that similar voices are close in latent space. This model is described in \citep{GE2E} (referred to as GE2E throughout this document) and \citep{TE2E}.
\item A synthesizer that, conditioned on the embedding of a speaker, generates a spectrogram from text. This model is the popular Tacotron 2 \citep{Tacotron2} without WaveNet (which is often referred to as just Tacotron due to its similarity to the first iteration).
\item A vocoder that infers an audio waveform from the spectrograms generated by the synthesizer. The authors used WaveNet \citep{WaveNet} as a vocoder, effectively reapplying the entire Tacotron 2 framework.
\end{itemize}
At inference time, the speaker encoder is fed a short reference utterance of the speaker to clone. It generates an embedding that is used to condition the synthesizer, and a text processed as a phoneme sequence is given as input to the synthesizer. The vocoder takes the output of the synthesizer to generate the speech waveform. This is illustrated in Figure \ref{sv2tts_framework}.
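As an illustration of how the three stages chain together at inference time, the following Python sketch outlines the pipeline. The method names (\texttt{embed\_utterance}, \texttt{synthesize}, \texttt{infer\_waveform}) are hypothetical placeholders for the interfaces of the three trained models, not the verbatim API of our implementation.
\begin{verbatim}
import numpy as np

def clone_voice(encoder, synthesizer, vocoder,
                reference_wav: np.ndarray, text: str) -> np.ndarray:
    """Generate speech for `text` in the voice heard in `reference_wav`."""
    # Stage 1: speaker encoder -> fixed-size utterance embedding (e.g. 256-d).
    embedding = encoder.embed_utterance(reference_wav)
    # Stage 2: synthesizer conditioned on the embedding -> mel spectrogram.
    mel = synthesizer.synthesize(text, embedding)
    # Stage 3: vocoder -> audio waveform inferred from the mel spectrogram.
    return vocoder.infer_waveform(mel)
\end{verbatim}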
\begin{figure}[h]
\centering
@@ -241,11 +239,11 @@ In the following subsection, we formally define the task that SV2TTS aims to sol
\newcommand{\enc}{\mathcal{E}}
\newcommand{\syn}{\mathcal{S}}
\newcommand{\voc}{\mathcal{V}}
Consider a dataset of utterances grouped by their speaker. We denote the $j$th utterance of the $i$th speaker as $\vu_{ij}$. Utterances are in the waveform domain. We denote by $\vx_{ij}$ the log-mel spectrogram of the utterance $\vu_{ij}$. A log-mel spectrogram is a deterministic, non-invertible (lossy) function that extracts speech features from a waveform, so as to handle speech in a more tractable fashion in machine learning.
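For illustration, the following Python sketch shows how such a log-mel spectrogram can be computed with the librosa package. The parameter values (40 mel channels, a 25 ms window and an assumed 10 ms step at 16 kHz) correspond to the features used by our speaker encoder; the function name and exact constants are illustrative rather than a verbatim excerpt of our code.
\begin{verbatim}
import librosa
import numpy as np

def log_mel_spectrogram(wav: np.ndarray, sr: int = 16000, n_mels: int = 40,
                        win_ms: int = 25, hop_ms: int = 10) -> np.ndarray:
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr,
        n_fft=int(sr * win_ms / 1000),       # 25 ms analysis window
        hop_length=int(sr * hop_ms / 1000),  # 10 ms step (assumed)
        n_mels=n_mels,
    )
    # The log compresses the dynamic range; phase is discarded, so the
    # transform is lossy and cannot be inverted exactly.
    return np.log(mel + 1e-6)                # shape: (n_mels, n_frames)
\end{verbatim}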
The encoder $\enc$ computes the embedding $\ve_{ij} = \enc(\vx_{ij}; \vw_\enc)$ corresponding to the utterance $\vu_{ij}$, where $\vw_\enc$ are the parameters of the encoder. Additionally, the authors define a speaker embedding as the centroid of the embeddings of the speaker's utterances:
\begin{equation}\label{speaker_embedding}
\vc_i=\frac{1}{n}\sum_{j=1}^{n}\ve_{ij}
\end{equation}
The synthesizer $\syn$, parametrized by $\vw_\syn$, is tasked to approximate $\vx_{ij}$ given $\vc_i$ and $\vt_{ij}$, the transcript of utterance $\vu_{ij}$. We have $\hat\vx_{ij} = \syn(\vc_i, \vt_{ij}; \vw_\syn)$. In our implementation, we directly use the utterance embedding rather than the speaker embedding (we motivate this choice in section \ref{synthesizer}), giving instead $\hat\vx_{ij} = \syn(\ve_{ij}, \vt_{ij}; \vw_\syn)$.
@@ -256,8 +254,8 @@ One could train this framework in an end-to-end fashion with the following objec
$$ \min_{\vw_\enc, \vw_\syn, \vw_\voc} L_\voc(\vu_{ij}, \voc(\syn(\enc(\vx_{ij}; \vw_\enc), \vt_{ij}; \vw_\syn); \vw_\voc)) $$
where $L_\voc$ is a loss function in the waveform domain. This approach has drawbacks:
\begin{itemize}
\item It requires training all three models on a same dataset, meaning that this dataset would ideally need to meet the requirements of all models: a large number of speakers for the encoder, but at the same time transcripts for the synthesizer, a low noise level for the synthesizer, and yet a moderate noise level for the encoder (so that it learns to handle noisy input speech). These conflicts are problematic and would lead to training models that could perform better if trained separately on distinct datasets. Specifically, a small dataset will likely lead to poor generalization and thus poor zero-shot performance.
\item The convergence of the combined model could be very hard to reach. In particular, the Tacotron synthesizer could take a significant time before producing correct alignments.
\end{itemize}
An evident way of addressing the second issue is to separate the training of the synthesizer and of the vocoder. Assuming a pretrained encoder, the synthesizer can be trained to directly predict the mel spectrograms of the target audio:
@@ -290,13 +288,13 @@ The inputs to the model are 40-channels log-mel spectrograms with a 25ms window
\subsubsection{Generalized End-to-End loss}
The speaker encoder is trained on a speaker verification task. Speaker verification is a typical application of biometrics where the identity of a person is verified through their voice. A template is created for a person by deriving their speaker embedding (see equation \ref{speaker_embedding}) from a few utterances. This process is called enrollment. At runtime, users identify themselves with a short utterance and the system compares the embedding of that utterance with the enrolled speaker embeddings. Above a given similarity threshold, the user is identified. The GE2E loss simulates this process to optimize the model.
At training time, the model computes the embeddings $\ve_{ij}\ (1 \leq i \leq N, 1 \leq j \leq M)$ of $M$ utterances of fixed duration from $N$ speakers. A speaker embedding $\vc_i$ is derived for each speaker: $\vc_i=\frac{1}{M}\sum_{j=1}^{M}\ve_{ij}$. The similarity matrix $\ms_{ij,k}$ is the result of the two-by-two comparison of all embeddings $\ve_{ij}$ against every speaker embedding $\vc_{k}\ (1 \leq k \leq N)$ in the batch. This measure is the scaled cosine similarity:
\begin{equation} \label{similarity_simple}
\ms_{ij,k} = w \cdot \cos(\ve_{ij}, \vc_{k}) + b = w \cdot \ve_{ij} \cdot \frac{\vc_{k}}{||\vc_{k}||_2} + b
\end{equation}
where $w$ and $b$ are learnable parameters. This entire process is illustrated in Figure \ref{sim_matrix}. From a computing perspective, the cosine similarity of two L2-normed vectors is simply their dot product, hence the rightmost side of equation \ref{similarity_simple}. An optimal model is expected to output high similarity values when an utterance matches the speaker $(i = k)$ and lower values elsewhere $(i \neq k)$. To optimize in this direction, the loss is the sum of row-wise softmax losses.
\begin{figure}[h]
\centering
@@ -318,7 +316,7 @@ where the exclusive centroids $\vc_{i}^{(-j)}$ are defined as:
\vc_{i}^{(-j)} = \frac{1}{M-1}\sum_{\substack{m=1\\m\neq j}}^{M}\ve_{im}
\end{equation}
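The following NumPy sketch illustrates how the similarity matrix and the GE2E loss can be computed in a vectorized fashion, assuming a batch of L2-normalized embeddings of shape $(N, M, D)$. The diagonal entries $(k = i)$ are replaced by similarities against the exclusive centroids defined above. The initial values $w=10$ and $b=-5$ are assumptions for illustration; in practice the computation runs inside an automatic differentiation framework so that gradients flow through it.
\begin{verbatim}
import numpy as np

def ge2e_loss(embeds: np.ndarray, w: float = 10.0, b: float = -5.0) -> float:
    """embeds: (N speakers, M utterances, D), rows assumed L2-normalized."""
    N, M, D = embeds.shape
    # Inclusive centroids c_k and exclusive centroids c_i^(-j).
    centroids = embeds.mean(axis=1)                                 # (N, D)
    excl = (embeds.sum(axis=1, keepdims=True) - embeds) / (M - 1)   # (N, M, D)
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    excl /= np.linalg.norm(excl, axis=2, keepdims=True)

    # Scaled cosine similarity of every embedding against every centroid.
    sim = np.einsum("ijd,kd->ijk", embeds, centroids)               # (N, M, N)
    # Entries where k == i use the exclusive centroid instead.
    sim[np.arange(N), :, np.arange(N)] = np.einsum("ijd,ijd->ij", embeds, excl)
    sim = w * sim + b

    # Row-wise softmax loss: each utterance should best match its own speaker.
    log_softmax = sim - np.log(np.exp(sim).sum(axis=2, keepdims=True))
    return float(-log_softmax[np.arange(N), :, np.arange(N)].sum())
\end{verbatim}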
The fixed duration of the utterances in a training batch is 1.6 seconds. These are partial utterances sampled from the longer complete utterances in the dataset. While the model architecture is able to handle inputs of variable length, it is reasonable to expect that it performs best with utterances of the same duration as those seen in training. Therefore, at inference time an utterance is split into segments of 1.6 seconds overlapping by 50\%, and the encoder forwards each segment individually. The resulting outputs are averaged then normalized to produce the utterance embedding. This is illustrated in Figure \ref{encoder_inference}. Curiously, the authors of SV2TTS advocate for 800ms windows at inference time but still 1.6-second windows during training. We prefer to keep 1.6 seconds for both, as is done in GE2E.
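The sketch below illustrates this inference procedure on a raw waveform, assuming a function \texttt{embed\_window} that stands for a forward pass of the trained encoder on a single 1.6-second window; in practice the splitting can equivalently be performed on mel spectrogram frames.
\begin{verbatim}
import numpy as np

def embed_utterance(wav: np.ndarray, embed_window, sr: int = 16000,
                    window_s: float = 1.6, overlap: float = 0.5) -> np.ndarray:
    win = int(window_s * sr)
    hop = int(win * (1 - overlap))
    # Pad utterances shorter than a single window.
    if len(wav) < win:
        wav = np.pad(wav, (0, win - len(wav)))
    # Embed each 1.6 s window, then average and re-normalize.
    starts = range(0, len(wav) - win + 1, hop)
    partials = np.stack([embed_window(wav[s:s + win]) for s in starts])
    raw = partials.mean(axis=0)
    return raw / np.linalg.norm(raw)
\end{verbatim}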
\begin{figure}[h]
\centering
@@ -330,7 +328,7 @@ The fixed duration of the utterances in a training batch is of 1.6 seconds. Thes
The authors use $N = 64$ and $M = 10$ as batch size parameters. When enrolling a speaker in a practical application, one should expect to have several utterances from each user, but likely not an order of magnitude above 10, so this choice is reasonable. As for the number of speakers, it is good to observe that the time complexity of computing the similarity matrix is $O(N^2M)$. Therefore this parameter should not be chosen too large, so as not to substantially slow down training, as opposed to simply picking the largest batch size that fits on the GPU. It is still of course possible to parallelize multiple batches on the same GPU while synchronizing the operations across batches for efficiency. We found it particularly important to vectorize all operations when computing the similarity matrix, so as to minimize the number of GPU transactions.
\subsubsection{Experiments}
To avoid segments that are mostly silent when sampling partial utterances from complete utterances, we use the webrtcvad\footnote{\url{https://github.com/wiseman/py-webrtcvad}} Python package to perform Voice Activity Detection (VAD). This yields a binary flag over the audio corresponding to whether or not each segment is voiced. We perform a moving average on this binary flag to smooth out short spikes in the detection, and then binarize it again. Finally, we perform a dilation on the flag with a kernel size of $s + 1$, where $s$ is the maximum silence duration tolerated. The audio is then trimmed of the unvoiced parts. We found the value $s=0.2$s to be a good choice that retains a natural speech prosody. This process is illustrated in Figure \ref{encoder_preprocess_vad}. A last preprocessing step applied to the audio waveforms is normalization, to make up for the varying volume of the speakers in the dataset.
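A simplified Python sketch of this VAD-based trimming is given below, assuming 16 kHz mono audio in $[-1, 1]$. The frame length, smoothing width and the exact interpretation of the dilation kernel are illustrative choices, not a verbatim excerpt of our preprocessing code; webrtcvad expects 10, 20 or 30 ms frames of 16-bit PCM.
\begin{verbatim}
import numpy as np
import webrtcvad

def trim_long_silences(wav: np.ndarray, sr: int = 16000, frame_ms: int = 30,
                       smooth_width: int = 8,
                       max_silence_s: float = 0.2) -> np.ndarray:
    vad = webrtcvad.Vad(3)                  # most aggressive detection mode
    spf = sr * frame_ms // 1000             # samples per 30 ms frame
    n_frames = len(wav) // spf
    pcm = (wav[:n_frames * spf] * 32767).astype(np.int16)

    # 1) Binary voiced/unvoiced flag for every frame.
    flags = np.array([vad.is_speech(pcm[i * spf:(i + 1) * spf].tobytes(), sr)
                      for i in range(n_frames)], dtype=float)
    # 2) Moving average to smooth short detection spikes, then re-binarize.
    flags = np.convolve(flags, np.ones(smooth_width) / smooth_width, "same") > 0.5
    # 3) Dilate the voiced regions so pauses up to max_silence_s are kept.
    width = int(max_silence_s * 1000 / frame_ms) + 1
    flags = np.convolve(flags.astype(float), np.ones(width), "same") > 0

    # Keep only the samples whose frame is marked as voiced.
    return wav[:n_frames * spf][np.repeat(flags, spf)]
\end{verbatim}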
\begin{figure}[h]
\centering
@@ -374,7 +372,7 @@ The authors test different combinations of these datasets and observe the effect
These results indicate that the number of speakers is strongly correlated with good performance, not only of the encoder on the verification task, but also of the entire framework with respect to the quality of the generated speech and its ability to clone a voice. The small jump in naturalness, similarity and EER gained by including VoxCeleb2 could possibly indicate that the variation of languages hurts the training. The internal dataset of the authors is a proprietary voice search corpus from 18k English speakers. The encoder trained on this dataset performs significantly better; however, we only have access to public datasets. We thus proceed with LibriSpeech-Other, VoxCeleb1 and VoxCeleb2.
We train the speaker encoder for one million steps. To monitor the training, we report the EER and we observe the ability of the model to cluster speakers. We periodically sample a batch of 10 speakers with 10 utterances each, compute the utterance embeddings and project them into a two-dimensional space with UMAP \citep{UMAP}. As embeddings of different speakers are expected to be further apart in the latent space than embeddings from the same speaker, clusters of utterances from the same speaker are expected to form as the training progresses. We report our UMAP projections in Figure \ref{training_umap}, where this behaviour can be observed.
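This monitoring step can be sketched as follows with the umap-learn and matplotlib packages; the plotting details are illustrative.
\begin{verbatim}
import numpy as np
import umap
import matplotlib.pyplot as plt

def plot_embedding_projections(embeds: np.ndarray, n_speakers: int = 10,
                               utterances_per_speaker: int = 10) -> None:
    """embeds: (n_speakers * utterances_per_speaker, embedding_dim)."""
    projected = umap.UMAP().fit_transform(embeds)      # (n_utterances, 2)
    speaker_ids = np.repeat(np.arange(n_speakers), utterances_per_speaker)
    plt.scatter(projected[:, 0], projected[:, 1], c=speaker_ids,
                cmap="tab10", s=15)
    plt.title("UMAP projection of utterance embeddings")
    plt.gca().set_aspect("equal")
    plt.show()
\end{verbatim}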
\begin{figure}[h]
\centering
@@ -387,7 +385,7 @@ As mentioned before, the authors have trained their model for 50 million steps o
The resulting model nonetheless yields very strong results. In fact, we computed the test set EER to be $\mathbf{4.5\%}$. This is an astonishingly low value in light of the $10.14\%$ reported by the authors on the same set with 50 times more steps. We do not know whether our model is actually performing that well or whether the EER computation procedure of the authors is different enough from ours to produce values so far apart.
We find the clustering in latent space produced by our model to be impressively robust and to generalize well. In all our tests, the UMAP projections perfectly separate utterances from the test set of each of the three datasets, with large inter-cluster distances and small intra-cluster variance. An example is given in Figure \ref{umap_projections}. The test set used for this evaluation is the combination of the test sets of LibriSpeech, VoxCeleb1 and VoxCeleb2. Speakers annotated with an F are female speakers, those with an M are male speakers. We compare our results with \citep[Figure 3]{SV2TTS}. We find that our projections also separate the gender of speakers linearly in the projected space. In the authors' figure the utterances are from LibriSpeech only. We include VoxCeleb1 and VoxCeleb2 as well, with some of the randomly selected speakers not speaking English. Note that our clusters are denser than those of the authors, but this is only a result of using a different dimensionality reduction technique. We find results similar to theirs when using t-SNE \citep{TSNE}.
\begin{figure}[h]
\centering
@@ -408,17 +406,17 @@ We find the clustering in latent space produced by our model to be impressively
The Equal Error Rate (EER) is a measure typically used in biometric systems to evaluate the accuracy of the system. It is the value of the false positive rate when it is equal to the false negative rate. Equating those terms is done by varying the similarity threshold above which a user is recognized by the biometric system. The authors of SV2TTS do not make mention of their procedure to evaluate the EER. This is problematic, as the EER is tricky to compute and highly depends on the number of enrollment utterances chosen. We refer to GE2E and use 6 utterances for enrollment, which we compare to 7 test utterances. For different numbers of utterances, refer to Figure \ref{test_eer}. The authors do not mention either whether the utterances they use (both for enrollment and test) are complete or partial utterances. In light of our results, we use partial utterances; but complete utterances would yield an even lower EER.
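For reference, the sketch below shows one common way of computing an EER from a set of verification trials with scikit-learn, given a similarity score and a same-speaker/impostor label per trial; it illustrates the measure itself rather than our exact enrollment and scoring protocol.
\begin{verbatim}
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """scores: similarity per trial; labels: 1 same-speaker, 0 impostor."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    # The EER lies where the false positive and false negative rates cross.
    idx = np.nanargmin(np.abs(fpr - fnr))
    return float((fpr[idx] + fnr[idx]) / 2)
\end{verbatim}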
Due to the lack of information provided by the authors of SV2TTS and as we found no reliable source indicating how the EER is computed for such a system, we cannot guarantee the correctness nor fairness of the comparison of our results against those of the authors. Independently of them however, our tests certainly demonstrate that our speaker encoder is performing very well on the task of speaker verification. With that taken into account, we are confident that the speaker encoder is generating meaningful embeddings.
On the topic of inference speed, the encoder is by far the fastest of the three models as it operates at approximately 1000$\times$ real-time on our GTX 1080 GPU. In fact, the execution time is bounded by the GPU I/O for all utterances of our dataset.
\subsection{Synthesizer} \label{synthesizer}
The synthesizer is Tacotron 2 without WaveNet \citep{WaveNet}. We use an open-source TensorFlow implementation\footnote{\url{https://github.com/Rayhane-mamah/Tacotron-2}} of Tacotron 2 from which we strip WaveNet and implement the modifications added by SV2TTS.
\subsubsection{Model architecture}
We briefly present the top-level architecture of the modified Tacotron 2 without WaveNet (which we will simply refer to as Tacotron). For further details, we invite the reader to take a look at the Tacotron papers \citep{Tacotron2, Tacotron1}.
Tacotron is a recurrent sequence-to-sequence model that predicts a mel spectrogram from text. It features an encoder-decoder structure (not to be mistaken with the speaker encoder of SV2TTS) that is bridged by a location-sensitive attention mechanism \citep{Attention2}. Individual characters from the text sequence are first embedded as vectors. Convolutional layers follow, so as to increase the span of a single encoder frame. These frames are passed through a bidirectional LSTM to produce the encoder output frames. This is where SV2TTS brings a modification to the architecture: a speaker embedding is concatenated to every frame that is output by the Tacotron encoder. The attention mechanism attends to the encoder output frames to generate the decoder input frames. Each decoder input frame is concatenated with the previous decoder frame output passed through a pre-net, making the model autoregressive. This concatenated vector goes through two unidirectional LSTM layers before being projected to a single mel spectrogram frame. Another projection of the same vector to a scalar allows the network to predict on its own that it should stop generating frames by emitting a value above a certain threshold. The entire sequence of frames is passed through a residual post-net before it becomes the mel spectrogram. This architecture is represented in Figure \ref{tacotron2_arch}.
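The SV2TTS modification can be summarized in a few lines: the speaker (or utterance) embedding is broadcast over time and concatenated to each encoder output frame before attention. The NumPy sketch below uses illustrative shapes and names, not our actual implementation.
\begin{verbatim}
import numpy as np

def condition_encoder_outputs(encoder_outputs: np.ndarray,
                              speaker_embed: np.ndarray) -> np.ndarray:
    """encoder_outputs: (n_text_steps, enc_dim); speaker_embed: (embed_dim,)."""
    tiled = np.tile(speaker_embed, (encoder_outputs.shape[0], 1))
    # Every encoder frame now carries the voice information alongside the text.
    return np.concatenate([encoder_outputs, tiled], axis=1)
\end{verbatim}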
\begin{figure}[h]
\centering
@@ -449,7 +447,7 @@ In SV2TTS, the authors consider two datasets for training both the synthesizer a
\label{libri_vctk_cross}
\end{table}
Following the preprocessing recommendations of the authors, we use an Automatic Speech Recognition (ASR) model to force-align the LibriSpeech transcripts to the audio. We found the Montreal Forced Aligner\footnote{\url{https://montreal-forced-aligner.readthedocs.io/en/latest/}} to perform well on this task. We have also made a cleaner version of these alignments public\footnote{\url{https://github.com/CorentinJ/librispeech-alignments}} to save some time for other users in need of them. With the audio aligned to the text, we split utterances on silences longer than 0.4 seconds. This helps the synthesizer to converge, both because of the removal of silences in the target spectrogram and because of the reduction of the median duration of the utterances in the dataset, as shorter sequences offer less room for timing errors. We ensure that utterances are not shorter than 1.6 seconds, the duration of the partial utterances used for training the encoder, and not longer than 11.25 seconds, so as to save GPU memory during training. When possible, we do not split on a silence that would create an utterance that is too short or too long. The distribution of the length of the utterances in the dataset is plotted in Figure \ref{librispeech_durations}. Note how long silences already account for 64 hours (13.7\%) of the dataset.
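A simplified sketch of this splitting logic is given below. It assumes the forced alignment provides a list of word spans \texttt{(start\_s, end\_s)}, with silences corresponding to the gaps between spans; the real preprocessing also trims the silences themselves and handles edge cases that this sketch ignores.
\begin{verbatim}
from typing import List, Tuple

def split_on_silences(words: List[Tuple[float, float]],
                      max_silence: float = 0.4, min_len: float = 1.6,
                      max_len: float = 11.25) -> List[Tuple[float, float]]:
    segments, seg_start, prev_end = [], words[0][0], words[0][1]
    for start, end in words[1:]:
        silence = start - prev_end
        long_enough = prev_end - seg_start >= min_len
        would_overflow = end - seg_start > max_len
        # Split on a long silence once the segment is long enough, or when
        # adding the next word would exceed the maximum duration.
        if (silence > max_silence and long_enough) or would_overflow:
            segments.append((seg_start, prev_end))
            seg_start = start
        prev_end = end
    segments.append((seg_start, prev_end))
    return segments
\end{verbatim}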
\begin{figure}[h]
\centering
@@ -458,7 +456,7 @@ Following the preprocessing recommendations of the authors, we use an Automatic
\label{librispeech_durations}
\end{figure}
Isolating the silences by force-aligning the text to the utterances additionally allows us to create a profile of the noise for all utterances of the same speaker. We use a Python implementation\footnote{\url{https://github.com/wilsonchingg/logmmse}} of the LogMMSE algorithm \citep{LogMMSE}. LogMMSE cleans an audio speech segment by profiling the noise in the earliest few frames (which will usually not contain speech yet) and by continuously updating this noise profile on non-speech frames throughout the utterance. We adapt this implementation to profile the noise and to clean the speech in two separate steps. In line with the authors, we found this additional preprocessing step to greatly help in reducing the background noise of the synthesized spectrograms.
In SV2TTS, the embeddings used to condition the synthesizer at training time are speaker embeddings. We argue that utterance embeddings of the same target utterance make for a more natural choice instead. At inference time, utterance embeddings are also used. While the space of utterance and speaker embeddings is the same, speaker embeddings are not L2-normalized. This difference in domain should be small and have little impact on the synthesizer that uses the embedding, as the authors agreed when we asked them about it. However, they do not mention how many utterance embeddings are used to derive a speaker embedding. One would expect that all available utterances should be used; but with a larger number of utterance embeddings, the average vector (the speaker embedding) strays further from its normalized version. Furthermore, the authors mention themselves that there are often large variations of tone and pitch within the utterances of a same speaker in the dataset, as they mimic different characters \citep[Appendix B]{SV2TTS}. Utterances have lower intra-variation, as their scope is limited to at most a sentence. Therefore, the embedding of an utterance is expected to be a more accurate representation of the voice spoken in the utterance than the embedding of the speaker. This holds if the utterance is long enough to produce a meaningful embedding. While the ``optimal'' duration of reference speech was found to be 5 seconds, the embedding is shown to be already meaningful with only 2 seconds of reference speech (see Table \ref{reference_speech_duration}). We believe that with utterances no shorter than the duration of partial utterances (1.6s), the utterance embedding should be sufficient for a meaningful capture of the voice, hence we used utterance embeddings for training the synthesizer.
@@ -484,7 +482,7 @@ In SV2TTS, the embeddings used to condition the synthesizer at training time are
We train the synthesizer for 150k steps, with a batch size of 144 across 4 GPUs. The number of decoder outputs per step is set to 2, as is done in Tacotron 1. We found the model to either not converge or perform poorly with one output per decoder step. The loss function is the L2 loss between the predicted and ground truth mel spectrograms. During training, the model is set in Ground Truth Aligned (GTA) mode (also called teacher-forcing mode), where the input to the pre-net is the previous frame of the ground truth spectrogram instead of the predicted one. With GTA, the pitch and prosody of the generated spectrogram are aligned with the ground truth, allowing for a shared context between the prediction and the ground truth as well as faster convergence. Without GTA, the synthesizer would generate different variations of the same utterance given a fixed text and embedding input (as is the case at inference time).
As was discussed in section \ref{SPSS} and as is also the case for the vocoder, it is difficult to provide any quantitative assessment of the performance of the model. We can observe that the model is producing correct outputs through informal listening tests, but a formal evaluation would require us to set up subjective scoring polls to derive the MOS. While some of the authors we referred to could do so, this is beyond our reach. In the case of the synthesizer, however, one can also verify that the alignments generated by the attention module are correct. We plot an example in Figure \ref{tacotron_alignment}. Notice how the number of decoder steps (223) matches the number of predicted frames (446) divided by the number of decoder outputs per step (2). Notice also how the predicted spectrogram is smoother than the ground truth, a typical behaviour of a model predicting the mean in the presence of noise.
\begin{figure}[h]
\centering
% includegraphics omitted
\caption{Example of an attention alignment produced by the synthesizer (223 decoder steps), along with the corresponding predicted mel spectrogram (446 frames) and the ground truth spectrogram.}
\label{tacotron_alignment}
\end{figure}
Before training the vocoder, we can evaluate some aspects of the trained synthesizer by using Griffin-Lim \citep{GriffinLim} as the vocoder. Griffin-Lim is not a machine learning model but an iterative algorithm that estimates the source audio signal of a spectrogram. Audio generated this way typically conserves few of the voice characteristics of the speaker, but the speech is intelligible. The speech generated by the synthesizer correctly matches the text, even in the presence of complex or fictitious words. The prosody is however sometimes unnatural, with pauses at unexpected locations in the sentence, or missing pauses where they are expected. This is particularly noticeable with the embeddings of speakers who talk slowly, showing that the speaker encoder does capture some form of prosody. The lack of punctuation in LibriSpeech is partially responsible for this, forcing the model to infer punctuation from the text alone. This issue was highlighted by the authors as well, and can be heard on some of their samples\footnote{\url{https://google.github.io/tacotron/publications/speaker_adaptation/index.html}} of LibriSpeech speakers. The limits we imposed on the duration of utterances in the dataset (1.6s -- 11.25s) are likely also problematic: sentences that are too short are stretched out with long pauses, and for those that are too long the voice is rushed. When generating several sentences at inference time, we need to manually insert breaks to delimit where to split the input text, so as to synthesize the spectrogram in multiple parts. This has the advantage of creating a batch of inputs rather than one long input, allowing for fast inference.
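For reference, a mel spectrogram can be inverted with Griffin-Lim in a few lines using the librosa package. The parameter values below are illustrative and do not match our preprocessing pipeline, and the input is assumed to be a mel spectrogram in linear power scale, so the log-scaled output of the synthesizer must be de-normalized first.

\begin{verbatim}
import librosa

def griffin_lim_vocode(mel, sr=16000, n_fft=800, hop_length=200):
    # Maps the mel spectrogram back to a linear spectrogram, then runs
    # the iterative Griffin-Lim phase estimation to recover a waveform.
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length, n_iter=60)
\end{verbatim}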
We can further observe how some voice features are lost with Griffin-Lim by computing the embeddings of synthesized speech and projecting them with UMAP alongside ground truth embeddings. An example is given in Figure \ref{projections_griffin}. We observe that the clusters of synthesized embeddings lie close to their respective ground truth clusters. The loss of emerging features is also visible: e.g., for the pink, red and the two blue speakers, the synthesized utterances are more tightly clustered than their ground truth counterparts. This phenomenon occurs with the gray and purple speakers as well.
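The projection itself only takes a few lines with the umap-learn package; one way of producing such a plot is sketched below, assuming the embedding arrays have been computed beforehand with the speaker encoder.

\begin{verbatim}
import numpy as np
import umap
import matplotlib.pyplot as plt

def plot_projections(gt_embeds, synth_embeds):
    # Fit UMAP on both sets jointly so they share the same 2D space
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1)
    projs = reducer.fit_transform(np.concatenate([gt_embeds, synth_embeds]))
    n = len(gt_embeds)
    plt.scatter(projs[:n, 0], projs[:n, 1], marker="o", label="ground truth")
    plt.scatter(projs[n:, 0], projs[n:, 1], marker="x", label="synthesized")
    plt.legend()
    plt.show()
\end{verbatim}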
\begin{figure}[h]
\centering
% includegraphics omitted
\caption{UMAP projections of the embeddings of ground truth utterances and of utterances synthesized with Griffin-Lim as the vocoder, for several speakers.}
\label{projections_griffin}
\end{figure}

Tacotron usually operates faster than real-time.
\subsection{Vocoder} \label{vocoder}
In SV2TTS and in Tacotron 2, WaveNet is the vocoder. WaveNet has been at the heart of deep learning for audio since its release and remains the state of the art when it comes to voice naturalness in TTS. It is, however, also known for being the slowest practical deep learning architecture at inference time. Several later papers brought improvements on that aspect, bringing generation near or faster than real-time, e.g. \citep{ParallelWaveNet, FastWaveNet, WaveRNN}, with little to no hit to the quality of the generated speech. Nonetheless, WaveNet remains the vocoder in SV2TTS, as speed is not the main concern and because Google's own WaveNet implementation with various improvements already generates at 8000 samples per second \citep[page~2]{WaveRNN}. This is in contrast with ``vanilla'' WaveNet, which generates at 172 steps per second at best \citep[page 7]{ParallelWaveNet}. At the time of writing this thesis, most open-source implementations of WaveNet are still vanilla implementations.
\citet{WaveRNN} propose a simple scheme for describing the inference speed of autoregressive models. Given a target vector $\mathbf{u}$ with $|\mathbf{u}|$ samples to predict, the total inference time $T(\mathbf{u})$ can be decomposed as:
$$ T(\mathbf{u}) = |\mathbf{u}|\sum_{i=1}^{N}(c(op_i) + d(op_i)) $$
where $N$ is the number of matrix-vector products ($\propto$ the number of layers) required to produce one sample, $c(op_i)$ is the computation time of layer $i$ and $d(op_i)$ is the overhead of the computation (typically I/O operations) for layer $i$. Note that standard sampling rates for speech include 16kHz, 22.05kHz and 24kHz (while music is usually sampled at 44.1kHz), meaning that for just 5 seconds of audio $|\mathbf{u}|$ is close to 100,000 samples. The standard WaveNet architecture accounts for three stacks of 10 residual blocks of two layers each, leading to $N = 60$.
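As an order-of-magnitude check of this decomposition, the number of sequential matrix-vector products needed for 5 seconds of 16kHz audio can be computed directly, for a vanilla WaveNet ($N = 60$) and for WaveRNN ($N = 5$):

\begin{verbatim}
sample_rate, duration = 16000, 5
u = sample_rate * duration           # |u| = 80,000 samples

for name, N in [("WaveNet", 60), ("WaveRNN", 5)]:
    print(name, u * N)               # sequential matrix-vector products
# WaveNet: 4,800,000    WaveRNN: 400,000
\end{verbatim}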
WaveRNN, the model proposed in \citep{WaveRNN}, improves on WaveNet by reducing not only the contribution of $N$ but also those of $|\mathbf{u}|$, $c(op_i)$ and $d(op_i)$. The vocoder model we use is an open-source PyTorch implementation\footnote{\url{https://github.com/fatchord/WaveRNN}} that is based on WaveRNN but features quite a few different design choices made by GitHub user fatchord. We refer to this architecture as the ``alternative WaveRNN''.
\subsubsection{Model architecture}
In WaveRNN, the 60 convolutions of WaveNet are replaced by a single GRU layer \citep{GRU}. The authors maintain that the high non-linearity of a single GRU layer is enough to encompass the complexity of the entire WaveNet model. Indeed, they report a MOS of $4.51 \pm 0.08$ for WaveNet and of $4.48 \pm 0.07$ for their best WaveRNN model. The inputs to the model are the GTA mel spectrograms generated by the synthesizer, with the ground truth audio as target. At training time, the model predicts fixed-size waveform segments. The forward pass of WaveRNN requires only $N = 5$ matrix-vector products, in a coarse-fine scheme where the coarse part (the 8 most significant bits) of the target 16-bit sample is predicted first and then used to condition the prediction of the fine part (the 8 least significant bits). The prediction is over the parameters of a distribution from which the output sample is drawn. We refer the reader to \citep{WaveRNN} for additional details.
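The coarse/fine split of a sample amounts to the following (a sketch on unsigned 16-bit values):

\begin{verbatim}
def split_coarse_fine(sample):
    # 'sample' is an unsigned 16-bit value in [0, 65535]
    coarse = sample // 256           # 8 most significant bits
    fine = sample % 256              # 8 least significant bits
    return coarse, fine

def merge_coarse_fine(coarse, fine):
    return coarse * 256 + fine

assert merge_coarse_fine(*split_coarse_fine(43210)) == 43210
\end{verbatim}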
The authors improve on the factors $c(op_i)$ and $d(op_i)$ by implementing the sampling operation as a custom GPU operation; we do not replicate this. They also sparsify the network with the pruning strategy from \citep{SparsityRNN, 2PruneOrNot2Prune}. This method gradually prunes weights during training, as opposed to more classical pruning algorithms that operate between several trainings. The algorithm maintains a binary mask over the weights that indicates whether each weight is forced to 0 or left as is. The proportion of zero weights among the total number of weights in the network is called the sparsity. Results from \citep{SparsityRNN, 2PruneOrNot2Prune} indicate that large networks with sparsity levels between 90\% and 95\% significantly outperform smaller dense networks of comparable size. The authors of WaveRNN additionally argue that $c(op_i)$ is proportional to the number of nonzero weights. We experiment with this form of pruning and report our results in section \ref{vocoder_experiments}.
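A sketch of this gradual pruning, with the cubic sparsity schedule of \citep{2PruneOrNot2Prune} and a magnitude-based mask (the hyperparameter names and values are ours, for illustration):

\begin{verbatim}
import torch

def target_sparsity(step, start_step, prune_steps, final_sparsity):
    # Cubic schedule: sparsity ramps up quickly at first, then flattens out
    if step < start_step:
        return 0.0
    t = min((step - start_step) / prune_steps, 1.0)
    return final_sparsity * (1 - (1 - t) ** 3)

def prune_mask(weight, sparsity):
    # Binary mask zeroing the weights of smallest magnitude
    n_prune = int(weight.numel() * sparsity)
    if n_prune == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(n_prune).values
    return (weight.abs() > threshold).float()

# During training, e.g. every few hundred steps:
# w.data *= prune_mask(w.data, target_sparsity(step, 20000, 80000, 0.95))
\end{verbatim}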
Finally, they improve on $|\mathbf{u}|$ with batched sampling. In batched sampling, the utterance is divided into segments of fixed length and the generation is done in parallel over all segments. To preserve some context between the end of a segment and the beginning of the next one, a small section at the end of each segment is repeated at the beginning of the following one. This process is called folding. The model then forwards the folded segments. To retrieve the unfolded waveform, the overlapping sections of consecutive segments are merged with a cross-fade. This is illustrated in Figure \ref{batched_sampling}. We use batched sampling with the alternative WaveRNN, with a segment length of 8000 samples and an overlap of 400 samples. With these parameters, a folded batch of size 2 will yield a bit more than 1 second of audio for 16kHz speech.
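A minimal NumPy sketch of the folding and unfolding operations, illustrated on a one-dimensional signal (in the actual implementation, folding is applied to the conditioning spectrogram frames and unfolding to the generated waveform, and both operate on batched tensors):

\begin{verbatim}
import numpy as np

def fold(x, seg_len=8000, overlap=400):
    # Split x into overlapping segments of equal length (pad the end)
    hop = seg_len - overlap
    n_segs = int(np.ceil(max(len(x) - overlap, 1) / hop))
    x = np.pad(x, (0, n_segs * hop + overlap - len(x)))
    return np.stack([x[i * hop:i * hop + seg_len] for i in range(n_segs)])

def unfold(segs, overlap=400):
    # Merge consecutive segments, cross-fading the overlapping sections
    seg_len = segs.shape[1]
    hop = seg_len - overlap
    fade_in = np.linspace(0.0, 1.0, overlap)
    out = np.zeros(hop * len(segs) + overlap)
    for i, seg in enumerate(segs):
        seg = seg.astype(float)
        if i > 0:
            seg[:overlap] *= fade_in          # fade in the segment head
        if i < len(segs) - 1:
            seg[-overlap:] *= fade_in[::-1]   # fade out the segment tail
        out[i * hop:i * hop + seg_len] += seg
    return out  # unfold(fold(x)) recovers x, up to the end padding
\end{verbatim}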
\begin{figure}[h]
\centering
% includegraphics omitted
\caption{Folding and unfolding in batched sampling: the input is split into overlapping segments that are generated in parallel, and the overlapping sections of consecutive segments are merged with a cross-fade.}
\label{batched_sampling}
\end{figure}
\subsubsection{Experiments} \label{vocoder_experiments}
When dealing with short utterances, the vocoder usually runs slower than real-time. The inference speed is highly dependent on the number of folds in batched sampling: the network runs in nearly constant time with respect to the number of folds, with only a small increase in time as the number of folds grows. We therefore find it simpler to talk about a threshold duration of speech above which the model runs in real-time. On our setup, this threshold is 12.5 seconds, meaning that for utterances shorter than this threshold the model runs slower than real-time. Performance appears to vary unexpectedly with some environment factors (such as the operating system) under PyTorch, and we therefore express our results with respect to a single fixed configuration.
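The real-time criterion can be expressed as the ratio between the duration of the generated speech and the time spent generating it (\texttt{vocoder.generate} is a hypothetical stand-in for our model's inference function):

\begin{verbatim}
import time

def real_time_factor(vocoder, mel, sample_rate=16000):
    start = time.perf_counter()
    wav = vocoder.generate(mel)            # batched sampling inside
    elapsed = time.perf_counter() - start
    # > 1 means generation is faster than real-time
    return (len(wav) / sample_rate) / elapsed
\end{verbatim}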
The implementation at hand does not include the custom GPU operation of \citep{WaveRNN}, and implementing it is beyond our capabilities. Rather, we focus on the pruning aspect mentioned by the authors, who claim that a large sparse WaveRNN performs better and faster than a smaller dense one. We have experimented with the pruning algorithm but, due to time limits, did not complete the training of a pruned model; this is a milestone we hope to achieve at a later date. Sparse tensors are, at the time of writing, still an experimental feature in PyTorch, and their implementation might not be as efficient as the one the authors used. Through experiments, we find that the matrix multiplication \texttt{addmm} between a sparse matrix and a dense vector only breaks even time-wise with the dense-only \texttt{addmm} for sparsity levels above 91\%. Below this value, using sparse tensors actually slows down the forward pass. The authors report sparsity levels of 96.4\% and 97.8\% \citep[Table 5]{WaveRNN} while maintaining decent performance. Our tests indicate that, at best, a sparsity level of 96.4\% would lower the real-time threshold to 7.86 seconds, and a level of 97.8\% to 4.44 seconds. These are optimistic lower bounds on the actual threshold, due to our assumption of constant-time inference and because some layers of the model cannot be sparsified. This preliminary analysis indicates that pruning the vocoder would be beneficial to inference speed.
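The break-even point was estimated with micro-benchmarks of the following kind (a sketch; the exact figures depend on the hardware, the matrix size and the PyTorch version):

\begin{verbatim}
import time
import torch

def bench(fn, n=1000):
    fn()                                   # warm-up
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n

size, sparsity = 1024, 0.91
dense = torch.randn(size, size)
sparse = (dense * (torch.rand(size, size) > sparsity).float()).to_sparse()
x, b = torch.randn(size, 1), torch.randn(size, 1)

print("dense addmm :", bench(lambda: torch.addmm(b, dense, x)))
print("sparse addmm:", bench(lambda: torch.sparse.addmm(b, sparse, x)))
\end{verbatim}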
Unfortunately, we could not successfully train a final version of the vocoder before submitting this thesis. We did manage to get a prototype working by February, but this model is no longer compatible with the changes we have since made to the framework. We are determined to provide a working implementation before the defense of this thesis, but we cannot report on new experiments for now. Our impression of the prototype was that our implementation succeeded in creating a TTS model that could clone most voices, but not some uncommon ones. Some artifacts and background noise were present due to the poor quality of our synthesizer, which is why we had to revise the quality of our data and our preprocessing procedures. One drawback of the SV2TTS framework is the necessity to train the models in sequential order: once a new encoder is trained, the synthesizer must be retrained, and so must the vocoder. Waiting for models to train in order to know what to focus on next has been a recurring situation in the development of our framework. Nonetheless, we believe that we will be able to publish samples on our repository soon, and we invite the reader to follow the developments to come, even if they fall beyond the submission date.
\section{Toolbox and open-sourcing}
Part of the development effort went into making this project suitable for open-sourcing. The repository is accessible at \url{https://github.com/CorentinJ/Real-Time-Voice-Cloning}. If it is not yet accessible at the time of reading, it should be made public shortly.
Making the project open-source means exposing a clean interface to each model, both for training and inference, as well as documenting the code and procedures. We believe the effort is worth it. Our motivations are:
\begin{itemize}
\item to improve on our implementation with the contributions of users
\item to learn about managing open-source repositories
\item to ensure the reproducibility of our results
\item to allow users to perform voice cloning on consumer hardware with little data, time and effort.
\end{itemize}
For this last point, we aimed to develop a graphical interface that lets users get their hands on the framework quickly, without having to study it first. We call it the ``SV2TTS toolbox''. Its interface can be seen in Figure \ref{toolbox}. It is written in Python with the Qt4 graphical framework, and is therefore cross-platform.
\begin{figure}[h]
\centering
\includegraphics[width=1\linewidth]{images/toolbox.png}
\caption{The SV2TTS toolbox interface. This image is best viewed in the digital version of this document.}
\label{toolbox}
\end{figure}
A user begins by selecting an utterance audio file from any of the datasets stored on their disk. The toolbox handles several popular speech datasets and can be customized to add new ones. Users can also record utterances so as to clone their own voice.
Once an utterance is loaded, its embedding is computed and the UMAP projection is updated automatically. The mel spectrogram of the utterance is drawn (middle row, on the right), but merely for reference: it is not used to compute anything. We also draw the embedding vector as a heatmap (to the left of the spectrogram). Note that embeddings are one-dimensional vectors, so the square layout of the heatmap carries no structural meaning about the embedding values. Drawing embeddings gives visual cues as to how two embeddings differ.
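The heatmap is simply the embedding vector reshaped into a square for display; assuming 256-dimensional embeddings as an example, it can be drawn as follows:

\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt

def plot_embedding_heatmap(embed):
    # The reshape is purely for display; the square layout has no meaning
    side = int(np.sqrt(len(embed)))
    plt.imshow(embed[:side * side].reshape(side, side), cmap="viridis")
    plt.colorbar()
    plt.show()
\end{verbatim}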
When an embedding has been computed, it can be used to generate a spectrogram. The user can write any arbitrary text to be synthesized (top right of the interface). As a reminder, punctuation is not supported by our model and is discarded. To tune the prosody of the generated utterance, the user has to insert line breaks between the parts that should be synthesized individually; the complete spectrogram is then the concatenation of those parts. Synthesizing a spectrogram displays it on the bottom right of the interface. Synthesizing the same sentence multiple times yields different outputs.
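A sketch of how the content of the text box is turned into a batch of synthesizer inputs (\texttt{synthesize\_spectrograms} is a hypothetical interface to our synthesizer):

\begin{verbatim}
import numpy as np

def synthesize_from_text(synthesizer, text, embedding):
    # Each line break delimits a part synthesized on its own; the parts
    # form a batch and the resulting spectrograms are concatenated.
    texts = [line for line in text.split("\n") if line.strip()]
    specs = synthesizer.synthesize_spectrograms(texts,
                                                [embedding] * len(texts))
    return np.concatenate(specs, axis=1)   # concatenate along time
\end{verbatim}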
Finally, the user can generate the waveform corresponding to the synthesized spectrogram with the vocoder. A loading bar displays the progress of the generation. When done, the embedding of the synthesized utterance is computed (to the left of the synthesized spectrogram) and is also projected with UMAP. The user is free to take that embedding as the reference for further generation.
%A noteworthy property of presenting the embedding at every encoder step is that it allows to change a voice through a sentence, e.g. by morphing a voice to another with a linear interpolation between their respective embeddings.
\section{Conclusion}
We developed a framework for real-time voice cloning for which no public implementation existed. We find the results satisfying despite some unnatural prosody, and the voice cloning ability of the framework to be reasonably good, although not on par with methods that make use of more reference speech. We hope to improve our framework beyond the scope of this thesis, and possibly to implement some of the newer advances in the field that appeared during its writing. We believe that even more powerful forms of voice cloning will become available in the near future.
\clearpage
\section*{Acknowledgments}
I would like to thank my advisor Prof. Gilles Louppe for his wise advice, his dedication and his patience.
\noindent I would like to thank Mr. Quan Wang for answering my questions about his research.
\noindent I would like to thank GitHub users Rayhane-mamah\footnote{\url{https://github.com/Rayhane-mamah}}, fatchord\footnote{\url{https://github.com/fatchord}} and begeekmyfriend\footnote{\url{https://github.com/begeekmyfriend}} for their contributions to the open-source repositories used in this project.
\noindent I also wish to thank teaching assistants Joeri Hermans and Romain Mormont for managing the university's deep learning cluster and for providing me with technical assistance.
\noindent Finally, I wish to thank my sister Marie and sister-in-law Virginie for proofreading this thesis.
\clearpage
\bibliographystyle{plainnat}
Generalized End-To-End Loss For Speaker Verification
------ Todo ------
- Analyze perf of the encoder on the test set (join vc1_test and vc2_test, they're small)
- Behaviour of the encoder outputs on a sequence (when does it converge, what if another speaker comes in etc) -> try to achieve speaker diarization from this
------ Ideas ------
- Add a term in the loss that forces centroid of synthesized utterances to be close to that of ground truth utterances
- How much is the encoding affected by the quality of the microphone? Record the same sentences on 2 or 3 different microphones and project the embeddings. Look at how far they are from the rest of the dataset.
- How stable is the encoding w.r.t. the length of the utterance? Compare side by side embeddings of several utterances cut at 1.6s, 2.4s, 3.2s... Show the distribution of the distance to the centroids w.r.t. to the length of the utterances, the EER
- Analyze the components of the embeddings
------ Things to not forget to write about ------
- The contents of problems.txt, improvements.txt, encoder.txt, synthesizer.txt, questions.txt.
- (slides) Interpretation of EER
- Thank mr. Louppe, the authors of GE2E, Joeri, Virginie, github users
- LibriTTS, Improving data efficiency in end-to-end,
Notes on alignment:
- Some users report their number of training steps (https://github.com/Rayhane-mamah/Tacotron-2/issues/175)
Using the L1 loss (section 2.2) didn't change anything.
-> Actually made things way worse because I forgot to scale it, like a complete idiot.
How about a nonlinear encoding? https://i.imgur.com/xRWn6AE.png
-> talk about linear encoding and mulaw
gen_s_mel_raw:
- 110k: 4 3 3 2 3 4 4 3 4 5
- 196k: Pretty good overall. A few artifacts (~1 per utterance)
gen_s_mel_raw_no_pad:
- 110k: 2 3 3 3 3 5 4 4 4 3
- 200k: Pretty good as well. I'm not 100% sure, but I feel that gen_s_mel_raw is a bit better.
gt_s_mel:
- 209k: 5 3 3 3 2 1 1 1 1 1
Adding preemphasis now
-> Models trained with preemphasis sound better with deemphasis, as expected
-> Hard to tell if they sound better than non-preemphasis models however
NOTES:
- tacotron_model.ckpt-486000 was the model used to generate GTA.
- best loss on mu_law: 2.935 (????)
TODO:
- Pruning
- Check if there is any significant speedup with 100% sparsity
- Begin merging the three projects:
- Fix saving wavs in the synthesizer preprocessor
- Single inference demo (without GUI)
- Single big inference tool (with GUI)
- Single config file (or args???? -> could be both: default to config values)
- Three hparams files (inside packages?)
- Proper versions in requirements.txt (do something different for inference and training?)
- Clean up the rest of the code (' to "), Check all TODOs
- Move root of repo to SV2TTS (the rest isn't necessary)
- Put on github (RECHECK ALL REQUIREMENTS (+VLIBS))
- Make demo website
- Write something
- Cite all datasets
- Argparse with default values in config
- Namespace for hparams
Some noisy speakers: 40,
Full: 3.7
-I 3.9