\documentclass[a4paper, oneside, 12pt, english]{article}
\usepackage{geometry}
\geometry{
	a4paper,
	total={154mm,232mm},
	left=28mm,
	top=32mm,
}
\setlength{\parskip}{0.85em}
\setlength{\parindent}{2.6em}
\usepackage{array}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsfonts} 
\usepackage{enumerate}
\usepackage{graphicx}
\usepackage{url}
\usepackage{color}
\usepackage{wrapfig}
\usepackage{systeme}
\usepackage{tabularx}
\usepackage{subfig}
\usepackage{hyperref}
\usepackage[authoryear, round]{natbib}
\usepackage[dvipsnames]{xcolor}
\usepackage{booktabs}
\usepackage{multirow}


\begin{document}
\section*{Real-Time Voice Cloning}
We developed a three-stage deep learning framework that performs voice cloning in real time. The framework was introduced in a 2018 paper from Google, for which no public implementation existed before ours. From a speech utterance of only 5 seconds, the framework captures a meaningful digital representation of the speaker's voice. Given a text prompt, it can then perform text-to-speech with any voice extracted by this process. We reproduced each of the three stages with our own implementations or with open-source ones. We implemented efficient deep learning models and adequate data preprocessing pipelines, and trained these models for weeks or months on large datasets comprising tens of thousands of hours of speech from several thousand speakers. We analyzed their capabilities and their drawbacks. We focused on making the framework operate in real time, that is, on capturing a voice and generating speech in less time than the duration of the generated speech. The framework is able to clone voices it has never heard during training and to generate speech from text it has never seen. We made our code and pretrained models public and developed a graphical interface to the framework, so that it is accessible even to users unfamiliar with deep learning.

\vspace{2cm}

\noindent \textbf{Author}: Corentin Jemine {\small \textit{(C.S. bachelor 2014-2017, master in Data Science 2017-2019)}}

\noindent \textbf{Supervisor}: Prof. Gilles Louppe

\noindent \textbf{Academic year}: 2018-2019


\end{document}