Generalized End-To-End Loss For Speaker Verification

------ Todo ------
- Analyze the performance of the encoder on the test set (join vc1_test and vc2_test, as both are small)
- Study the behaviour of the encoder outputs on a sequence (when does it converge, what happens if another speaker comes in, etc.) -> try to achieve speaker diarization from this

------ Ideas ------
- Add a term in the loss that forces the centroid of the synthesized utterances to be close to that of the ground-truth utterances (a sketch of such a term follows this list)
- How much is the encoding affected by the quality of the microphone? Record the same sentences on 2 or 3 different microphones and project the embeddings. Look at how far they are from the rest of the dataset.
- How stable is the encoding w.r.t. the length of the utterance? Compare side by side the embeddings of several utterances cut at 1.6s, 2.4s, 3.2s... Show the distribution of the distances to the centroids and the EER w.r.t. the length of the utterances.
- Analyze the components of the embeddings
- Technically, you could do voice morphing within the same sentence
- Check out this dataset: http://www.robots.ox.ac.uk/~vgg/data/lip_reading/
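A minimal sketch of what that extra term could look like, assuming PyTorch and assuming the encoder can be run on the synthesized output so that gradients flow back to the synthesizer; the function name, the cosine formulation and the weighting are placeholders rather than anything from the current code:

import torch.nn.functional as F

def centroid_proximity_loss(synth_embeds, gt_embeds):
    """Extra term pulling the centroid of the synthesized utterances'
    embeddings towards the centroid of the ground-truth utterances'
    embeddings for the same speaker.

    Both arguments are (n_utterances, embed_dim) tensors of embeddings
    produced by the encoder.
    """
    # Average the utterance embeddings to get one centroid per set
    synth_centroid = synth_embeds.mean(dim=0)
    gt_centroid = gt_embeds.mean(dim=0)

    # 1 - cosine similarity: 0 when the centroids point the same way, at most 2
    return 1 - F.cosine_similarity(synth_centroid, gt_centroid, dim=0)

# Hypothetical usage, with alpha a small weight to tune:
#   loss = synthesizer_loss + alpha * centroid_proximity_loss(synth_embeds, gt_embeds)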
------ Things to not forget to write about ------
- The contents of problems.txt, improvements.txt, encoder.txt, synthesizer.txt, questions.txt
- (slides) Interpretation of the EER
- Thank Mr. Louppe, the authors of GE2E, and Joeri
- LibriTTS, "Improving data efficiency in end-to-end",

------ Structure ------
- Abstract (1/3 page): very briefly talk about the setting, the current SOTA, and the goal to achieve
- Introduction (1 page): go over the abstract in more detail, pose the central problem that will guide the rest of the thesis
- State of the art (3-4 pages)
    - In text-to-speech (2-3 pages) (cut about half of what I've written to make it a less tedious read)
    - In voice cloning (1-2 pages)
- SV2TTS (5-7 pages): present their framework without talking about my implementation. Present the parts of the framework without getting too technical; rather, explain how they work together. Talk about the surrounding context of their work: impressive results, no public implementation yet, one of the many papers on TTS from Google (mention the kind of applications they have for it).
- My work (~20 pages):
    - Encoder (8-10 pages; I'll talk more about it since I reimplemented it from scratch)
        - GE2E
        - Implementation
        - Embedding space
        - Function as a speaker diarizer
    - Synthesizer (5-8 pages)
        - Tacotron 2
        - Convergence
        - Dataset
    - Vocoder (3-4 pages)
        - WaveRNN
        - Pruning
        - Speed
    - Toolbox? (3 pages)
    - Integration as an application? (2 pages)
- Conclusion (1/2 page)

------ Notes ------
https://stats.stackexchange.com/questions/395345/computing-and-estimating-the-eer-on-an-entire-dataset (see the EER sketch at the end of these notes)
IIRC "THE EFFECT OF NEURAL NETWORKS IN STATISTICAL PARAMETRIC SPEECH SYNTHESIS" talked somewhere about recurrence and neighbouring features. I can't find where again, but it'd be great if I did.
I think Hashimoto 2015 talked about cross-language.
"Fundamentals of Speaker Recognition" has a good deal of material for the definitions in the lexicon.
Damn good slides by H. Zen: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42624.pdf
1609.03499: "1606.06061 showed that state-of-the-art statistical parametric speech synthesizers matched state-of-the-art concatenative ones in some languages"
Deep Voice 1 explains WaveNet better than WaveNet does... go figure.
1702.07825: "As is common with high-dimensional generative models (Theis et al., 2015), model loss is somewhat uncorrelated with perceptual quality of individual samples."
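A rough sketch of how the EER could be estimated on the joined vc1_test + vc2_test set, assuming a hypothetical speaker_embeds dict mapping each test speaker to an array of L2-normalized utterance embeddings. It scores every same-speaker and different-speaker utterance pair with the cosine similarity; IIRC the GE2E paper scores utterances against enrollment centroids instead, so this is only a rough all-pairs approximation. None of the names below come from the repo.

import numpy as np
from sklearn.metrics import roc_curve

def estimate_eer(speaker_embeds):
    """Estimates the equal error rate from the cosine similarities of all
    same-speaker (genuine) and different-speaker (impostor) utterance pairs.

    <speaker_embeds>: dict mapping a speaker id to a float array of
    L2-normalized utterance embeddings of shape (n_utterances, embed_dim).
    """
    scores, labels = [], []
    speakers = list(speaker_embeds)
    for i, s1 in enumerate(speakers):
        e1 = speaker_embeds[s1]
        # Genuine pairs: every pair of distinct utterances from the same speaker
        idx = np.triu_indices(len(e1), k=1)
        scores.extend((e1 @ e1.T)[idx])
        labels.extend([1] * len(idx[0]))
        # Impostor pairs: every utterance of s1 against every utterance of s2
        for s2 in speakers[i + 1:]:
            e2 = speaker_embeds[s2]
            scores.extend((e1 @ e2.T).ravel())
            labels.extend([0] * (len(e1) * len(e2)))

    # The EER is the point of the ROC curve where the false acceptance rate
    # (FPR) crosses the false rejection rate (1 - TPR)
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    i_eer = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[i_eer] + fnr[i_eer]) / 2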