Commit 878bab60 authored by Corentin Jemine

Figured out the feature extraction process

Parent eb0406f0
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Load a LibriSpeech utterance resampled to 16kHz and keep the first 3.5 seconds
fpath = r"E:\Datasets\LibriSpeech\train-other-500\20\205\20-205-0000.flac"
signal, sampling_rate = librosa.load(fpath, sr=16000)
signal = signal[:int(sampling_rate * 3.5)]
plt.plot(signal)
plt.show()

# 40-channel mel spectrogram with 25ms windows and a 10ms hop (100Hz frame rate)
frames = librosa.feature.melspectrogram(y=signal,
                                        sr=sampling_rate,
                                        n_fft=int(sampling_rate * 0.025),
                                        hop_length=int(sampling_rate * 0.01),
                                        n_mels=40)

# Display the spectrogram on a dB scale
librosa.display.specshow(
    librosa.power_to_db(frames, ref=np.max),
    hop_length=int(sampling_rate * 0.01),
    y_axis='mel',
    x_axis='time',
    sr=sampling_rate
)
plt.colorbar(format='%+2.0f dB')
plt.title('Mel spectrogram')
plt.tight_layout()
plt.show()
\ No newline at end of file
from vlibs import fileio
LIBRISPEECH_ROOT = "E://Datasets/LibriSpeech"
LIBRISPEECH_DATASETS = ["train-other-500"]
\ No newline at end of file
from config import *
\ No newline at end of file
from vlibs import fileio, core
import numpy as np

# Set of utterances for a single speaker
class SpeakerData:
    def __init__(self, root, name):
        self.name = name
        self.utterances = fileio.get_files(root, r"\.npy")
        self.next_utterances = []

    def random_utterances(self, count):
        """
        Samples a batch of <count> unique utterances from the disk, such that every utterance
        comes up at least once every two cycles and in a random order every time.

        :param count: The number of utterances to sample from the set of utterances of that
        speaker. Utterances are guaranteed not to be repeated if <count> is not larger than the
        number of utterances available.
        :return: A list of utterances loaded in memory.
        """
        # Sample the utterances, refilling and reshuffling the queue whenever it runs out
        fpaths = []
        while count > 0:
            n = min(count, len(self.next_utterances))
            fpaths.extend(self.next_utterances[:n])
            self.next_utterances = self.next_utterances[n:]
            if len(self.next_utterances) == 0:
                new_utterances = [u for u in self.utterances if u not in fpaths[-n:]]
                self.next_utterances = core.shuffle(new_utterances)
            count -= n

        # Load them
        return list(map(np.load, fpaths))
\ No newline at end of file
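A minimal usage sketch for the class above (hypothetical module name and path; assumes a directory of preprocessed .npy utterances for one speaker):

from speaker_data import SpeakerData  # hypothetical module name for the class above

speaker = SpeakerData(r"E:\Datasets\LibriSpeech-preprocessed\20", "20")  # hypothetical path
batch = speaker.random_utterances(5)  # list of 5 numpy arrays loaded from disk
# Sampling again continues the shuffled cycle, so no utterance repeats until the
# whole set has been consumed.
more = speaker.random_utterances(5)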
from vlibs import fileio
import librosa
@@ -4,6 +4,7 @@ https://catalog.ldc.upenn.edu/LDC93S1
http://www.voxforge.org/home
https://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html (used in 1802.06006, 1806.04558)
http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/
### TATOEBA ###
......
- Why do they call the architecture an encoder-decoder one? There is no special embedding in the middle of the network. Something to do with the loss?
- Matheo + B. Wéry
Self:
- Am I going to need a different embedding for the voice of the same speaker in two different languages? I may need to formulate a "unique encoding hypothesis", i.e. that two people with the same voice in language A would also have the same voice in language B. This is likely not a true hypothesis, but still a reasonable simplification for the voice transfer problem.
- [1409.0473] "Most of the proposed neural machine translation models belong to a family of encoder–decoders (...), with an encoder and a decoder for each language, (...)". I could do something similar: a voice encoder and a synthesizer per language, and somehow manage to keep a shared embedding space for all languages. This reminds me of UNIT, I wonder if it's applicable here. Very likely, the best way to do this lies in recent NLP methods.
Answered:
- Why do they call the architecture an encoder-decoder one? There is no special embedding in the middle of the network. Something to do with the loss?
-> Because of the attention layer. Encode source characters into an intermediate representation, then decode them to produce a spectrogram.
- How do I aggregate the different metrics: MOS, Preference score, LSD/VU error rate/F0 RMSE?
-> Don't. Present them separately as they are.
......
END-TO-END TEXT-DEPENDENT SPEAKER VERIFICATION:
- Take an utterance, split it into frames with a window function, feed the frames to the LSTM. Take only the last output of the LSTM and average it over several frames to obtain the embedding of that utterance. The speaker embedding (also called speaker model) is the average of several utterance embeddings (see the sketch after this list).
- The process of deriving a speaker embedding from several utterances of the same speaker is called enrollment.
- Typically there is an order of magnitude more utterances available per speaker than are used for enrollment.
- Addition of noise to the data (page 3, top of right column), but not to the enrollment samples! Non-augmented samples are also used for evaluation.
- Dataset of 22M utterances from 80k speakers with at least 150 utterances per speaker.
- 100Hz framerate (~80 frames per sample) with 40 log-mel filters.
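A minimal numpy sketch of that embedding scheme (hypothetical shapes; the arrays stand in for the outputs of the actual LSTM):

import numpy as np

def utterance_embedding(last_outputs):
    # last_outputs: (n_windows, embedding_size) array holding the last output of
    # the LSTM for several windows of one utterance; average them.
    return np.mean(last_outputs, axis=0)

def speaker_embedding(utterance_embeddings):
    # Enrollment: the speaker model is the average of several utterance embeddings.
    return np.mean(np.stack(utterance_embeddings), axis=0)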
GENERALIZED END-TO-END LOSS FOR SPEAKER VERIFICATION:
- Each batch contains N speakers and M utterances.
- Addition of a linear layer at the end of the LSTM network.
- The utterance embedding is the L2-normalized output of the last layer.
- In training, when comparing an utterance embedding with a speaker embedding, they exclude that utterance embedding from the computation of the speaker embedding if they both come from the same speaker (see the sketch after this list).
- Overlapping windows of 25ms width every 10ms (= 100Hz) with 40 log-mel filters.
- Utterances of 140-180 frames randomly extracted using Voice Activity Detection (VAD)
- Utterances of 160 frames with 50% overlap are used at inference time. The outputs of the model for each utterance are L2-normalized then averaged.
- 3-layer LSTM with projection of size equal to the embedding (64 and 256 for small and large datasets respectively).
- Training hyperparameters at the beginning of section 3.
- Dataset of 36M utterances from 80k speakers for training. 1k speakers for evaluation with 6.3 average enrollment utterances and 7.2 evaluation utterances per speaker.
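A minimal numpy sketch of that exclusion (a simplification: the learned scale and bias of the GE2E similarity are left out):

import numpy as np

def ge2e_similarity(embeds):
    # embeds: (N speakers, M utterances, D) utterance embeddings, L2-normalized.
    # Returns cosine similarities of shape (N, M, N) between every utterance
    # embedding and every speaker centroid.
    N, M, _ = embeds.shape
    centroids = embeds.mean(axis=1)
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    sim = np.einsum('jid,kd->jik', embeds, centroids)
    # When an utterance is compared against its own speaker, recompute that
    # centroid without the utterance itself.
    for j in range(N):
        for i in range(M):
            excl = (embeds[j].sum(axis=0) - embeds[j, i]) / (M - 1)
            excl /= np.linalg.norm(excl)
            sim[j, i, j] = embeds[j, i] @ excl
    return sim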
TRANSFER LEARNING FROM SPEAKER VERIFICATION TO MULTISPEAKER TEXT-TO-SPEECH SYNTHESIS
- Trained on untranscribed speech containing reverberation and background noise from a large number of speakers (which can be disjoint from those used for the synthesis network).
- Training on 36M utterances from 18k speakers (and 1.2k for synthesis network) with median utterance duration of 3.9 seconds. This data is not used for the other parts of the framework. At least thousands of speakers are needed for zero-shot transfer.
- Training set consists of audio segmented in 1.6 seconds and the speaker label.
- Only using the last output of the LSTM as embedding (still L2-normalized).
- Utterances of 800ms with 50% overlap are used at inference time (see the sketch after this list).
- VCTK: downsampling to 24kHz, trim leading and trailing silences (median duration from 3.3 to 1.8 seconds).
- LibriSpeech: data sampled at 16kHz. See paper for details.
- Datasets used for the encoder: LS-Other, VC, VC2 (for the synthesis network: VCTK, LS-Clean)
- LS-Other: 461 hours of speech from 1166 speakers (disjoint from LS-Clean)
- VoxCeleb: 139k utterances from 1211 speakers
- VoxCeleb2: 1.09M utterances from 5994 speakers
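A minimal numpy sketch of that inference scheme (embed_window is a hypothetical stand-in for the trained encoder; 80 frames at a 100Hz frame rate = 800ms):

import numpy as np

def embed_utterance(frames, embed_window, window_size=80):
    # frames: (n_frames, n_mels) log-mel frames at a 100Hz frame rate.
    # Slide 800ms windows with 50% overlap, embed each, L2-normalize, then average.
    step = window_size // 2
    starts = range(0, max(len(frames) - window_size, 0) + 1, step)
    windows = [frames[s:s + window_size] for s in starts]
    embeds = np.array([embed_window(w) for w in windows])
    embeds /= np.linalg.norm(embeds, axis=1, keepdims=True)
    return embeds.mean(axis=0)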
About mel:
- Tacotron predicts a mel spectrogram
- Tacotron's target is an 80-channel mel-scale filterbank followed by log dynamic range compression
- The encoder takes 40-channel log-mel spectrogram frames
- The encoder takes 40-channel log-mel filterbank energies (see the sketch after this list)
- https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
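A short librosa sketch of those encoder inputs, in the spirit of the script at the top of this commit (assuming log dynamic range compression simply means taking the log of the mel energies):

import librosa
import numpy as np

def logmel_frames(signal, sampling_rate=16000, n_mels=40):
    # 40-channel mel filterbank energies with 25ms windows and a 10ms hop,
    # followed by log compression; returns an array of shape (n_frames, n_mels).
    mels = librosa.feature.melspectrogram(y=signal,
                                          sr=sampling_rate,
                                          n_fft=int(sampling_rate * 0.025),
                                          hop_length=int(sampling_rate * 0.01),
                                          n_mels=n_mels)
    return np.log(mels + 1e-6).T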
What I suggest:
- Start with GE2E