diff --git a/README.md b/README.md index f1570b4a04899ff2610e294747c2ef93ede0ce20..083a0e6f3fe5be03259b9db047a0173a80ab7820 100644 --- a/README.md +++ b/README.md @@ -204,6 +204,11 @@ Developers can have a try of our models with [PaddleSpeech Command Line](./paddl paddlespeech cls --input input.wav ``` +**Speaker Verification** +``` +paddlespeech vector --task spk --input input_16k.wav +``` + **Automatic Speech Recognition** ```shell paddlespeech asr --lang zh --input input_16k.wav @@ -489,6 +494,29 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r +**Speaker Verification** + + + + + + + + + + + + + + + + + + +
Task Dataset Model Type Link
Speaker VerificationVoxCeleb12ECAPA-TDNN + ecapa-tdnn-voxceleb12 +
+ **Punctuation Restoration** @@ -530,6 +558,7 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht - [Chinese Rule Based Text Frontend](./docs/source/tts/zh_text_frontend.md) - [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) - [Audio Classification](./demos/audio_tagging/README.md) + - [Speaker Verification](./demos/speaker_verification/README.md) - [Speech Translation](./demos/speech_translation/README.md) - [Released Models](./docs/source/released_model.md) - [Community](#Community) diff --git a/README_cn.md b/README_cn.md index 70f6b2d95ef231e873a9cf34eee7d68e93af2c53..f5f5a9e3fefb03dd16bc4db413009ae4297d6e31 100644 --- a/README_cn.md +++ b/README_cn.md @@ -203,6 +203,10 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme ```shell paddlespeech cls --input input.wav ``` +**声纹识别** +```shell +paddlespeech vector --task spk --input input_16k.wav +``` **语音识别** ```shell paddlespeech asr --lang zh --input input_16k.wav @@ -481,6 +485,30 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
+ +**声纹识别** + + + + + + + + + + + + + + + + + + +
Task Dataset Model Type Link
Speaker VerificationVoxCeleb12ECAPA-TDNN + ecapa-tdnn-voxceleb12 +
+ **标点恢复** @@ -527,6 +555,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声 - [中文文本前端](./docs/source/tts/zh_text_frontend.md) - [测试语音样本](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) - [声音分类](./demos/audio_tagging/README_cn.md) + - [声纹识别](./demos/speaker_verification/README_cn.md) - [语音翻译](./demos/speech_translation/README_cn.md) - [模型列表](#模型列表) - [语音识别](#语音识别模型) diff --git a/demos/speaker_verification/README.md b/demos/speaker_verification/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c4d10ccf22f55ea31afe2303d7c2fcd9c0213d22 --- /dev/null +++ b/demos/speaker_verification/README.md @@ -0,0 +1,178 @@ +([简体中文](./README_cn.md)|English) +# Speaker Verification + +## Introduction + +Speaker verification refers to the task of extracting a speaker embedding from an audio clip. + +This demo extracts the speaker embedding of a given audio file. It can be done with a single command or a few lines of Python using `PaddleSpeech`. + +## Usage +### 1. Installation +See [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). + +You can choose one of the easy, medium, and hard ways to install paddlespeech. + +### 2. Prepare Input File +The input of this demo should be a WAV file (`.wav`), and its sample rate must be the same as the model's. + +Here is a sample file for this demo that can be downloaded: +```bash +wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav +``` + +### 3. Usage +- Command Line (Recommended) + ```bash + paddlespeech vector --task spk --input 85236145389.wav + + echo -e "demo1 85236145389.wav" > vec.job + paddlespeech vector --task spk --input vec.job + + echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk + ``` + + Usage: + ```bash + paddlespeech vector --help + ``` + Arguments: + - `input` (required): Audio file to extract the embedding from. + - `model`: Model type of the vector task. Default: `ecapatdnn_voxceleb12`. 
+ - `sample_rate`: Sample rate of the model. Default: `16000`. + - `config`: Config of vector task. Use pretrained model when it is None. Default: `None`. + - `ckpt_path`: Model checkpoint. Use pretrained model when it is None. Default: `None`. + - `device`: Choose device to execute model inference. Default: default device of paddlepaddle in current environment. + + Output: + +```bash + demo {'dim': 192, 'embedding': array([ -5.749211 , 9.505463 , -8.200284 , -5.2075014 , + 5.3940268 , -3.04878 , 1.611095 , 10.127234 , + -10.534177 , -15.821609 , 1.2032688 , -0.35080156, + 1.2629458 , -12.643498 , -2.5758228 , -11.343508 , + 2.3385992 , -8.719341 , 14.213509 , 15.404744 , + -0.39327756, 6.338786 , 2.688887 , 8.7104025 , + 17.469526 , -8.77959 , 7.0576906 , 4.648855 , + -1.3089896 , -23.294737 , 8.013747 , 13.891729 , + -9.926753 , 5.655307 , -5.9422326 , -22.842539 , + 0.6293588 , -18.46266 , -10.811862 , 9.8192625 , + 3.0070958 , 3.8072643 , -2.3861165 , 3.0821571 , + -14.739942 , 1.7594414 , -0.6485091 , 4.485623 , + 2.0207152 , 7.264915 , -6.40137 , 23.63524 , + 2.9711294 , -22.708025 , 9.93719 , 20.354511 , + -10.324688 , -0.700492 , -8.783211 , -5.27593 , + 15.999649 , 3.3004563 , 12.747926 , 15.429879 , + 4.7849145 , 5.6699696 , -2.3826702 , 10.605882 , + 3.9112158 , 3.1500628 , 15.859915 , -2.1832209 , + -23.908653 , -6.4799504 , -4.5365124 , -9.224193 , + 14.568347 , -10.568833 , 4.982321 , -4.342062 , + 0.0914714 , 12.645902 , -5.74285 , -3.2141201 , + -2.7173362 , -6.680575 , 0.4757669 , -5.035051 , + -6.7964664 , 16.865469 , -11.54324 , 7.681869 , + 0.44475392, 9.708182 , -8.932846 , 0.4123232 , + -4.361452 , 1.3948607 , 9.511665 , 0.11667654, + 2.9079323 , 6.049952 , 9.275183 , -18.078873 , + 6.2983274 , -0.7500531 , -2.725033 , -7.6027865 , + 3.3404543 , 2.990815 , 4.010979 , 11.000591 , + -2.8873312 , 7.1352735 , -16.79663 , 18.495346 , + -14.293832 , 7.89578 , 2.2714825 , 22.976387 , + -4.875734 , -3.0836344 , -2.9999814 , 13.751918 , + 6.448228 , 
-11.924197 , 2.171869 , 2.0423572 , + -6.173772 , 10.778437 , 25.77281 , -4.9495463 , + 14.57806 , 0.3044315 , 2.6132357 , -7.591999 , + -2.076944 , 9.025118 , 1.7834753 , -3.1799617 , + -4.9401326 , 23.465864 , 5.1685796 , -9.018578 , + 9.037825 , -4.4150195 , 6.859591 , -12.274467 , + -0.88911164, 5.186309 , -3.9988663 , -13.638606 , + -9.925445 , -0.06329413, -3.6709652 , -12.397416 , + -12.719869 , -1.395601 , 2.1150916 , 5.7381287 , + -4.4691963 , -3.82819 , -0.84233856, -1.1604277 , + -13.490127 , 8.731719 , -20.778936 , -11.495662 , + 5.8033476 , -4.752041 , 10.833007 , -6.717991 , + 4.504732 , 13.4244375 , 1.1306485 , 7.3435574 , + 1.400918 , 14.704036 , -9.501399 , 7.2315617 , + -6.417456 , 1.3333273 , 11.872697 , -0.30664724, + 8.8845 , 6.5569253 , 4.7948146 , 0.03662816, + -8.704245 , 6.224871 , -3.2701402 , -11.508579 ], + dtype=float32)} + ``` + +- Python API + ```python + import paddle + from paddlespeech.cli import VectorExecutor + + vector_executor = VectorExecutor() + audio_emb = vector_executor( + model='ecapatdnn_voxceleb12', + sample_rate=16000, + config=None, + ckpt_path=None, + audio_file='./85236145389.wav', + force_yes=False, + device=paddle.get_device()) + print('Audio embedding Result: \n{}'.format(audio_emb)) + ``` + + Output: + ```bash + # Vector Result: + {'dim': 192, 'embedding': array([ -5.749211 , 9.505463 , -8.200284 , -5.2075014 , + 5.3940268 , -3.04878 , 1.611095 , 10.127234 , + -10.534177 , -15.821609 , 1.2032688 , -0.35080156, + 1.2629458 , -12.643498 , -2.5758228 , -11.343508 , + 2.3385992 , -8.719341 , 14.213509 , 15.404744 , + -0.39327756, 6.338786 , 2.688887 , 8.7104025 , + 17.469526 , -8.77959 , 7.0576906 , 4.648855 , + -1.3089896 , -23.294737 , 8.013747 , 13.891729 , + -9.926753 , 5.655307 , -5.9422326 , -22.842539 , + 0.6293588 , -18.46266 , -10.811862 , 9.8192625 , + 3.0070958 , 3.8072643 , -2.3861165 , 3.0821571 , + -14.739942 , 1.7594414 , -0.6485091 , 4.485623 , + 2.0207152 , 7.264915 , -6.40137 , 23.63524 , + 
2.9711294 , -22.708025 , 9.93719 , 20.354511 , + -10.324688 , -0.700492 , -8.783211 , -5.27593 , + 15.999649 , 3.3004563 , 12.747926 , 15.429879 , + 4.7849145 , 5.6699696 , -2.3826702 , 10.605882 , + 3.9112158 , 3.1500628 , 15.859915 , -2.1832209 , + -23.908653 , -6.4799504 , -4.5365124 , -9.224193 , + 14.568347 , -10.568833 , 4.982321 , -4.342062 , + 0.0914714 , 12.645902 , -5.74285 , -3.2141201 , + -2.7173362 , -6.680575 , 0.4757669 , -5.035051 , + -6.7964664 , 16.865469 , -11.54324 , 7.681869 , + 0.44475392, 9.708182 , -8.932846 , 0.4123232 , + -4.361452 , 1.3948607 , 9.511665 , 0.11667654, + 2.9079323 , 6.049952 , 9.275183 , -18.078873 , + 6.2983274 , -0.7500531 , -2.725033 , -7.6027865 , + 3.3404543 , 2.990815 , 4.010979 , 11.000591 , + -2.8873312 , 7.1352735 , -16.79663 , 18.495346 , + -14.293832 , 7.89578 , 2.2714825 , 22.976387 , + -4.875734 , -3.0836344 , -2.9999814 , 13.751918 , + 6.448228 , -11.924197 , 2.171869 , 2.0423572 , + -6.173772 , 10.778437 , 25.77281 , -4.9495463 , + 14.57806 , 0.3044315 , 2.6132357 , -7.591999 , + -2.076944 , 9.025118 , 1.7834753 , -3.1799617 , + -4.9401326 , 23.465864 , 5.1685796 , -9.018578 , + 9.037825 , -4.4150195 , 6.859591 , -12.274467 , + -0.88911164, 5.186309 , -3.9988663 , -13.638606 , + -9.925445 , -0.06329413, -3.6709652 , -12.397416 , + -12.719869 , -1.395601 , 2.1150916 , 5.7381287 , + -4.4691963 , -3.82819 , -0.84233856, -1.1604277 , + -13.490127 , 8.731719 , -20.778936 , -11.495662 , + 5.8033476 , -4.752041 , 10.833007 , -6.717991 , + 4.504732 , 13.4244375 , 1.1306485 , 7.3435574 , + 1.400918 , 14.704036 , -9.501399 , 7.2315617 , + -6.417456 , 1.3333273 , 11.872697 , -0.30664724, + 8.8845 , 6.5569253 , 4.7948146 , 0.03662816, + -8.704245 , 6.224871 , -3.2701402 , -11.508579 ], + dtype=float32)} + ``` + +### 4.Pretrained Models + +Here is a list of pretrained models released by PaddleSpeech that can be used by command and python API: + +| Model | Sample Rate +| :--- | :---: | +| ecapatdnn_voxceleb12 | 16k diff 
--git a/demos/speaker_verification/README_cn.md b/demos/speaker_verification/README_cn.md new file mode 100644 index 0000000000000000000000000000000000000000..e2799b75e921035b71353abe506d6daf40e6a7ff --- /dev/null +++ b/demos/speaker_verification/README_cn.md @@ -0,0 +1,175 @@ +(简体中文|[English](./README.md)) + +# 声纹识别 +## 介绍 +声纹识别是一项用计算机程序自动提取说话人特征的技术。 + +这个 demo 是从给定音频文件中提取说话人特征的实现,它可以通过 `PaddleSpeech` 的单条命令或 python 中的几行代码来完成。 + +## 使用方法 +### 1. 安装 +请看[安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install_cn.md)。 + +你可以从 easy,medium,hard 三种方式中选择一种方式安装。 + +### 2. 准备输入 +这个 demo 的输入应该是一个 WAV 文件(`.wav`),并且采样率必须与模型的采样率相同。 + +可以下载此 demo 的示例音频: +```bash +# 该音频的内容是数字串 85236145389 +wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav +``` +### 3. 使用方法 +- 命令行 (推荐使用) + ```bash + paddlespeech vector --task spk --input 85236145389.wav + + echo -e "demo1 85236145389.wav" > vec.job + paddlespeech vector --task spk --input vec.job + + echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk + ``` + + 使用方法: + ```bash + paddlespeech vector --help + ``` + 参数: + - `input`(必须输入):用于提取说话人特征的音频文件。 + - `model`:声纹任务的模型,默认值:`ecapatdnn_voxceleb12`。 + - `sample_rate`:音频采样率,默认值:`16000`。 + - `config`:声纹任务的参数文件,若不设置则使用预训练模型中的默认配置,默认值:`None`。 + - `ckpt_path`:模型参数文件,若不设置则下载预训练模型使用,默认值:`None`。 + - `device`:执行预测的设备,默认值:当前系统下 paddlepaddle 的默认 device。 + + 输出: + ```bash + demo {'dim': 192, 'embedding': array([ -5.749211 , 9.505463 , -8.200284 , -5.2075014 , + 5.3940268 , -3.04878 , 1.611095 , 10.127234 , + -10.534177 , -15.821609 , 1.2032688 , -0.35080156, + 1.2629458 , -12.643498 , -2.5758228 , -11.343508 , + 2.3385992 , -8.719341 , 14.213509 , 15.404744 , + -0.39327756, 6.338786 , 2.688887 , 8.7104025 , + 17.469526 , -8.77959 , 7.0576906 , 4.648855 , + -1.3089896 , -23.294737 , 8.013747 , 13.891729 , + -9.926753 , 5.655307 , -5.9422326 , -22.842539 , + 0.6293588 , -18.46266 , -10.811862 , 9.8192625 , + 3.0070958 , 3.8072643 , 
-2.3861165 , 3.0821571 , + -14.739942 , 1.7594414 , -0.6485091 , 4.485623 , + 2.0207152 , 7.264915 , -6.40137 , 23.63524 , + 2.9711294 , -22.708025 , 9.93719 , 20.354511 , + -10.324688 , -0.700492 , -8.783211 , -5.27593 , + 15.999649 , 3.3004563 , 12.747926 , 15.429879 , + 4.7849145 , 5.6699696 , -2.3826702 , 10.605882 , + 3.9112158 , 3.1500628 , 15.859915 , -2.1832209 , + -23.908653 , -6.4799504 , -4.5365124 , -9.224193 , + 14.568347 , -10.568833 , 4.982321 , -4.342062 , + 0.0914714 , 12.645902 , -5.74285 , -3.2141201 , + -2.7173362 , -6.680575 , 0.4757669 , -5.035051 , + -6.7964664 , 16.865469 , -11.54324 , 7.681869 , + 0.44475392, 9.708182 , -8.932846 , 0.4123232 , + -4.361452 , 1.3948607 , 9.511665 , 0.11667654, + 2.9079323 , 6.049952 , 9.275183 , -18.078873 , + 6.2983274 , -0.7500531 , -2.725033 , -7.6027865 , + 3.3404543 , 2.990815 , 4.010979 , 11.000591 , + -2.8873312 , 7.1352735 , -16.79663 , 18.495346 , + -14.293832 , 7.89578 , 2.2714825 , 22.976387 , + -4.875734 , -3.0836344 , -2.9999814 , 13.751918 , + 6.448228 , -11.924197 , 2.171869 , 2.0423572 , + -6.173772 , 10.778437 , 25.77281 , -4.9495463 , + 14.57806 , 0.3044315 , 2.6132357 , -7.591999 , + -2.076944 , 9.025118 , 1.7834753 , -3.1799617 , + -4.9401326 , 23.465864 , 5.1685796 , -9.018578 , + 9.037825 , -4.4150195 , 6.859591 , -12.274467 , + -0.88911164, 5.186309 , -3.9988663 , -13.638606 , + -9.925445 , -0.06329413, -3.6709652 , -12.397416 , + -12.719869 , -1.395601 , 2.1150916 , 5.7381287 , + -4.4691963 , -3.82819 , -0.84233856, -1.1604277 , + -13.490127 , 8.731719 , -20.778936 , -11.495662 , + 5.8033476 , -4.752041 , 10.833007 , -6.717991 , + 4.504732 , 13.4244375 , 1.1306485 , 7.3435574 , + 1.400918 , 14.704036 , -9.501399 , 7.2315617 , + -6.417456 , 1.3333273 , 11.872697 , -0.30664724, + 8.8845 , 6.5569253 , 4.7948146 , 0.03662816, + -8.704245 , 6.224871 , -3.2701402 , -11.508579 ], + dtype=float32)} + ``` + +- Python API + ```python + import paddle + from paddlespeech.cli import VectorExecutor 
+ + vector_executor = VectorExecutor() + audio_emb = vector_executor( + model='ecapatdnn_voxceleb12', + sample_rate=16000, + config=None, # Set `config` and `ckpt_path` to None to use pretrained model. + ckpt_path=None, + audio_file='./85236145389.wav', + force_yes=False, + device=paddle.get_device()) + print('Audio embedding Result: \n{}'.format(audio_emb)) + ``` + + 输出: + ```bash + # Vector Result: + {'dim': 192, 'embedding': array([ -5.749211 , 9.505463 , -8.200284 , -5.2075014 , + 5.3940268 , -3.04878 , 1.611095 , 10.127234 , + -10.534177 , -15.821609 , 1.2032688 , -0.35080156, + 1.2629458 , -12.643498 , -2.5758228 , -11.343508 , + 2.3385992 , -8.719341 , 14.213509 , 15.404744 , + -0.39327756, 6.338786 , 2.688887 , 8.7104025 , + 17.469526 , -8.77959 , 7.0576906 , 4.648855 , + -1.3089896 , -23.294737 , 8.013747 , 13.891729 , + -9.926753 , 5.655307 , -5.9422326 , -22.842539 , + 0.6293588 , -18.46266 , -10.811862 , 9.8192625 , + 3.0070958 , 3.8072643 , -2.3861165 , 3.0821571 , + -14.739942 , 1.7594414 , -0.6485091 , 4.485623 , + 2.0207152 , 7.264915 , -6.40137 , 23.63524 , + 2.9711294 , -22.708025 , 9.93719 , 20.354511 , + -10.324688 , -0.700492 , -8.783211 , -5.27593 , + 15.999649 , 3.3004563 , 12.747926 , 15.429879 , + 4.7849145 , 5.6699696 , -2.3826702 , 10.605882 , + 3.9112158 , 3.1500628 , 15.859915 , -2.1832209 , + -23.908653 , -6.4799504 , -4.5365124 , -9.224193 , + 14.568347 , -10.568833 , 4.982321 , -4.342062 , + 0.0914714 , 12.645902 , -5.74285 , -3.2141201 , + -2.7173362 , -6.680575 , 0.4757669 , -5.035051 , + -6.7964664 , 16.865469 , -11.54324 , 7.681869 , + 0.44475392, 9.708182 , -8.932846 , 0.4123232 , + -4.361452 , 1.3948607 , 9.511665 , 0.11667654, + 2.9079323 , 6.049952 , 9.275183 , -18.078873 , + 6.2983274 , -0.7500531 , -2.725033 , -7.6027865 , + 3.3404543 , 2.990815 , 4.010979 , 11.000591 , + -2.8873312 , 7.1352735 , -16.79663 , 18.495346 , + -14.293832 , 7.89578 , 2.2714825 , 22.976387 , + -4.875734 , -3.0836344 , -2.9999814 , 13.751918 , + 
6.448228 , -11.924197 , 2.171869 , 2.0423572 , + -6.173772 , 10.778437 , 25.77281 , -4.9495463 , + 14.57806 , 0.3044315 , 2.6132357 , -7.591999 , + -2.076944 , 9.025118 , 1.7834753 , -3.1799617 , + -4.9401326 , 23.465864 , 5.1685796 , -9.018578 , + 9.037825 , -4.4150195 , 6.859591 , -12.274467 , + -0.88911164, 5.186309 , -3.9988663 , -13.638606 , + -9.925445 , -0.06329413, -3.6709652 , -12.397416 , + -12.719869 , -1.395601 , 2.1150916 , 5.7381287 , + -4.4691963 , -3.82819 , -0.84233856, -1.1604277 , + -13.490127 , 8.731719 , -20.778936 , -11.495662 , + 5.8033476 , -4.752041 , 10.833007 , -6.717991 , + 4.504732 , 13.4244375 , 1.1306485 , 7.3435574 , + 1.400918 , 14.704036 , -9.501399 , 7.2315617 , + -6.417456 , 1.3333273 , 11.872697 , -0.30664724, + 8.8845 , 6.5569253 , 4.7948146 , 0.03662816, + -8.704245 , 6.224871 , -3.2701402 , -11.508579 ], + dtype=float32)} + ``` + +### 4.预训练模型 +以下是 PaddleSpeech 提供的可以被命令行和 python API 使用的预训练模型列表: + +| 模型 | 采样率 | +| :--- | :---: | +| ecapatdnn_voxceleb12 | 16k | diff --git a/demos/speaker_verification/run.sh b/demos/speaker_verification/run.sh new file mode 100644 index 0000000000000000000000000000000000000000..856886d333cd30f983576875e809ed2016a51f50 --- /dev/null +++ b/demos/speaker_verification/run.sh @@ -0,0 +1,6 @@ +#!/bin/bash + +wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav + +# speaker verification +paddlespeech vector --task spk --input ./85236145389.wav \ No newline at end of file diff --git a/docs/source/released_model.md b/docs/source/released_model.md index a6092d558cadabcfae2f357c4ccf40b8dfab13f4..826279e6a316e4b1d4f665b05706aa1f0939d7fa 100644 --- a/docs/source/released_model.md +++ b/docs/source/released_model.md @@ -75,6 +75,12 @@ Model Type | Dataset| Example Link | Pretrained Models | Static Models PANN | Audioset| [audioset_tagging_cnn](https://github.com/qiuqiangkong/audioset_tagging_cnn) | [panns_cnn6.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn6.pdparams), 
[panns_cnn10.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn10.pdparams), [panns_cnn14.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn14.pdparams) | [panns_cnn6_static.tar.gz](https://paddlespeech.bj.bcebos.com/cls/inference_model/panns_cnn6_static.tar.gz)(18M), [panns_cnn10_static.tar.gz](https://paddlespeech.bj.bcebos.com/cls/inference_model/panns_cnn10_static.tar.gz)(19M), [panns_cnn14_static.tar.gz](https://paddlespeech.bj.bcebos.com/cls/inference_model/panns_cnn14_static.tar.gz)(289M) PANN | ESC-50 |[pann-esc50](../../examples/esc50/cls0)|[esc50_cnn6.tar.gz](https://paddlespeech.bj.bcebos.com/cls/esc50/esc50_cnn6.tar.gz), [esc50_cnn10.tar.gz](https://paddlespeech.bj.bcebos.com/cls/esc50/esc50_cnn10.tar.gz), [esc50_cnn14.tar.gz](https://paddlespeech.bj.bcebos.com/cls/esc50/esc50_cnn14.tar.gz) +## Speaker Verification Models + +Model Type | Dataset| Example Link | Pretrained Models | Static Models +:-------------:| :------------:| :-----: | :-----: | :-----: +ECAPA-TDNN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz) | - + ## Punctuation Restoration Models Model Type | Dataset| Example Link | Pretrained Models :-------------:| :------------:| :-----: | :-----: diff --git a/examples/voxceleb/sv0/RESULT.md b/examples/voxceleb/sv0/RESULT.md new file mode 100644 index 0000000000000000000000000000000000000000..c37bcecef9b4276adcd7eb05b14893c48c3bdf96 --- /dev/null +++ b/examples/voxceleb/sv0/RESULT.md @@ -0,0 +1,7 @@ +# VoxCeleb + +## ECAPA-TDNN + +| Model | Number of Params | Release | Config | dim | Test set | Cosine | Cosine + S-Norm | +| --- | --- | --- | --- | --- | --- | --- | ---- | +| ECAPA-TDNN | 85M | 0.1.1 | conf/ecapa_tdnn.yaml |192 | test | 1.15 | 1.06 | diff --git a/paddlespeech/cli/README.md b/paddlespeech/cli/README.md index 
5ac7a3bcaf1709b94020715d4480c08cf98cc3f0..19c822040de6699123781f14b6eac5bcf3ca15a6 100644 --- a/paddlespeech/cli/README.md +++ b/paddlespeech/cli/README.md @@ -13,6 +13,12 @@ paddlespeech cls --input input.wav ``` + ## Speaker Verification + + ```bash + paddlespeech vector --task spk --input input_16k.wav + ``` + ## Automatic Speech Recognition ``` paddlespeech asr --lang zh --input input_16k.wav diff --git a/paddlespeech/cli/README_cn.md b/paddlespeech/cli/README_cn.md index 75ab9e41b10152446db762b1b4ed1c180cd49967..4b15d6c7bc68a39075aba7efb37a04e687b5ab35 100644 --- a/paddlespeech/cli/README_cn.md +++ b/paddlespeech/cli/README_cn.md @@ -12,6 +12,12 @@ ## 声音分类 ```bash paddlespeech cls --input input.wav + ``` + + ## 声纹识别 + + ```bash + paddlespeech vector --task spk --input input_16k.wav ``` ## 语音识别 diff --git a/paddlespeech/cli/vector/infer.py b/paddlespeech/cli/vector/infer.py index 91974761e5b08a5c4cdf4874896bc2d9a9ddf7bd..79d3b5dba1de53591df41fe06c28500d62144151 100644 --- a/paddlespeech/cli/vector/infer.py +++ b/paddlespeech/cli/vector/infer.py @@ -42,13 +42,15 @@ pretrained_models = { # "paddlespeech vector --task spk --model ecapatdnn_voxceleb12-16k --sr 16000 --input ./input.wav" "ecapatdnn_voxceleb12-16k": { 'url': - 'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_0.tar.gz', + 'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz', 'md5': - '85ff08ce0ef406b8c6d7b5ffc5b2b48f', + 'a1c0dba7d4de997187786ff517d5b4ec', 'cfg_path': - 'conf/model.yaml', + 'conf/model.yaml', # the yaml config path 'ckpt_path': - 'model/model', + 'model/model', # the format is ${dir}/{model_name}, + # so the first 'model' is dir, the second 'model' is the name + # this means we have a model stored as model/model.pdparams }, } @@ -66,12 +68,13 @@ class VectorExecutor(BaseExecutor): self.parser = argparse.ArgumentParser( prog="paddlespeech.vector", add_help=True) + self.parser.add_argument( "--model", 
type=str, default="ecapatdnn_voxceleb12", choices=["ecapatdnn_voxceleb12"], - help="Choose model type of asr task.") + help="Choose model type of vector task.") self.parser.add_argument( "--task", type=str, @@ -79,7 +82,7 @@ class VectorExecutor(BaseExecutor): choices=["spk"], help="task type in vector domain") self.parser.add_argument( - "--input", type=str, default=None, help="Audio file to recognize.") + "--input", type=str, default=None, help="Audio file to extract embedding.") self.parser.add_argument( "--sample_rate", type=int, @@ -169,26 +172,59 @@ class VectorExecutor(BaseExecutor): @stats_wrapper def __call__(self, audio_file: os.PathLike, - model: str='ecapatdnn-voxceleb12', + model: str='ecapatdnn_voxceleb12', sample_rate: int=16000, config: os.PathLike=None, ckpt_path: os.PathLike=None, - force_yes: bool=False, device=paddle.get_device()): + """Extract the audio embedding + + Args: + audio_file (os.PathLike): audio path, + whose format must be wav and whose sample rate must match the model + model (str, optional): model type, which is loaded from the pretrained model list. + Defaults to 'ecapatdnn_voxceleb12'. + sample_rate (int, optional): model sample rate. Defaults to 16000. + config (os.PathLike, optional): yaml config. Defaults to None. + ckpt_path (os.PathLike, optional): pretrained model path. Defaults to None. + device (optional): paddle running host device. Defaults to paddle.get_device(). 
+ + Returns: + dict: return the audio embedding and the embedding dim + """ + # stage 0: check the audio format audio_file = os.path.abspath(audio_file) if not self._check(audio_file, sample_rate): sys.exit(-1) + # stage 1: set the paddle runtime host device logger.info(f"device type: {device}") paddle.device.set_device(device) + + # stage 2: read the specific pretrained model self._init_from_path(model, sample_rate, config, ckpt_path) + + # stage 3: preprocess the audio and get the audio feat self.preprocess(model, audio_file) + + # stage 4: infer the model and get the audio embedding self.infer(model) + + # stage 5: process the result and set it to the output dict res = self.postprocess() return res def _get_pretrained_path(self, tag: str) -> os.PathLike: + """Get the neural network path from the pretrained model list; + we store all the pretrained models in the variable `pretrained_models` + + Args: + tag (str): model tag in the pretrained model list + + Returns: + os.PathLike: the downloaded pretrained model path on the disk + """ support_models = list(pretrained_models.keys()) assert tag in pretrained_models, \ 'The model "{}" you want to use has not been supported,'\ @@ -210,15 +246,33 @@ class VectorExecutor(BaseExecutor): sample_rate: int=16000, cfg_path: Optional[os.PathLike]=None, ckpt_path: Optional[os.PathLike]=None): + """Init the neural network from the model path + + Args: + model_type (str, optional): model tag in the pretrained model list. + Defaults to 'ecapatdnn_voxceleb12'. + sample_rate (int, optional): model sample rate. + Defaults to 16000. + cfg_path (Optional[os.PathLike], optional): yaml config file path. + Defaults to None. + ckpt_path (Optional[os.PathLike], optional): the pretrained model path, which is stored on the disk. + Defaults to None. 
+ """ + # stage 0: avoid initializing the model again if hasattr(self, "model"): logger.info("Model has been initialized") return # stage 1: get the model and config path + # if we want to init the network from the model stored on the disk, + # we must pass the config path and the ckpt model path if cfg_path is None or ckpt_path is None: + # get the model from the pretrained list sample_rate_str = "16k" if sample_rate == 16000 else "8k" tag = model_type + "-" + sample_rate_str logger.info(f"load the pretrained model: {tag}") + # get the model from the pretrained list + # we download the pretrained model and store it in the res_path res_path = self._get_pretrained_path(tag) self.res_path = res_path @@ -227,6 +281,7 @@ class VectorExecutor(BaseExecutor): self.ckpt_path = os.path.join( res_path, pretrained_models[tag]['ckpt_path'] + '.pdparams') else: + # get the model from the disk self.cfg_path = os.path.abspath(cfg_path) self.ckpt_path = os.path.abspath(ckpt_path + ".pdparams") self.res_path = os.path.dirname( @@ -241,7 +296,6 @@ class VectorExecutor(BaseExecutor): self.config.merge_from_file(self.cfg_path) # stage 3: get the model name to instance the model network with dynamic_import - # Noet: we use the '-' to get the model name instead of '_' logger.info("start to dynamic import the model class") model_name = model_type[:model_type.rindex('_')] logger.info(f"model name {model_name}") @@ -262,31 +316,55 @@ class VectorExecutor(BaseExecutor): @paddle.no_grad() def infer(self, model_type: str): + """Infer the model to get the embedding + Args: + model_type (str): speaker verification model type + """ + # stage 0: get the feat and length from _inputs feats = self._inputs["feats"] lengths = self._inputs["lengths"] logger.info("start to do backbone network model forward") logger.info( f"feats shape:{feats.shape}, lengths shape: {lengths.shape}") + + # stage 1: get the audio embedding # embedding from (1, emb_size, 1) -> (emb_size) embedding = self.model.backbone(feats, 
lengths).squeeze().numpy() logger.info(f"embedding size: {embedding.shape}") + # stage 2: put the embedding and dim info into the _outputs property + # the embedding type is numpy.array self._outputs["embedding"] = embedding def postprocess(self) -> Union[str, os.PathLike]: - return self._outputs["embedding"] + """Return the audio embedding info + + Returns: + dict: the audio embedding info with keys `dim` and `embedding` + """ + embedding = self._outputs["embedding"] + dim = embedding.shape[0] + return {"dim": dim, "embedding": embedding} def preprocess(self, model_type: str, input_file: Union[str, os.PathLike]): + """Extract the audio feat + + Args: + model_type (str): speaker verification model type + input_file (Union[str, os.PathLike]): audio file path + """ audio_file = input_file if isinstance(audio_file, (str, os.PathLike)): logger.info(f"Preprocess audio file: {audio_file}") - # stage 1: load the audio + # stage 1: load the audio sample points + # Note: this process must match the training process waveform, sr = load_audio(audio_file) logger.info(f"load the audio sample points, shape is: {waveform.shape}") # stage 2: get the audio feat + # Note: now we only support the fbank feature try: feat = melspectrogram( x=waveform, @@ -302,8 +380,13 @@ class VectorExecutor(BaseExecutor): feat = paddle.to_tensor(feat).unsqueeze(0) # in inference period, the lengths is all one without padding lengths = paddle.ones([1]) + + # stage 3: do feature normalization, + # now we assume that the feat must be normalized feat = feature_normalize(feat, mean_norm=True, std_norm=False) + # stage 4: store the feat and length in the _inputs, + # which will be used by other functions logger.info(f"feats shape: {feat.shape}") self._inputs["feats"] = feat self._inputs["lengths"] = lengths @@ -311,6 +394,15 @@ class VectorExecutor(BaseExecutor): logger.info("audio extract the feat success") def _check(self, audio_file: str, sample_rate: int): + """Check if the audio sample rate matches the model sample rate + + Args: + 
audio_file (str): the audio file path from which the embedding will be extracted + sample_rate (int): the desired model sample rate + + Returns: + bool: return whether the audio sample rate matches the model sample rate + """ self.sample_rate = sample_rate if self.sample_rate != 16000 and self.sample_rate != 8000: logger.error( diff --git a/paddlespeech/vector/cluster/__init__.py b/paddlespeech/vector/cluster/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..97043fd7ba6885aac81cad5a49924c23c67d4d47 --- /dev/null +++ b/paddlespeech/vector/cluster/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/paddlespeech/vector/io/__init__.py b/paddlespeech/vector/io/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..97043fd7ba6885aac81cad5a49924c23c67d4d47 --- /dev/null +++ b/paddlespeech/vector/io/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/paddlespeech/vector/modules/__init__.py b/paddlespeech/vector/modules/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..97043fd7ba6885aac81cad5a49924c23c67d4d47 --- /dev/null +++ b/paddlespeech/vector/modules/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/paddlespeech/vector/training/__init__.py b/paddlespeech/vector/training/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..97043fd7ba6885aac81cad5a49924c23c67d4d47 --- /dev/null +++ b/paddlespeech/vector/training/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
diff --git a/paddlespeech/vector/utils/__init__.py b/paddlespeech/vector/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..97043fd7ba6885aac81cad5a49924c23c67d4d47 --- /dev/null +++ b/paddlespeech/vector/utils/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
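Reviewer note: the demo READMEs in this patch stop at extracting a single embedding, while the RESULT.md table reports Cosine scoring. As a hedged sketch of how two such embeddings would typically be compared for verification — the `cosine_score` helper, the random placeholder vectors, and the 0.6 threshold below are illustrative assumptions, not part of this patch; in practice the inputs would be the `embedding` arrays returned by `VectorExecutor` for two audio files, and the threshold would be tuned on a development set:

```python
import numpy as np

def cosine_score(emb1: np.ndarray, emb2: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    return float(np.dot(emb1, emb2) /
                 (np.linalg.norm(emb1) * np.linalg.norm(emb2)))

# Placeholder 192-dim vectors standing in for two VectorExecutor embeddings.
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(192).astype(np.float32)
emb_b = emb_a + 0.1 * rng.standard_normal(192).astype(np.float32)

score = cosine_score(emb_a, emb_b)
same_speaker = score > 0.6  # threshold is illustrative, not from the recipe
print(f"score={score:.3f}, same_speaker={same_speaker}")
```

The same scoring rule underlies the "Cosine" column in RESULT.md; S-Norm additionally normalizes each score against a cohort of impostor scores before thresholding.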