Add speech models. (#1678)

5c8136b0 · KP · GitHub · 8f403bac · 5c8136b0 · 5c8136b0
54 changed files
--- a/docs/docs_ch/get_start/mac_quickstart.md
+++ b/docs/docs_ch/get_start/mac_quickstart.md
@@ -192,7 +192,7 @@
    - <img src="../../imgs/Install_Related/mac/output_img.png" alt="output image" width="600" align="center"/>

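 ## Step 6: Explore PaddlePaddle pre-trained models
- Congratulations! You have now completed the installation and introductory example for PaddleHub on Windows. Go ahead and explore more deep learning models. [[Explore more models on the PaddlePaddle website]](https://www.paddlepaddle.org.cn/hublist)
+- Congratulations! You have now completed the installation and introductory example for PaddleHub on macOS. Go ahead and explore more deep learning models. [[Explore more models on the PaddlePaddle website]](https://www.paddlepaddle.org.cn/hublist)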




--- a/modules/audio/asr/deepspeech2_aishell/README.md
+++ b/modules/audio/asr/deepspeech2_aishell/README.md
+# deepspeech2_aishell
+
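+|Model name|deepspeech2_aishell|
+| :--- | :---: |
+|Category|Audio - Speech recognition|
+|Network|DeepSpeech2|
+|Dataset|AISHELL-1|
+|Fine-tuning supported|No|
+|Model size|306MB|
+|Last updated|2021-10-20|
+|Metric|Chinese CER 0.065|
+
+## 1. Basic model information
+
+### Model introduction
+
+DeepSpeech2 is an end-to-end speech recognition model for English and Mandarin Chinese, proposed by Baidu in 2015. deepspeech2_aishell uses the DeepSpeech2 offline model architecture, which consists mainly of 2 convolutional layers and 3 GRU layers, and is pre-trained on the open-source Mandarin speech dataset [AISHELL-1](http://www.aishelltech.com/kysjcp). The model achieves a CER of 0.065 on the AISHELL-1 test set.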
+
+
+<p align="center">
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/Hub/docs/images/ds2offlineModel.png" hspace='10'/> <br />
+</p>
+
+For more details, see [Deep Speech 2: End-to-End Speech Recognition in English and Mandarin](https://arxiv.org/abs/1512.02595).
+
+## 2. Installation
+
+- ### 1. System dependencies
+
+  - libsndfile, swig >= 3.0
+    - Linux
+      ```shell
+      $ sudo apt-get install libsndfile1 swig
+      # or
+      $ sudo yum install libsndfile swig
+      ```
+    - macOS
+      ```shell
+      $ brew install libsndfile swig
+      ```
+
+- ### 2. Environment dependencies
+  - swig_decoders:
+    ```shell
+    git clone https://github.com/PaddlePaddle/DeepSpeech.git && cd DeepSpeech && git reset --hard b53171694e7b87abe7ea96870b2f4d8e0e2b1485 && cd deepspeech/decoders/ctcdecoder/swig && sh setup.sh
+    ```
+
+  - paddlepaddle >= 2.1.0
+
+  - paddlehub >= 2.1.0    | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)
+
+- ### 3. Installation
+
+  - ```shell
+    $ hub install deepspeech2_aishell
+    ```
+  - If you run into problems during installation, see: [Windows installation from scratch](../../../../docs/docs_ch/get_start/windows_quickstart.md) | [Linux installation from scratch](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [macOS installation from scratch](../../../../docs/docs_ch/get_start/mac_quickstart.md)
+
+
+## 3. Model API prediction
+
+- ### 1. Prediction code example
+
+    ```python
+    import paddlehub as hub
+
+    # Chinese speech audio in WAV format with a 16 kHz sample rate
+    wav_file = '/PATH/TO/AUDIO'
+
+    model = hub.Module(
+        name='deepspeech2_aishell',
+        version='1.0.0')
+    text = model.speech_recognize(wav_file)
+
+    print(text)
+    ```
+
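+- ### 2. API
+  - ```python
+    def check_audio(audio_file)
+    ```
+    - Checks that the input audio file can be read and that its sample rate is 16000 Hz.
+
+    - **Parameters**
+
+      - `audio_file`: path to a local audio file (*.wav), e.g. `/path/to/input.wav`
+
+  - ```python
+    def speech_recognize(
+        audio_file,
+        device='cpu',
+    )
+    ```
+    - Transcribes the input audio into text.
+
+    - **Parameters**
+
+      - `audio_file`: path to a local audio file (*.wav), e.g. `/path/to/input.wav`
+      - `device`: device used for prediction; defaults to `cpu`. Set it to `gpu` to run prediction on a GPU.
+
+    - **Returns**
+
+      - `text`: str, the recognized text of the input audio.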
+
+
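+## 4. Service deployment
+
+- PaddleHub Serving can deploy an online speech recognition service.
+
+- ### Step 1: Start PaddleHub Serving
+
+  - ```shell
+    $ hub serving start -m deepspeech2_aishell
+    ```
+
+  - This deploys a speech recognition service API; the default port is 8866.
+
+  - **NOTE:** To run prediction on a GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise it does not need to be set.
+
+- ### Step 2: Send a prediction request
+
+  - With the server configured, the following few lines of code send a prediction request and fetch the result: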
+
+  - ```python
+    import requests
+    import json
+
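+    # Path to the audio file to recognize; make sure the machine running the service can access it
+    file = '/path/to/input.wav'
+
+    # Pass arguments to the prediction method by key; here the key is "audio_file"
+    data = {"audio_file": file}
+
+    # Send a POST request with a JSON content type; change the IP address in the URL to that of the serving machine
+    url = "http://127.0.0.1:8866/predict/deepspeech2_aishell"
+
+    # Set the POST request headers to application/json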
+    headers = {"Content-Type": "application/json"}
+
+    r = requests.post(url=url, headers=headers, data=json.dumps(data))
+    print(r.json())
+    ```
+
+## 5. Release history
+
+* 1.0.0
+
+  Initial release
+
+  ```shell
+  $ hub install deepspeech2_aishell
+  ```
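The README above requires WAV input sampled at 16 kHz. A minimal sketch of checking this before calling `speech_recognize`, using only Python's standard `wave` module, could look like the following. The helper names (`wav_sample_rate`, `is_16k_wav`, `write_silence`) are hypothetical and not part of the module's API, and `wave` only reads uncompressed PCM WAV files, whereas the module itself reads audio through `soundfile`.

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def is_16k_wav(path):
    """True if `path` is a readable PCM WAV file sampled at 16000 Hz."""
    try:
        return wav_sample_rate(path) == 16000
    except (wave.Error, EOFError, OSError):
        # Not a WAV file, truncated, or missing.
        return False


def write_silence(path, sample_rate, seconds=0.1):
    """Write a short mono 16-bit silent WAV file (demo input only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))


demo_dir = tempfile.mkdtemp()
good_wav = os.path.join(demo_dir, "good.wav")
bad_wav = os.path.join(demo_dir, "bad.wav")
write_silence(good_wav, 16000)
write_silence(bad_wav, 8000)
print(is_16k_wav(good_wav), is_16k_wav(bad_wav))
```

A file that fails this check would need to be resampled to 16 kHz before recognition.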
--- a/modules/audio/asr/deepspeech2_aishell/__init__.py
+++ b/modules/audio/asr/deepspeech2_aishell/__init__.py
--- a/modules/audio/asr/deepspeech2_aishell/assets/conf/augmentation.json
+++ b/modules/audio/asr/deepspeech2_aishell/assets/conf/augmentation.json
+{}
--- a/modules/audio/asr/deepspeech2_aishell/assets/conf/deepspeech2.yaml
+++ b/modules/audio/asr/deepspeech2_aishell/assets/conf/deepspeech2.yaml
+# https://yaml.org/type/float.html
+data:
+  train_manifest: data/manifest.train
+  dev_manifest: data/manifest.dev
+  test_manifest: data/manifest.test
+  min_input_len: 0.0
+  max_input_len: 27.0 # second
+  min_output_len: 0.0
+  max_output_len: .inf
+  min_output_input_ratio: 0.00
+  max_output_input_ratio: .inf
+
+collator:
+  batch_size: 64 # one gpu
+  mean_std_filepath: data/mean_std.json
+  unit_type: char
+  vocab_filepath: data/vocab.txt
+  augmentation_config: conf/augmentation.json
+  random_seed: 0
+  spm_model_prefix:
+  spectrum_type: linear
+  feat_dim:
+  delta_delta: False
+  stride_ms: 10.0
+  window_ms: 20.0
+  n_fft: None
+  max_freq: None
+  target_sample_rate: 16000
+  use_dB_normalization: True
+  target_dB: -20
+  dither: 1.0
+  keep_transcription_text: False
+  sortagrad: True
+  shuffle_method: batch_shuffle
+  num_workers: 2
+
+model:
+  num_conv_layers: 2
+  num_rnn_layers: 3
+  rnn_layer_size: 1024
+  use_gru: True
+  share_rnn_weights: False
+  blank_id: 0
+  ctc_grad_norm_type: instance
+
+training:
+  n_epoch: 80
+  accum_grad: 1
+  lr: 2e-3
+  lr_decay: 0.83
+  weight_decay: 1e-06
+  global_grad_clip: 3.0
+  log_interval: 100
+  checkpoint:
+    kbest_n: 50
+    latest_n: 5
+
+decoding:
+  batch_size: 128
+  error_rate_type: cer
+  decoding_method: ctc_beam_search
+  lang_model_path: data/lm/zh_giga.no_cna_cmn.prune01244.klm
+  alpha: 1.9
+  beta: 5.0
+  beam_size: 300
+  cutoff_prob: 0.99
+  cutoff_top_n: 40
+  num_proc_bsearch: 10
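The `collator` settings above (`target_sample_rate: 16000`, `window_ms: 20.0`, `stride_ms: 10.0`) fix the framing of the linear spectrogram. The sketch below works out the resulting sizes. It assumes the FFT length equals the window length (since `n_fft` is unset) and that no padding is applied at the edges; the function name `frame_geometry` is hypothetical.

```python
def frame_geometry(sample_rate, window_ms, stride_ms, duration_s):
    """Return (window samples, stride samples, frame count, frequency bins)."""
    window = int(sample_rate * window_ms / 1000)
    stride = int(sample_rate * stride_ms / 1000)
    n_samples = int(sample_rate * duration_s)
    # Number of full windows that fit when sliding by `stride` without padding.
    n_frames = 0 if n_samples < window else 1 + (n_samples - window) // stride
    # A real-valued FFT of length `window` yields window // 2 + 1 bins.
    freq_bins = window // 2 + 1
    return window, stride, n_frames, freq_bins


# Values taken from deepspeech2.yaml; 27.0 s is `max_input_len`.
window, stride, n_frames, freq_bins = frame_geometry(16000, 20.0, 10.0, 27.0)
print(window, stride, n_frames, freq_bins)
```

Under these assumptions, the longest allowed utterance produces a few thousand frames, each with `freq_bins` features.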
--- a/modules/audio/asr/deepspeech2_aishell/assets/data/mean_std.json
+++ b/modules/audio/asr/deepspeech2_aishell/assets/data/mean_std.json
+{"mean_stat": [-13505966.65209869, -12778154.889588555, -13487728.30750011, -12897344.94123812, -12472281.490772562, -12631566.475106332, -13391790.349327326, -14045382.570026815, -14159320.465516506, -14273422.438486755, -14639805.161347123, -15145380.07768254, -15612893.133258691, -15938542.05012206, -16115293.502621327, -16188225.698757892, -16317206.280373082, -16500598.476283036, -16671564.297937019, -16804599.860397574, -16916423.142814968, -17011785.59439087, -17075067.62262626, -17154580.16740178, -17257812.961825978, -17355683.228599995, -17441455.258318607, -17473199.925130684, -17488835.5763828, -17491232.15414511, -17485000.29006962, -17499471.646940477, -17551398.97122984, -17641732.10682403, -17757209.077974595, -17843801.500521667, -17935647.58641936, -18020362.347413756, -18117633.806080323, -18232427.58935143, -18316024.35215119, -18378789.145393644, -18421147.25807373, -18445805.18294822, -18460946.27810118, -18467914.04034822, -18469404.319909714, -18469606.974339806, -18470754.294192698, -18458320.91921723, -18441354.111811973, -18428332.216321833, -18422281.413955193, -18433421.585668042, -18460521.025954794, -18494800.856363494, -18539532.288011573, -18583823.79899225, -18614474.56256926, -18646872.180154275, -18661137.85367877, -18673590.719379324, -18702967.62040798, -18736434.748098046, -18777912.13098326, -18794675.486509323, -18837225.856196072, -18874872.796128694, -18927340.44407057, -18994929.076545004, -19060701.164406348, -19118006.18996682, -19175792.05766062, -19230755.996405277, -19270174.594219487, -19334788.35904946, -19401456.988906194, -19484580.095938426, -19582040.4715673, -19696598.86662636, -19810401.513227757, -19931755.37941177, -20021867.47620737, -20082298.984455004, -20114708.336475413, -20143802.72793865, -20146821.988139726, -20165613.317683898, -20189938.602584295, -20220059.08673595, -20242848.528134122, -20250859.979931064, -20267382.93048284, -20267964.544716164, -20261372.89563879, -20252878.74023849, 
-20247550.771284755, -20231778.31093504, -20231376.103159923, -20236926.52293088, -20248068.41488535, -20255076.901920393, -20262924.167151034, -20263926.583205637, -20263790.273742784, -20268560.080967404, -20268997.150654405, -20269810.816284582, -20267771.864327505, -20256472.703380838, -20241790.559690386, -20241865.794732895, -20244924.716114976, -20249736.631184842, -20257257.816903576, -20268027.212145977, -20277399.95533857, -20281840.8112546, -20270512.52002465, -20255938.63066214, -20242421.685443826, -20241986.654626504, -20237836.034444932, -20231458.31132546, -20218092.819713395, -20204994.19634715, -20198880.142133974, -20197376.49014031, -20198117.60450857, -20197443.473929476, -20191142.03632657, -20174428.452719454, -20159204.32090646, -20137981.294740904, -20124944.79897834, -20112774.604521394, -20109389.248600915, -20115248.61302806, -20117743.853294585, -20123076.93515528, -20132224.95454374, -20147099.26793121, -20169581.367630124, -20190957.518733896, -20215197.057997894, -20242033.589256056, -20282032.217160087, -20316778.653784916, -20360354.215504933, -20425089.908502825, -20534553.0465662, -20737928.349233944, -21091705.14104186, -21646013.197923105, -22403182.076235127, -23313516.63322832, -24244679.879594248, -25027534.00417361, -25502455.708560493, -25665136.744125813, -26602318.88405537], "var_stat": [209924783.1093623, 185218712.4577822, 209991180.89829063, 196198511.40798286, 186098265.7827955, 191905798.58923203, 214281935.29191792, 235042114.51049897, 240179456.24597096, 244657890.3963041, 256099586.32657292, 271849135.9872555, 287174069.13527167, 298171137.28863454, 304112589.91933817, 306553976.2206335, 310813670.30674237, 316958840.3099824, 322651440.3639528, 327213725.196089, 331252123.26114285, 334856188.3081607, 337217897.6545214, 340385427.82557064, 344400488.5633641, 348086880.08086526, 351349070.53148264, 352648076.18415344, 353409462.33704513, 353598061.4967693, 353405322.74993587, 353917215.6834277, 355784796.898883, 
359222461.3224974, 363671441.7428676, 366908651.69908494, 370304677.0615045, 373477194.79721, 377174088.9808273, 381531608.6574547, 384703574.426059, 387104126.9474883, 388723211.11308575, 389687817.27351815, 390351031.4418706, 390659006.3690262, 390704649.89417714, 390702370.1919126, 390731862.59274197, 390216004.4126628, 389516083.054853, 389017745.636457, 388788872.1127645, 389269311.2239042, 390401819.5968815, 391842612.97859454, 393708801.05223197, 395569598.4694, 396868892.67152405, 398210915.02133286, 398743299.4753882, 399330344.88417244, 400565940.1325846, 401901693.4656316, 403513855.43933284, 404103248.96526104, 405986814.274556, 407507145.4104169, 409598353.6517908, 412453848.0248063, 415138273.0558441, 417479272.96907294, 419785633.3276395, 422003065.1681787, 423610264.8868346, 426260552.96545905, 428973536.3620236, 432368654.40899384, 436359561.5468266, 441119512.777527, 445884989.25794005, 451037422.65838546, 454872292.24179226, 457497136.8780015, 458904066.0675219, 460155836.4432799, 460272943.80738074, 461087498.6828549, 462144907.7850926, 463483598.81228757, 464530694.44478536, 464971538.85301507, 465771535.6019992, 465936698.93801653, 465741012.7287712, 465448625.0011534, 465296363.8603534, 464718299.2207512, 464720391.25778216, 465016640.5248736, 465564374.0248998, 465982788.8695927, 466425068.01245564, 466595649.90489674, 466707658.8296169, 467015570.78026086, 467099213.08769494, 467201640.15951264, 467163862.3709329, 466727597.56313753, 466174871.71213347, 466255498.45248336, 466439062.65458614, 466693130.99620277, 467068587.1422199, 467536070.1402474, 467955819.1549621, 468187227.1069643, 467742976.2778335, 467159585.250493, 466592359.52916145, 466583195.8099961, 466424348.9572719, 466155323.6074322, 465569620.1801811, 465021642.5158305, 464757658.6383867, 464713882.60103834, 464724239.2941314, 464679163.728191, 464407007.8705965, 463660736.0136739, 463001339.2385198, 462077058.47595775, 461505071.67199403, 460946277.95973784, 
460816158.9197017, 461123589.268546, 461232998.1572812, 461445601.0442877, 461803238.28569543, 462436966.22005004, 463391404.7434971, 464299608.85523456, 465319405.3931429, 466432961.70208246, 468168080.3331244, 469640808.6809098, 471501539.22440934, 474301795.1694898, 479155711.93441755, 488314271.10405815, 504537056.23994666, 530509400.5201074, 566892036.4437443, 611792826.0442055, 658913502.9004005, 699716882.9169292, 725237302.8248898, 734259159.9571886, 789267050.8287783], "frame_num": 899422}
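The `mean_std.json` file above stores accumulated statistics rather than per-feature means and standard deviations. Assuming `mean_stat` holds per-feature sums, `var_stat` holds per-feature sums of squares, and `frame_num` is the number of frames accumulated, the normalization parameters can be recovered as sketched below. The function name `mean_std_from_stats` and the `eps` floor are illustrative choices, not the library's code.

```python
import json
import math


def mean_std_from_stats(stats, eps=1e-20):
    """Convert accumulated sums and sums of squares into (means, stds)."""
    frame_num = stats["frame_num"]
    means, stds = [], []
    for total, total_sq in zip(stats["mean_stat"], stats["var_stat"]):
        mean = total / frame_num
        # Var[x] = E[x^2] - E[x]^2, floored to avoid sqrt of a tiny negative.
        var = max(total_sq / frame_num - mean * mean, eps)
        means.append(mean)
        stds.append(math.sqrt(var))
    return means, stds


# Toy statistics for one feature with values 1, 2, 3 over three frames.
toy = json.loads('{"mean_stat": [6.0], "var_stat": [14.0], "frame_num": 3}')
means, stds = mean_std_from_stats(toy)
print(means, stds)
```

Each feature would then be normalized as `(x - mean) / std` before being fed to the model.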
--- a/modules/audio/asr/deepspeech2_aishell/assets/data/vocab.txt
+++ b/modules/audio/asr/deepspeech2_aishell/assets/data/vocab.txt
--- a/modules/audio/asr/deepspeech2_aishell/deepspeech_tester.py
+++ b/modules/audio/asr/deepspeech2_aishell/deepspeech_tester.py
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Evaluation for DeepSpeech2 model."""
+import os
+import sys
+from pathlib import Path
+
+import paddle
+
+from deepspeech.frontend.featurizer.text_featurizer import TextFeaturizer
+from deepspeech.io.collator import SpeechCollator
+from deepspeech.models.ds2 import DeepSpeech2Model
+from deepspeech.utils import mp_tools
+from deepspeech.utils.utility import UpdateConfig
+
+
+class DeepSpeech2Tester:
+    def __init__(self, config):
+        self.config = config
+        self.collate_fn_test = SpeechCollator.from_config(config)
+        self._text_featurizer = TextFeaturizer(unit_type=config.collator.unit_type, vocab_filepath=None)
+
+    def compute_result_transcripts(self, audio, audio_len, vocab_list, cfg):
+        result_transcripts = self.model.decode(
+            audio,
+            audio_len,
+            vocab_list,
+            decoding_method=cfg.decoding_method,
+            lang_model_path=cfg.lang_model_path,
+            beam_alpha=cfg.alpha,
+            beam_beta=cfg.beta,
+            beam_size=cfg.beam_size,
+            cutoff_prob=cfg.cutoff_prob,
+            cutoff_top_n=cfg.cutoff_top_n,
+            num_processes=cfg.num_proc_bsearch)
+        # Replace the '<space>' token with ' '.
+        result_transcripts = [self._text_featurizer.detokenize(sentence) for sentence in result_transcripts]
+
+        return result_transcripts
+
+    @mp_tools.rank_zero_only
+    @paddle.no_grad()
+    def test(self, audio_file):
+        self.model.eval()
+        cfg = self.config
+        collate_fn_test = self.collate_fn_test
+        audio, _ = collate_fn_test.process_utterance(audio_file=audio_file, transcript=" ")
+        audio_len = audio.shape[0]
+        audio = paddle.to_tensor(audio, dtype='float32')
+        audio_len = paddle.to_tensor(audio_len)
+        audio = paddle.unsqueeze(audio, axis=0)
+        vocab_list = collate_fn_test.vocab_list
+        result_transcripts = self.compute_result_transcripts(audio, audio_len, vocab_list, cfg.decoding)
+        return result_transcripts
+
+    def setup_model(self):
+        config = self.config.clone()
+        with UpdateConfig(config):
+            config.model.feat_size = self.collate_fn_test.feature_size
+            config.model.dict_size = self.collate_fn_test.vocab_size
+
+        model = DeepSpeech2Model.from_config(config.model)
+        self.model = model
+
+    def resume(self, checkpoint):
+        """Load model weights from the given checkpoint file."""
+        model_dict = paddle.load(checkpoint)
+        self.model.set_state_dict(model_dict)
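To make the post-processing in `compute_result_transcripts` more concrete, here is a minimal sketch of how CTC output is turned into text. It uses greedy CTC decoding (collapse repeats, then drop blanks), which is a simpler technique than the `ctc_beam_search` method this module is configured with, followed by the `'<space>'` replacement mentioned in the comment. The function names and the toy vocabulary are hypothetical; `TextFeaturizer.detokenize` may do more than this.

```python
BLANK_ID = 0  # matches `blank_id: 0` in deepspeech2.yaml
SPACE_TOKEN = "<space>"


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID):
    """Collapse consecutive repeats, then remove blank labels."""
    collapsed = []
    previous = None
    for label in frame_ids:
        if label != previous and label != blank_id:
            collapsed.append(label)
        previous = label
    return collapsed


def ids_to_text(ids, vocab):
    """Join vocabulary entries and replace the space token with ' '."""
    return "".join(vocab[i] for i in ids).replace(SPACE_TOKEN, " ")


# Toy vocabulary and per-frame argmax labels (illustrative only).
vocab = ["<blank>", "h", "i", SPACE_TOKEN, "a"]
frames = [1, 1, 0, 2, 2, 3, 0, 4, 4]
text = ids_to_text(ctc_greedy_collapse(frames), vocab)
print(text)
```

A blank between two identical labels keeps both, which is how CTC represents doubled characters.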
--- a/modules/audio/asr/deepspeech2_aishell/module.py
+++ b/modules/audio/asr/deepspeech2_aishell/module.py
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+from pathlib import Path
+import sys
+
+import numpy as np
+from paddlehub.env import MODULE_HOME
+from paddlehub.module.module import moduleinfo, serving
+from paddlehub.utils.log import logger
+from paddle.utils.download import get_path_from_url
+
+try:
+    import swig_decoders
+except ModuleNotFoundError as e:
+    logger.error(e)
+    logger.info('The module requires an additional dependency: swig_decoders. '
+                'Please install it via:\n\'git clone https://github.com/PaddlePaddle/DeepSpeech.git '
+                '&& cd DeepSpeech && git reset --hard b53171694e7b87abe7ea96870b2f4d8e0e2b1485 '
+                '&& cd deepspeech/decoders/ctcdecoder/swig && sh setup.sh\'')
+    sys.exit(1)
+
+import paddle
+import soundfile as sf
+
+# TODO: Remove system path when deepspeech can be installed via pip.
+sys.path.append(os.path.join(MODULE_HOME, 'deepspeech2_aishell'))
+from deepspeech.exps.deepspeech2.config import get_cfg_defaults
+from deepspeech.utils.utility import UpdateConfig
+from .deepspeech_tester import DeepSpeech2Tester
+
+LM_URL = 'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm'
+LM_MD5 = '29e02312deb2e59b3c8686c7966d4fe3'
+
+
+@moduleinfo(name="deepspeech2_aishell", version="1.0.0", summary="", author="Baidu", author_email="", type="audio/asr")
+class DeepSpeech2(paddle.nn.Layer):
+    def __init__(self):
+        super(DeepSpeech2, self).__init__()
+
+        # resource
+        res_dir = os.path.join(MODULE_HOME, 'deepspeech2_aishell', 'assets')
+        conf_file = os.path.join(res_dir, 'conf/deepspeech2.yaml')
+        checkpoint = os.path.join(res_dir, 'checkpoints/avg_1.pdparams')
+        # The language model is downloaded separately because of its large size.
+        lm_path = os.path.join(res_dir, 'data', 'lm')
+        lm_file = os.path.join(lm_path, LM_URL.split('/')[-1])
+        if not os.path.isfile(lm_file):
+            logger.info(f'Downloading lm from {LM_URL}.')
+            get_path_from_url(url=LM_URL, root_dir=lm_path, md5sum=LM_MD5)
+
+        # config
+        self.model_type = 'offline'
+        self.config = get_cfg_defaults(self.model_type)
+        self.config.merge_from_file(conf_file)
+
+        # TODO: Remove path updating snippet.
+        with UpdateConfig(self.config):
+            self.config.collator.mean_std_filepath = os.path.join(res_dir, self.config.collator.mean_std_filepath)
+            self.config.collator.vocab_filepath = os.path.join(res_dir, self.config.collator.vocab_filepath)
+            self.config.collator.augmentation_config = os.path.join(res_dir, self.config.collator.augmentation_config)
+            self.config.decoding.lang_model_path = os.path.join(res_dir, self.config.decoding.lang_model_path)
+
+        # model
+        self.tester = DeepSpeech2Tester(self.config)
+        self.tester.setup_model()
+        self.tester.resume(checkpoint)
+
+    @staticmethod
+    def check_audio(audio_file):
+        sig, sample_rate = sf.read(audio_file)
+        assert sample_rate == 16000, 'Expecting sample rate of input audio to be 16000, but got {}'.format(sample_rate)
+
+    @serving
+    def speech_recognize(self, audio_file, device='cpu'):
+        assert os.path.isfile(audio_file), 'File does not exist: {}'.format(audio_file)
+        self.check_audio(audio_file)
+
+        paddle.set_device(device)
+        return self.tester.test(audio_file)[0]
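The constructor above skips the language model download whenever the file already exists, so a truncated file from an interrupted download would pass that check. A minimal sketch of also verifying the MD5 digest of an existing file is shown below. The helpers `file_md5` and `needs_download` are hypothetical, and whether `get_path_from_url` performs its own verification of existing files is not shown in this diff.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path, expected_md5):
    """True if the file is missing or its digest does not match."""
    return not os.path.isfile(path) or file_md5(path) != expected_md5


# Demo with a small temporary file instead of the real language model.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as f:
    f.write(b"hello")

HELLO_MD5 = "5d41402abc4b2a76b9719d911017c592"  # MD5 of b"hello"
LM_MD5 = "29e02312deb2e59b3c8686c7966d4fe3"     # value from module.py
print(needs_download(demo_path, HELLO_MD5), needs_download(demo_path, LM_MD5))
```

In the module, such a check could replace the plain `os.path.isfile(lm_file)` test, at the cost of hashing a large file on every start.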
--- a/modules/audio/asr/deepspeech2_aishell/requirements.txt
+++ b/modules/audio/asr/deepspeech2_aishell/requirements.txt
+# system level: libsndfile swig
+loguru
+yacs
+jsonlines
+scipy==1.2.1
+sentencepiece
+resampy==0.2.2
+SoundFile==0.9.0.post1
+soxbindings
+kaldiio
+typeguard
+editdistance
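The README reports a character error rate (CER), and `editdistance` in the requirements above is likely the dependency used to compute such metrics. As a sketch of what the metric means, the pure-Python version below computes the Levenshtein distance and derives CER and WER from it. It does not use the `editdistance` package, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(cer("今天天气很好", "今天天汽很好"), wer("the cat sat", "the cat sit"))
```

A CER of 0.065 therefore corresponds to roughly 6.5 character edits per 100 reference characters.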
--- a/modules/audio/asr/deepspeech2_librispeech/README.md
+++ b/modules/audio/asr/deepspeech2_librispeech/README.md
+# deepspeech2_librispeech
+
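+|Model name|deepspeech2_librispeech|
+| :--- | :---: |
+|Category|Audio - Speech recognition|
+|Network|DeepSpeech2|
+|Dataset|LibriSpeech|
+|Fine-tuning supported|No|
+|Model size|518MB|
+|Last updated|2021-10-20|
+|Metric|English WER 0.072|
+
+## 1. Basic model information
+
+### Model introduction
+
+DeepSpeech2 is an end-to-end speech recognition model for English and Mandarin Chinese, proposed by Baidu in 2015. deepspeech2_librispeech uses the DeepSpeech2 offline model architecture, which consists mainly of 2 convolutional layers and 3 RNN layers (the bundled `deepspeech2.yaml` sets `use_gru: False`, so these are not GRU layers), and is pre-trained on the open-source English speech dataset [LibriSpeech ASR corpus](http://www.openslr.org/12/). The model achieves a WER of 0.072 on the LibriSpeech test set.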
+
+
+<p align="center">
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/Hub/docs/images/ds2offlineModel.png" hspace='10'/> <br />
+</p>
+
+For more details, see [Deep Speech 2: End-to-End Speech Recognition in English and Mandarin](https://arxiv.org/abs/1512.02595).
+
+## 2. Installation
+
+- ### 1. System dependencies
+
+  - libsndfile, swig >= 3.0
+    - Linux
+      ```shell
+      $ sudo apt-get install libsndfile1 swig
+      # or
+      $ sudo yum install libsndfile swig
+      ```
+    - macOS
+      ```shell
+      $ brew install libsndfile swig
+      ```
+
+- ### 2. Environment dependencies
+  - swig_decoders:
+    ```shell
+    git clone https://github.com/PaddlePaddle/DeepSpeech.git && cd DeepSpeech && git reset --hard b53171694e7b87abe7ea96870b2f4d8e0e2b1485 && cd deepspeech/decoders/ctcdecoder/swig && sh setup.sh
+    ```
+
+  - paddlepaddle >= 2.1.0
+
+  - paddlehub >= 2.1.0    | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)
+
+- ### 3. Installation
+
+  - ```shell
+    $ hub install deepspeech2_librispeech
+    ```
+  - If you run into problems during installation, see: [Windows installation from scratch](../../../../docs/docs_ch/get_start/windows_quickstart.md) | [Linux installation from scratch](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [macOS installation from scratch](../../../../docs/docs_ch/get_start/mac_quickstart.md)
+
+
+## 3. Model API prediction
+
+- ### 1. Prediction code example
+
+    ```python
+    import paddlehub as hub
+
+    # English speech audio in WAV format with a 16 kHz sample rate
+    wav_file = '/PATH/TO/AUDIO'
+
+    model = hub.Module(
+        name='deepspeech2_librispeech',
+        version='1.0.0')
+    text = model.speech_recognize(wav_file)
+
+    print(text)
+    ```
+
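+- ### 2. API
+  - ```python
+    def check_audio(audio_file)
+    ```
+    - Checks that the input audio file can be read and that its sample rate is 16000 Hz.
+
+    - **Parameters**
+
+      - `audio_file`: path to a local audio file (*.wav), e.g. `/path/to/input.wav`
+
+  - ```python
+    def speech_recognize(
+        audio_file,
+        device='cpu',
+    )
+    ```
+    - Transcribes the input audio into text.
+
+    - **Parameters**
+
+      - `audio_file`: path to a local audio file (*.wav), e.g. `/path/to/input.wav`
+      - `device`: device used for prediction; defaults to `cpu`. Set it to `gpu` to run prediction on a GPU.
+
+    - **Returns**
+
+      - `text`: str, the recognized text of the input audio.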
+
+
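+## 4. Service deployment
+
+- PaddleHub Serving can deploy an online speech recognition service.
+
+- ### Step 1: Start PaddleHub Serving
+
+  - ```shell
+    $ hub serving start -m deepspeech2_librispeech
+    ```
+
+  - This deploys a speech recognition service API; the default port is 8866.
+
+  - **NOTE:** To run prediction on a GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise it does not need to be set.
+
+- ### Step 2: Send a prediction request
+
+  - With the server configured, the following few lines of code send a prediction request and fetch the result: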
+
+  - ```python
+    import requests
+    import json
+
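+    # Path to the audio file to recognize; make sure the machine running the service can access it
+    file = '/path/to/input.wav'
+
+    # Pass arguments to the prediction method by key; here the key is "audio_file"
+    data = {"audio_file": file}
+
+    # Send a POST request with a JSON content type; change the IP address in the URL to that of the serving machine
+    url = "http://127.0.0.1:8866/predict/deepspeech2_librispeech"
+
+    # Set the POST request headers to application/json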
+    headers = {"Content-Type": "application/json"}
+
+    r = requests.post(url=url, headers=headers, data=json.dumps(data))
+    print(r.json())
+    ```
+
+## 5. Release history
+
+* 1.0.0
+
+  Initial release
+
+  ```shell
+  $ hub install deepspeech2_librispeech
+  ```
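The serving example in the README above depends on the third-party `requests` package. As an alternative sketch, the same POST request can be built with only the standard library. The helper `build_predict_request` is hypothetical; the URL pattern, port, and `audio_file` key are taken from the README. The request is only constructed here, since sending it requires a running server.

```python
import json
import urllib.request


def build_predict_request(audio_file, module_name, host="127.0.0.1", port=8866):
    """Build (but do not send) a JSON POST request for PaddleHub Serving."""
    url = f"http://{host}:{port}/predict/{module_name}"
    body = json.dumps({"audio_file": audio_file}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_predict_request("/path/to/input.wav", "deepspeech2_librispeech")
print(req.get_method(), req.full_url)

# To actually send it (requires `hub serving start -m deepspeech2_librispeech`):
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.loads(resp.read().decode("utf-8")))
```

Note that the path in `audio_file` is resolved on the server side, so it must be valid on the machine running the service.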
--- a/modules/audio/asr/deepspeech2_librispeech/__init__.py
+++ b/modules/audio/asr/deepspeech2_librispeech/__init__.py
--- a/modules/audio/asr/deepspeech2_librispeech/assets/conf/augmentation.json
+++ b/modules/audio/asr/deepspeech2_librispeech/assets/conf/augmentation.json
+{}
--- a/modules/audio/asr/deepspeech2_librispeech/assets/conf/deepspeech2.yaml
+++ b/modules/audio/asr/deepspeech2_librispeech/assets/conf/deepspeech2.yaml
+# https://yaml.org/type/float.html
+data:
+  train_manifest: data/manifest.train
+  dev_manifest: data/manifest.dev-clean
+  test_manifest: data/manifest.test-clean
+  min_input_len: 0.0
+  max_input_len: 30.0 # second
+  min_output_len: 0.0
+  max_output_len: .inf
+  min_output_input_ratio: 0.00
+  max_output_input_ratio: .inf
+
+collator:
+  batch_size: 20
+  mean_std_filepath: data/mean_std.json
+  unit_type: char
+  vocab_filepath: data/vocab.txt
+  augmentation_config: conf/augmentation.json
+  random_seed: 0
+  spm_model_prefix:
+  spectrum_type: linear
+  target_sample_rate: 16000
+  max_freq: None
+  n_fft: None
+  stride_ms: 10.0
+  window_ms: 20.0
+  delta_delta: False
+  dither: 1.0
+  use_dB_normalization: True
+  target_dB: -20
+  keep_transcription_text: False
+  sortagrad: True
+  shuffle_method: batch_shuffle
+  num_workers: 2
+
+model:
+  num_conv_layers: 2
+  num_rnn_layers: 3
+  rnn_layer_size: 2048
+  use_gru: False
+  share_rnn_weights: True
+  blank_id: 0
+  ctc_grad_norm_type: instance
+
+training:
+  n_epoch: 50
+  accum_grad: 1
+  lr: 1e-3
+  lr_decay: 0.83
+  weight_decay: 1e-06
+  global_grad_clip: 5.0
+  log_interval: 100
+  checkpoint:
+    kbest_n: 50
+    latest_n: 5
+
+decoding:
+  batch_size: 128
+  error_rate_type: wer
+  decoding_method: ctc_beam_search
+  lang_model_path: data/lm/common_crawl_00.prune01111.trie.klm
+  alpha: 1.9
+  beta: 0.3
+  beam_size: 500
+  cutoff_prob: 1.0
+  cutoff_top_n: 40
+  num_proc_bsearch: 8
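The `alpha` and `beta` values in the `decoding` section weight the language model during CTC beam search. Following the scoring function in the Deep Speech 2 paper, a hypothesis is ranked by its CTC log-probability plus `alpha` times its language model log-probability plus `beta` times its word count. The sketch below illustrates that formula with made-up log-probabilities; `rescored` is a hypothetical helper, and the actual `swig_decoders` implementation applies these terms incrementally during the search.

```python
def rescored(log_p_ctc, log_p_lm, word_count, alpha, beta):
    """Combined hypothesis score: CTC + alpha * LM + beta * word count."""
    return log_p_ctc + alpha * log_p_lm + beta * word_count


ALPHA, BETA = 1.9, 0.3  # values from the decoding section above

# Two illustrative three-word hypotheses: (CTC log-prob, LM log-prob, words).
hyp_a = (-10.0, -5.0, 3)
hyp_b = (-9.5, -8.0, 3)

score_a = rescored(*hyp_a, alpha=ALPHA, beta=BETA)
score_b = rescored(*hyp_b, alpha=ALPHA, beta=BETA)
best = "a" if score_a > score_b else "b"
print(score_a, score_b, best)
```

Here hypothesis `b` has the better acoustic score, but the language model term makes `a` the preferred output.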
--- a/modules/audio/asr/deepspeech2_librispeech/deepspeech_tester.py
+++ b/modules/audio/asr/deepspeech2_librispeech/deepspeech_tester.py
--- a/modules/audio/asr/deepspeech2_librispeech/module.py
+++ b/modules/audio/asr/deepspeech2_librispeech/module.py
--- a/modules/audio/asr/deepspeech2_librispeech/requirements.txt
+++ b/modules/audio/asr/deepspeech2_librispeech/requirements.txt
+loguru
+yacs
+jsonlines
+scipy==1.2.1
+sentencepiece
+resampy==0.2.2
+SoundFile==0.9.0.post1
+soxbindings
+kaldiio
+typeguard
+editdistance
--- a/modules/audio/asr/u2_conformer_aishell/README.md
+++ b/modules/audio/asr/u2_conformer_aishell/README.md
--- a/modules/audio/asr/u2_conformer_aishell/__init__.py
+++ b/modules/audio/asr/u2_conformer_aishell/__init__.py
--- a/modules/audio/asr/u2_conformer_aishell/assets/conf/augmentation.json
+++ b/modules/audio/asr/u2_conformer_aishell/assets/conf/augmentation.json
+{}
--- a/modules/audio/asr/u2_conformer_aishell/assets/conf/conformer.yaml
+++ b/modules/audio/asr/u2_conformer_aishell/assets/conf/conformer.yaml
--- a/modules/audio/asr/u2_conformer_aishell/assets/data/mean_std.json
+++ b/modules/audio/asr/u2_conformer_aishell/assets/data/mean_std.json
+{"mean_stat": [533749178.75492024, 537379151.9412827, 553560684.251823, 587164297.7995199, 631868827.5506272, 662598279.7375823, 684377628.7270963, 695391900.076011, 692470493.5234187, 679434068.1698124, 666124153.9164762, 656323498.7897255, 665750586.0282139, 678693518.7836165, 681921713.5434498, 679622373.0941861, 669891550.4909347, 656595089.7941492, 653838531.0994304, 637678601.7858486, 628412248.7348012, 644835299.462052, 638840698.1892803, 646181879.4332589, 639724189.2981818, 642757470.3933163, 637471382.8647255, 642368839.4687729, 643414999.4559816, 647384269.1630985, 649348352.9727564, 649293860.0141628, 650234047.7200857, 654485430.6703687, 660474314.9996675, 667417041.2224753, 673157601.3226709, 675674470.304284, 675124085.6890339, 668017589.4583111, 670061307.6169846, 662625614.6886193, 663144526.4351237, 662504003.7634674, 666413530.1149732, 672263295.5639057, 678483738.2530766, 685387098.3034457, 692570857.529439, 699066050.4399202, 700784878.5879861, 701201520.50868, 702666292.305144, 705443439.2278953, 706070270.9023902, 705988909.8337733, 702843339.0362502, 699318566.4701376, 696089900.3030818, 687559674.541517, 675279201.9502573, 663676352.2301354, 662963751.7464145, 664300133.8414352, 666095384.4212626, 671682092.7777623, 676652386.6696675, 680097668.2490273, 683810023.0071762, 688701544.3655603, 692082724.9923568, 695788849.6782106, 701085780.0070009, 706389529.7959046, 711492753.1344281, 717637923.73355, 719691678.2081754, 715810733.4964175, 696362890.4862831, 604649423.9932467], "var_stat": [5413314850.92017, 5559847287.933615, 6150990253.613769, 6921242242.585692, 7999776708.347419, 8789877370.390867, 9405801233.462742, 9768050110.323652, 9759783206.942099, 9430647265.679018, 9090547056.72849, 8873147345.425886, 9155912918.518642, 9542539953.84679, 9653547618.806402, 9593434792.936714, 9316633026.420147, 8959273999.588833, 8863548125.445953, 8450615911.730164, 8211598033.615433, 8587083872.162145, 8432613574.987708, 8583943640.722399, 
8401731458.393406, 8439359231.367369, 8293779802.711447, 8401506934.147289, 8427506949.839874, 8525176341.071184, 8577080109.482346, 8575106681.347283, 8594987363.896849, 8701703698.13697, 8854967559.695303, 9029484499.828356, 9168774993.437275, 9221457044.693224, 9194525496.858181, 8997085233.031223, 9024585998.805922, 8819398159.92156, 8807895653.788486, 8777245867.886335, 8869681168.825321, 9017397167.041729, 9173402827.38027, 9345595113.30765, 9530638054.282673, 9701241750.610865, 9749002220.142677, 9762753891.356327, 9802020174.527405, 9874432300.977995, 9883303068.689241, 9873499335.610315, 9780680890.924107, 9672603363.913414, 9569436761.47915, 9321842521.985804, 8968140697.297707, 8646348638.918655, 8616965457.523136, 8648620220.395298, 8702086138.675117, 8859213220.99842, 8999405313.087536, 9105949447.399998, 9220413227.016796, 9358601578.269663, 9451405873.00428, 9552727080.824707, 9695443509.54488, 9836687193.669691, 9970962418.410656, 10135881535.317768, 10189390919.400673, 10070483257.345238, 9532953296.22076, 7261219636.045063], "frame_num": 54068199}
--- a/modules/audio/asr/u2_conformer_aishell/assets/data/vocab.txt
+++ b/modules/audio/asr/u2_conformer_aishell/assets/data/vocab.txt
--- a/modules/audio/asr/u2_conformer_aishell/module.py
+++ b/modules/audio/asr/u2_conformer_aishell/module.py
--- a/modules/audio/asr/u2_conformer_aishell/requirements.txt
+++ b/modules/audio/asr/u2_conformer_aishell/requirements.txt
+loguru
+yacs
+jsonlines
+scipy==1.2.1
+sentencepiece
+resampy==0.2.2
+SoundFile==0.9.0.post1
+soxbindings
+kaldiio
+typeguard
+editdistance
+textgrid
--- a/modules/audio/asr/u2_conformer_aishell/u2_conformer_tester.py
+++ b/modules/audio/asr/u2_conformer_aishell/u2_conformer_tester.py
--- a/modules/audio/asr/u2_conformer_librispeech/README.md
+++ b/modules/audio/asr/u2_conformer_librispeech/README.md
--- a/modules/audio/asr/u2_conformer_librispeech/__init__.py
+++ b/modules/audio/asr/u2_conformer_librispeech/__init__.py
--- a/modules/audio/asr/u2_conformer_librispeech/assets/conf/augmentation.json
+++ b/modules/audio/asr/u2_conformer_librispeech/assets/conf/augmentation.json
+{}
--- a/modules/audio/asr/u2_conformer_librispeech/assets/conf/conformer.yaml
+++ b/modules/audio/asr/u2_conformer_librispeech/assets/conf/conformer.yaml
--- a/modules/audio/asr/u2_conformer_librispeech/assets/data/bpe_unigram_5000.model
+++ b/modules/audio/asr/u2_conformer_librispeech/assets/data/bpe_unigram_5000.model
--- a/modules/audio/asr/u2_conformer_librispeech/assets/data/bpe_unigram_5000.vocab
+++ b/modules/audio/asr/u2_conformer_librispeech/assets/data/bpe_unigram_5000.vocab
--- a/modules/audio/asr/u2_conformer_librispeech/assets/data/mean_std.json
+++ b/modules/audio/asr/u2_conformer_librispeech/assets/data/mean_std.json
+{"mean_stat": [3419817384.9589553, 3554070049.1888413, 3818511309.9166613, 4066044518.3850017, 4291564631.2871633, 4447813845.146345, 4533096457.680424, 4535743891.989957, 4529762966.952207, 4506798370.255702, 4563810141.721841, 4621582319.277632, 4717208210.814803, 4782916961.295261, 4800534153.252695, 4816978042.979026, 4813370098.242317, 4783029495.131413, 4797780594.144404, 4697681126.278327, 4615891408.325888, 4660549391.6024275, 4576180438.146472, 4609080513.250168, 4575296489.058092, 4602504837.872262, 4568039825.650208, 4596829549.204861, 4590634987.343898, 4604371982.549804, 4623782318.317643, 4643582410.8842745, 4681460771.788484, 4759470876.31175, 4808639788.683043, 4828470941.416027, 4868984035.113543, 4906503986.801533, 4945995579.443381, 4936645225.986488, 4975902400.919519, 4960230208.656678, 4986734786.199859, 4983472199.8246765, 5002204376.162232, 5030432036.352981, 5060386169.086892, 5093482058.577236, 5118330657.308789, 5137270836.326198, 5140137363.319094, 5144296534.330122, 5158812605.654329, 5166263515.51458, 5156261604.282723, 5155820011.532965, 5154511256.8968, 5152063882.193671, 5153425524.412178, 5149000486.683038, 5154587156.35868, 5134412165.07972, 5092874838.792056, 5062281231.5140915, 5029059442.072953, 4996045017.917702, 4962203662.170533, 4928110046.282831, 4900476581.092096, 4881407033.533021, 4859626116.955097, 4851430742.3865795, 4850317443.454599, 4848197040.155383, 4837178106.464577, 4818448202.7298765, 4803345264.527405, 4765785994.104498, 4735296707.352132, 4699957946.40757], "var_stat": [39487786239.20539, 42865198005.60155, 49718916704.468704, 55953639455.490585, 62156293826.00315, 66738657819.12445, 69416921986.47835, 69657873431.17258, 69240303799.53061, 68286972351.43054, 69718367152.18843, 71405427710.7103, 74174200331.87572, 76047347951.43869, 76478048614.40665, 76810929560.19212, 76540466184.85634, 75538479521.34026, 75775624554.07217, 72775991318.16557, 70350402972.93352, 71358602366.48341, 68872845697.9878, 
69552396791.49916, 68471390455.59991, 69022047288.07498, 67982260910.11236, 68656154716.71916, 68461419064.9241, 68795285460.65717, 69270474608.52791, 69754495937.76433, 70596044579.14969, 72207936275.97945, 73629619360.65047, 74746445259.57487, 75925168496.81197, 76973508692.04265, 78074337163.3413, 77765963787.96971, 78839167623.49733, 78328768943.2287, 79016127287.03778, 78922638306.99306, 79489768324.9408, 80354861037.44005, 81311991408.12526, 82368205917.26112, 83134782296.1741, 83667769421.23245, 83673751953.46239, 83806087685.62842, 84193971202.07523, 84424752763.34825, 84092846117.64104, 84039114093.08766, 83982515225.7085, 83909645482.75613, 83947278563.15077, 83800767707.19617, 83851106027.8772, 83089292432.37892, 82056425825.3622, 81138570746.92316, 80131843258.75557, 79130160837.19037, 78092166878.71533, 77104785522.79205, 76308548392.10454, 75709445890.58063, 75084778641.6033, 74795849006.19067, 74725807683.832, 74645651838.2169, 74300193368.39339, 73696619147.86806, 73212785808.97992, 72240491743.0697, 71420246227.32545, 70457076435.4593], "frame_num": 345484372}
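Reviewer note on `mean_std.json`: the file appears to hold accumulated statistics rather than ready-to-use normalization values. The following is a minimal sketch of how per-dimension mean and standard deviation could be recovered from it. It assumes that `mean_stat` is the per-dimension sum of the features, `var_stat` is the per-dimension sum of squares, and `frame_num` is the number of accumulated frames. That reading is inferred from the field names and magnitudes and is not confirmed by this diff. `cmvn_from_stats` and `eps` are hypothetical names used for illustration; they are not part of the module.

```python
import json
import math


def cmvn_from_stats(stats, eps=1e-20):
    """Hypothetical helper: derive per-dimension mean and std.

    Assumes `mean_stat` is a sum of features and `var_stat` a sum of
    squared features, both accumulated over `frame_num` frames.
    """
    n = stats["frame_num"]
    means, stds = [], []
    for total, total_sq in zip(stats["mean_stat"], stats["var_stat"]):
        mean = total / n
        # E[x^2] - E[x]^2, floored at eps to avoid a negative variance
        var = max(total_sq / n - mean * mean, eps)
        means.append(mean)
        stds.append(math.sqrt(var))
    return means, stds


# Toy statistics for the two frames [1.0, 2.0] and [3.0, 6.0]
toy = json.loads(
    '{"mean_stat": [4.0, 8.0], "var_stat": [10.0, 40.0], "frame_num": 2}'
)
means, stds = cmvn_from_stats(toy)
```

Under this reading, the first dimension of the LibriSpeech statistics above works out to a mean of roughly 9.9 and a standard deviation of roughly 4.0, which is plausible for log filterbank features.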
--- a/modules/audio/asr/u2_conformer_librispeech/assets/data/vocab.txt
+++ b/modules/audio/asr/u2_conformer_librispeech/assets/data/vocab.txt
--- a/modules/audio/asr/u2_conformer_librispeech/module.py
+++ b/modules/audio/asr/u2_conformer_librispeech/module.py
--- a/modules/audio/asr/u2_conformer_librispeech/requirements.txt
+++ b/modules/audio/asr/u2_conformer_librispeech/requirements.txt
+loguru
+yacs
+jsonlines
+scipy==1.2.1
+sentencepiece
+resampy==0.2.2
+SoundFile==0.9.0.post1
+soxbindings
+kaldiio
+typeguard
+editdistance
+textgrid
--- a/modules/audio/asr/u2_conformer_librispeech/u2_conformer_tester.py
+++ b/modules/audio/asr/u2_conformer_librispeech/u2_conformer_tester.py
--- a/modules/audio/audio_classification/PANNs/cnn10/README.md
+++ b/modules/audio/audio_classification/PANNs/cnn10/README.md
--- a/modules/audio/audio_classification/PANNs/cnn14/README.md
+++ b/modules/audio/audio_classification/PANNs/cnn14/README.md
--- a/modules/audio/audio_classification/PANNs/cnn6/README.md
+++ b/modules/audio/audio_classification/PANNs/cnn6/README.md
--- a/modules/audio/tts/fastspeech2_baker/README.md
+++ b/modules/audio/tts/fastspeech2_baker/README.md
--- a/modules/audio/tts/fastspeech2_baker/__init__.py
+++ b/modules/audio/tts/fastspeech2_baker/__init__.py
--- a/modules/audio/tts/fastspeech2_baker/assets/fastspeech2_nosil_baker_ckpt_0.4/default.yaml
+++ b/modules/audio/tts/fastspeech2_baker/assets/fastspeech2_nosil_baker_ckpt_0.4/default.yaml
--- a/modules/audio/tts/fastspeech2_baker/assets/fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
+++ b/modules/audio/tts/fastspeech2_baker/assets/fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
--- a/modules/audio/tts/fastspeech2_baker/assets/pwg_baker_ckpt_0.4/pwg_default.yaml
+++ b/modules/audio/tts/fastspeech2_baker/assets/pwg_baker_ckpt_0.4/pwg_default.yaml
--- a/modules/audio/tts/fastspeech2_baker/module.py
+++ b/modules/audio/tts/fastspeech2_baker/module.py
--- a/modules/audio/tts/fastspeech2_baker/requirements.txt
+++ b/modules/audio/tts/fastspeech2_baker/requirements.txt
+git+https://github.com/PaddlePaddle/Parakeet@8040cb0#egg=paddle-parakeet
--- a/modules/audio/tts/fastspeech2_ljspeech/README.md
+++ b/modules/audio/tts/fastspeech2_ljspeech/README.md
--- a/modules/audio/tts/fastspeech2_ljspeech/__init__.py
+++ b/modules/audio/tts/fastspeech2_ljspeech/__init__.py
--- a/modules/audio/tts/fastspeech2_ljspeech/assets/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml
+++ b/modules/audio/tts/fastspeech2_ljspeech/assets/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml
--- a/modules/audio/tts/fastspeech2_ljspeech/assets/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt
+++ b/modules/audio/tts/fastspeech2_ljspeech/assets/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt
--- a/modules/audio/tts/fastspeech2_ljspeech/assets/pwg_ljspeech_ckpt_0.5/pwg_default.yaml
+++ b/modules/audio/tts/fastspeech2_ljspeech/assets/pwg_ljspeech_ckpt_0.5/pwg_default.yaml
--- a/modules/audio/tts/fastspeech2_ljspeech/module.py
+++ b/modules/audio/tts/fastspeech2_ljspeech/module.py
--- a/modules/audio/tts/fastspeech2_ljspeech/requirements.txt
+++ b/modules/audio/tts/fastspeech2_ljspeech/requirements.txt
+git+https://github.com/PaddlePaddle/Parakeet@8040cb0#egg=paddle-parakeet