{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# 1. Introduction\n",
    "## 1.1 Overview\n",
    "PP-ASR provides a tool for ASR functions. It provides a variety models in Chinese and English, supports model training, and supports model inference using the command line. PP-ASR also supports the deployment of streaming models and the deployment of personalized scenarios. PP-ASR supports multiple pre-training models: [released_model](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/released_model.md). The better model is Conformer model which supports streaming ASR.\n",
    "\n",
    "## 1.2 Features\n",
    "The basic process of speech recognition is shown in the following figure: \n",
    "<div align=center>\n",
    "<img src=\"https://user-images.githubusercontent.com/87408988/168259962-cbe2008b-47b6-443d-9566-d77a5ca2eb25.png\"/>\n",
    "<br>\n",
    "</div>\n",
    "<br></br>\n",
    "The main features of PP-ASR are as follows:\n",
    "\n",
    "1. Provide pre-trained models on Chinese/English open source datasets: aishell1(Chinese), wenetspeech (Chinese), librispeech (English). The model includes the deepspeech2 model and the conformer/transformer model.\n",
    "2. Support Chinese/English model training function.\n",
    "3. Support command line model inference, you can use paddlespeech asr --model xxx --input xxx.wav to call each pre-trained model for inference.\n",
    "4. Support the service deployment of streaming ASR, and also support the output of timestamps.\n",
    "5. Support the deployment of personalized scenarios.\n",
    "\n",
    "Welcome to [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/paddlespeech) for more experience!\n",
    "\n",
    "\n",
    "# 2. Model Representation and Application Scenarios\n",
    "## 2.1 Streaming Speech Recognition Task\n",
    "\n",
    "Automatic Speech Recognition (ASR) is a task to extract language and text content from a piece of speech. The streaming speech recognition is that the user segments a whole speech and inputs it in streaming mode, and finally gets the recognition result.\n",
    "\n",
    "The real-time speech recognition engine can simultaneously extract and decode the features of the segmented input speech without waiting for all the data to be obtained. Therefore, after the last speech, the final recognition result can be returned only after a short delay (that is, the time to wait for processing the last speech segment and obtaining the final result). This streaming input mode can shorten the overall time to obtain the final results, and greatly improve the user experience.\n",
    "\n",
    "## 2.2 Application Scenario\n",
    "1. Human–Computer Interaction/Speech Input    \n",
    "Streaming speech recognition can generate text in real time when users speak, speeding up the feedback speed of machines to people, and improving the user experience.\n",
    "\n",
    "\n",
    "<div align=center>\n",
    "<img src=\"https://ai-studio-static-online.cdn.bcebos.com/6a68196417234818b3241616a1649741eef4f919c67141d9b9ad371780d110a8\" height=50%, width=50%/>\n",
    "<br>\n",
    "  (Baidu smart audio: https://dumall.baidu.com/)\n",
    "</div>\n",
    "\n",
    "  \n",
    "2. Real Time Subtitles/Meeting Minutes    \n",
    "In the meeting scene, speak while transcribing the text.\n",
    "Convert audio information of meetings, court trials, interviews and other scenes into text, which is realized by real-time speech recognition services, reducing manual recording costs and improving efficiency.\n",
    "\n",
    "<div align=center>\n",
    "<img src=\"https://ai-studio-static-online.cdn.bcebos.com/546271f5bad341acb208d3d497874028da5a664e9e1e460eb61af6a742e89aeb\" height=70%, width=70%/>\n",
    "<br>\n",
    "(Baidu Intelligent Conference System: One Finger Zen)\n",
    "</div>\n",
    "\n",
    "3. Simultaneous Translation\n",
    "When the machine performs simultaneous translation, the machine needs to be able to recognize the user's speech content in real time, so as to translate the speech content into other languages in real time through the translation module.   \n",
    "\n",
    "<div align=center>\n",
    "<img href=\"https://infoflow.baidu.com/audio-video/#/\" src=\"https://ai-studio-static-online.cdn.bcebos.com/7472f6f976e94e3288dacb0a8bffd9a824f31e392e48496d830f5f11626c0851\" height=50%, width=50%/>\n",
    "<br>\n",
    "  (Ruliu: intelligent conference https://infoflow.baidu.com/audio-video/#/)\n",
    "</div>\n",
    "\n",
    "4. Telephone Quality Inspection    \n",
    "Turn the call into text, which is realized by real-time speech recognition service or recording file recognition service, to comprehensively cover the quality inspection content and improve the quality inspection efficiency.    \n",
    "\n",
    "<div align=center>\n",
    "<img src=\"https://ai-studio-static-online.cdn.bcebos.com/cbd0af3553ff4b8891bb6239069ad76d95bbc36fb98444378a3b3d716eb1fbcb\" height=40%, width=40%/>\n",
    "</div>\n",
    "\n",
    "5. Speech Message Transfer    \n",
    "Turn the user's speech information into text information, which is realized by one sentence recognition service, and improve the user's reading efficiency.    \n",
    "\n",
    "## 2.3 Datasets\n",
    "The model uses the 10000 hour multi-domain Chinese speech recognition dataset wenetspeech。\n",
    "\n",
    "## 2.4 Demonstration\n",
    "The effect of using asr server on the webpage is shown as follows:[streaming_asr_demo_video](https://paddlespeech.readthedocs.io/en/latest/streaming_asr_demo_video.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Model Usage\n",
    "## 3.1 Model Inference\n",
    "### install paddlespeech"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install paddlespeech"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### download test audio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### inference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from paddlespeech.cli.asr.infer import ASRExecutor\n",
    "audio = \"zh.wav\"\n",
    "asr = ASRExecutor()\n",
    "result = asr(audio_file=audio)\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2 model training\n",
    "[Streaming conformer training based on wenetspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/wenetspeech/asr1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# 4. Principle of Streaming Conformer Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## 4.1 Confomer Model Structure\n",
    "\n",
    "<div align=center>\n",
    "<img src=\"https://ai-studio-static-online.cdn.bcebos.com/0fc40fc45a8f4046beea14eb69cfc1eee52196d9db974442a4c4df8007f8d70d\" height=1200, width=800 />\n",
    "<br>\n",
    "</div>\n",
    "  \n",
    "  \n",
    "Conformer is mainly composed of Encoder and Decoder. The overall model structure is very similar to Transformer.  \n",
    "Conformer and Transformer have the same Decoder, with two main differences:  \n",
    "1. The Encoder of the Conformer contains the conv module. The conv module consists of five parts: pointwise conv, GLU layer, Depthwith conv, RELU layer, and the second pointwise conv layer. \n",
    "2. The Encoder of Conform uses two layers of FeedForward, which are located at the head and tail of each layer of encoder respectively. The weight of each layer output is set to 0.5, which is similar to the structure of a hamburger as a whole.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "\n",
    "\n",
    "## 4.2 Streaming Conformer\n",
    "Streaming decoding is mainly divided into two steps:\n",
    "1. While speaking: use CTC prefix beam search to decode\n",
    "2. End of speech: use CTC prefix beam search + attention_rescoring to decode. attention_rescoring mainly uses decoder to re-score ctc results, thus changing the candidate ranking of whole sentence ctc results.\n",
    "\n",
    "<div align=center>\n",
    "<img src=\"https://ai-studio-static-online.cdn.bcebos.com/c37339dbaf5c4c20a67b76d88c6730bb1cd93fc7f71b4179982f42365b969f49\" height=1200, width=800 />\n",
    "<br>(image from \"Chao Yang http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html\" )\n",
    "</div>\n",
    "\n",
    "\n",
    "Therefore, the core of streaming decoding lies in supporting streaming CTC prefix beam search, and streaming CTC prefix beam search lies in training an encoder that can support streaming.\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "### 4.2.1 Point 1: Causal convolution to avoid high delay\n",
    "For traditional convolution networks, if many layers of convolution are used, each step of the network output will rely heavily on the multiple frames after the current step, thus increasing the delay of the streaming model. However, there are a large number of conv layers in the conformer model. Therefore, if traditional convolution is used, the delay of the streaming conformer model will be large.  \n",
    "In order to solve this problem, stream conformer uses causal convolution. The output of each step of causal convolution will only depend on the previous time point, not the subsequent time point, similar to the RNN structure of convolution implementation. Thus, the high delay of the conformer model is avoided.\n",
    "<div align=center>\n",
    "<img src=\"https://ai-studio-static-online.cdn.bcebos.com/e77dddf4e0514724b3f24e9f6931aaf1054ebf0b4c1348b59aee6d3a13f833fe\" height=800, width=500 />\n",
    "<br>(image from \"Bai S, Kolter J Z, Koltun V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling\" )\n",
    "</div>\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "### 4.2.2 Point 2: Attention with mask \n",
    "\n",
    "The main challenge to implement streaming encoder is that the attention structure of the conformer usually uses global information, as shown in the first sub figure in the following figure, so streaming cannot be implemented. In order to solve this problem, streaming conformer will limit the scope of attention during training. \n",
    "The main strategies for the scope of attention are shown in the following figure:\n",
    "\n",
    "<div align=center>\n",
    "<img src=\"https://ai-studio-static-online.cdn.bcebos.com/5a8cecc5d0b54898bd9ee4d4573433de992d68a234de418daaa02e6f80289b46\" height=1200, width=800 />\n",
    "<br>(image from \"Chao Yang http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html\" )\n",
    "</div>\n",
    "\n",
    "In order to use speech context information as much as possible, we generally use the third type of attention scope. \n",
    "In the training process, in order to enhance the robustness of the model, and also make the model applicable to a variety of chunk sizes in the decoding process, for each batch of data, the random chunk size will be used for training. \n",
    "In the decoding process, we use a fixed chunk size for decoding. \n",
    "\n",
    "### 4.2.3 Point 3: Cache\n",
    "In the process of decoding, the conformer will use cache to reduce redundant computation.  \n",
    "The cache of the conformer encoder is mainly divided into three parts: \n",
    "1. subsampling_cache  \n",
    "2. conformer_cnn_cache \n",
    "3. elayers_output_cache \n",
    "```\n",
    "\t\t# Feed forward overlap input step by step\n",
    "        for cur in range(0, num_frames - context + 1, stride):\n",
    "            end = min(cur + decoding_window, num_frames)\n",
    "            chunk_xs = xs[:, cur:end, :]\n",
    "            (y, subsampling_cache, elayers_output_cache,\n",
    "             conformer_cnn_cache) = self.forward_chunk(\n",
    "                 chunk_xs, offset, required_cache_size, subsampling_cache,\n",
    "                 elayers_output_cache, conformer_cnn_cache)\n",
    "            outputs.append(y)\n",
    "            offset += y.shape[1]\n",
    "        ys = paddle.cat(outputs, 1)\n",
    "```\n",
    "\n",
    "<div align=center>\n",
    "<img src=\"https://ai-studio-static-online.cdn.bcebos.com/a8e0ff53e2b54fbfbc6f8715dfcba8a50d05b13228eb4ef598a0445336dd3a03\" height=1200, width=800 />\n",
    "<br>(image from \"Chao Yang http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html\" )\n",
    "</div>\n",
    "\n",
    "\n",
    "\n",
    "1. subsampling cache: [paddle.Tensor]     \n",
    "The output cache of subsampling is the input of the first conformer block. It is used to cache the results of input features after passing through the subsampling module, while the current input chunk and subsampling cache are combined as the input of the conformer encoder. The subsampling module used by the conformer is mainly composed of two layers of cnn and one layer of linear.  \n",
    " \n",
    "2. conformer_cnn_cache: List[paddle.Tensor]   \n",
    "It mainly stores the input of the conv module in each conformer block. Because the conv module depends on the previous frame information, it needs to cache the previous input to save computing time.  \n",
    "\n",
    "3. layers_output_cache: List[paddle.Tensor]    \n",
    "It mainly stores the historical output of the current conformer block, so that it can be spliced with the output of the current conformer block as the input of the next conformer block.  \n",
    "\n",
    "A non streaming conformer model can be transformed into a streaming conformer model by combining the above three points."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Note"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# 6. Reference       \n",
    "[1] Chao Yang. http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html    \n",
    "[2] Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition[J]. arXiv preprint arXiv:2005.08100, 2020.    \n",
    "[3] Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd international conference on Machine learning. 2006: 369-376.    "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "py35-paddle1.2.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}