- en: Speech Recognition with Wav2Vec2
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 使用Wav2Vec2进行语音识别
- en: 原文:[https://pytorch.org/audio/stable/tutorials/speech_recognition_pipeline_tutorial.html](https://pytorch.org/audio/stable/tutorials/speech_recognition_pipeline_tutorial.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/audio/stable/tutorials/speech_recognition_pipeline_tutorial.html](https://pytorch.org/audio/stable/tutorials/speech_recognition_pipeline_tutorial.html)
- en: Note
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: Click [here](#sphx-glr-download-tutorials-speech-recognition-pipeline-tutorial-py)
to download the full example code
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 点击[这里](#sphx-glr-download-tutorials-speech-recognition-pipeline-tutorial-py)下载完整示例代码
- en: '**Author**: [Moto Hira](mailto:moto%40meta.com)'
id: totrans-4
prefs: []
type: TYPE_NORMAL
zh: '**作者**:[Moto Hira](mailto:moto%40meta.com)'
- en: This tutorial shows how to perform speech recognition using pre-trained
models from wav2vec 2.0 [[paper](https://arxiv.org/abs/2006.11477)].
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: 本教程展示了如何使用来自wav2vec 2.0的预训练模型执行语音识别[[论文](https://arxiv.org/abs/2006.11477)]。
- en: Overview[](#overview "Permalink to this heading")
id: totrans-6
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 概述[](#overview "跳转到此标题")
- en: The process of speech recognition looks like the following.
id: totrans-7
prefs: []
type: TYPE_NORMAL
zh: 语音识别的过程如下所示。
- en: Extract the acoustic features from audio waveform
id: totrans-8
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 从音频波形中提取声学特征
- en: Estimate the class of the acoustic features frame-by-frame
id: totrans-9
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 逐帧估计声学特征的类别
- en: Generate hypothesis from the sequence of the class probabilities
id: totrans-10
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 从类概率序列生成假设
- en: Torchaudio provides easy access to the pre-trained weights and associated information,
such as the expected sample rate and class labels. They are bundled together and
available under [`torchaudio.pipelines`](../pipelines.html#module-torchaudio.pipelines
"torchaudio.pipelines") module.
id: totrans-11
prefs: []
type: TYPE_NORMAL
zh: Torchaudio提供了对预训练权重和相关信息的简单访问,例如预期的采样率和类标签。它们被捆绑在一起,并在[`torchaudio.pipelines`](../pipelines.html#module-torchaudio.pipelines
"torchaudio.pipelines")模块下提供。
- en: Preparation[](#preparation "Permalink to this heading")
id: totrans-12
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 准备[](#preparation "跳转到此标题")
- en: '[PRE0]'
id: totrans-13
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
- en: '[PRE1]'
id: totrans-14
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
- en: '[PRE2]'
id: totrans-15
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
- en: '[PRE3]'
id: totrans-16
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
- en: Creating a pipeline[](#creating-a-pipeline "Permalink to this heading")
id: totrans-17
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 创建管道[](#creating-a-pipeline "跳转到此标题")
- en: First, we will create a Wav2Vec2 model that performs the feature extraction
and the classification.
id: totrans-18
prefs: []
type: TYPE_NORMAL
zh: 首先,我们将创建一个执行特征提取和分类的Wav2Vec2模型。
- en: 'There are two types of Wav2Vec2 pre-trained weights available in torchaudio:
ones fine-tuned for the ASR task, and ones that are not fine-tuned.'
id: totrans-19
prefs: []
type: TYPE_NORMAL
zh: torchaudio中有两种类型的Wav2Vec2预训练权重。一种是为ASR任务微调的,另一种是未经微调的。
- en: Wav2Vec2 (and HuBERT) models are trained in a self-supervised manner. They
are first trained on audio alone for representation learning, and then fine-tuned
for a specific task with additional labels.
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: Wav2Vec2(和HuBERT)模型以自监督方式进行训练。它们首先仅使用音频进行表示学习的训练,然后再使用附加标签进行特定任务的微调。
- en: The pre-trained weights without fine-tuning can be fine-tuned for other downstream
tasks as well, but this tutorial does not cover that.
id: totrans-21
prefs: []
type: TYPE_NORMAL
zh: 未经微调的预训练权重也可以用于其他下游任务的微调,但本教程不涵盖此内容。
- en: We will use [`torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H`](../generated/torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H.html#torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
"torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H") here.
id: totrans-22
prefs: []
type: TYPE_NORMAL
zh: 我们将在这里使用[`torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H`](../generated/torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H.html#torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
"torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H")。
- en: There are multiple pre-trained models available in [`torchaudio.pipelines`](../pipelines.html#module-torchaudio.pipelines
"torchaudio.pipelines"). Please check the documentation for the detail of how
they are trained.
id: totrans-23
prefs: []
type: TYPE_NORMAL
zh: '[`torchaudio.pipelines`](../pipelines.html#module-torchaudio.pipelines "torchaudio.pipelines")中有多个预训练模型可用。请查看文档以了解它们的训练方式的详细信息。'
- en: The bundle object provides the interface to instantiate the model and other
information. The sampling rate and the class labels are found as follows.
id: totrans-24
prefs: []
type: TYPE_NORMAL
zh: bundle对象提供了实例化模型和其他信息的接口。采样率和类标签如下所示。
- en: '[PRE4]'
id: totrans-25
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
- en: '[PRE5]'
id: totrans-26
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
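For reference, a minimal sketch of querying the bundle could look like the following (assuming a recent torchaudio):

```python
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H

print("Sample rate:", bundle.sample_rate)  # expected input rate, 16000 Hz
print("Labels:", bundle.get_labels())      # character set, including the CTC blank "-"
```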
- en: The model can be constructed as follows. This process will automatically fetch
the pre-trained weights and load them into the model.
id: totrans-27
prefs: []
type: TYPE_NORMAL
zh: 模型可以按以下方式构建。此过程将自动获取预训练权重并将其加载到模型中。
- en: '[PRE6]'
id: totrans-28
prefs: []
type: TYPE_PRE
zh: '[PRE6]'
- en: '[PRE7]'
id: totrans-29
prefs: []
type: TYPE_PRE
zh: '[PRE7]'
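A sketch of the construction step, reusing `bundle` from the previous sketch and a standard device check:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# get_model() downloads and caches the pre-trained weights on first use
model = bundle.get_model().to(device)
print(model.__class__)  # torchaudio.models.Wav2Vec2Model
```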
- en: Loading data[](#loading-data "Permalink to this heading")
id: totrans-30
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 加载数据[](#loading-data "跳转到此标题")
- en: We will use the speech data from [VOiCES dataset](https://iqtlabs.github.io/voices/),
which is licensed under Creative Commons BY 4.0.
id: totrans-31
prefs: []
type: TYPE_NORMAL
zh: 我们将使用[VOiCES数据集](https://iqtlabs.github.io/voices/)中的语音数据,该数据集在Creative Commons
BY 4.0下许可。
- en: '[PRE8]'
id: totrans-32
prefs: []
type: TYPE_PRE
zh: '[PRE8]'
- en: null
id: totrans-33
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-34
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: To load data, we use [`torchaudio.load()`](../generated/torchaudio.load.html#torchaudio.load
"torchaudio.load").
id: totrans-35
prefs: []
type: TYPE_NORMAL
zh: 为了加载数据,我们使用[`torchaudio.load()`](../generated/torchaudio.load.html#torchaudio.load
"torchaudio.load")。
- en: If the sampling rate is different from what the pipeline expects, then we can
use [`torchaudio.functional.resample()`](../generated/torchaudio.functional.resample.html#torchaudio.functional.resample
"torchaudio.functional.resample") for resampling.
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: 如果采样率与管道期望的不同,则可以使用[`torchaudio.functional.resample()`](../generated/torchaudio.functional.resample.html#torchaudio.functional.resample
"torchaudio.functional.resample")进行重采样。
- en: Note
id: totrans-37
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: '[`torchaudio.functional.resample()`](../generated/torchaudio.functional.resample.html#torchaudio.functional.resample
"torchaudio.functional.resample") works on CUDA tensors as well.'
id: totrans-38
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[`torchaudio.functional.resample()`](../generated/torchaudio.functional.resample.html#torchaudio.functional.resample
"torchaudio.functional.resample")也适用于CUDA张量。'
- en: When performing resampling multiple times on the same set of sample rates, using
[`torchaudio.transforms.Resample`](../generated/torchaudio.transforms.Resample.html#torchaudio.transforms.Resample
"torchaudio.transforms.Resample") might improve the performance.
id: totrans-39
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 在同一组采样率上多次执行重采样时,使用[`torchaudio.transforms.Resample`](../generated/torchaudio.transforms.Resample.html#torchaudio.transforms.Resample
"torchaudio.transforms.Resample")可能会提高性能。
- en: '[PRE9]'
id: totrans-40
prefs: []
type: TYPE_PRE
zh: '[PRE9]'
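A sketch of loading and resampling, assuming a hypothetical local file `speech.wav` and the `bundle` and `device` names from the sketches above:

```python
import torchaudio

SPEECH_FILE = "speech.wav"  # hypothetical path to the downloaded VOiCES sample

waveform, sample_rate = torchaudio.load(SPEECH_FILE)
waveform = waveform.to(device)

# Match the sample rate that the pipeline expects
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
```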
- en: Extracting acoustic features[](#extracting-acoustic-features "Permalink to this
heading")
id: totrans-41
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 提取声学特征[](#extracting-acoustic-features "跳转到此标题")
- en: The next step is to extract acoustic features from the audio.
id: totrans-42
prefs: []
type: TYPE_NORMAL
zh: 下一步是从音频中提取声学特征。
- en: Note
id: totrans-43
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: Wav2Vec2 models fine-tuned for ASR task can perform feature extraction and classification
with one step, but for the sake of the tutorial, we also show how to perform feature
extraction here.
id: totrans-44
prefs: []
type: TYPE_NORMAL
zh: 为ASR任务微调的Wav2Vec2模型可以一步完成特征提取和分类,但为了教程的目的,我们还展示了如何在此处执行特征提取。
- en: '[PRE10]'
id: totrans-45
prefs: []
type: TYPE_PRE
zh: '[PRE10]'
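A sketch of the feature-extraction step, assuming `model` and `waveform` from the sketches above:

```python
import torch

with torch.inference_mode():
    features, _ = model.extract_features(waveform)

# `features` is a list with one tensor per transformer layer,
# each of shape (batch, time_frames, feature_dim)
print(len(features), features[0].shape)
```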
- en: The returned features are a list of tensors. Each tensor is the output of a
transformer layer.
id: totrans-46
prefs: []
type: TYPE_NORMAL
zh: 返回的特征是一个张量列表。每个张量是一个变换器层的输出。
- en: '[PRE11]'
id: totrans-47
prefs: []
type: TYPE_PRE
zh: '[PRE11]'
- en: '![Feature from transformer layer 1, Feature from transformer layer 2, Feature
from transformer layer 3, Feature from transformer layer 4, Feature from transformer
layer 5, Feature from transformer layer 6, Feature from transformer layer 7, Feature
from transformer layer 8, Feature from transformer layer 9, Feature from transformer
layer 10, Feature from transformer layer 11, Feature from transformer layer 12](../Images/9f2d3410922166561ebdadfd4981e797.png)'
id: totrans-48
prefs: []
type: TYPE_IMG
zh: '![来自变换器层1的特征,来自变换器层2的特征,来自变换器层3的特征,来自变换器层4的特征,来自变换器层5的特征,来自变换器层6的特征,来自变换器层7的特征,来自变换器层8的特征,来自变换器层9的特征,来自变换器层10的特征,来自变换器层11的特征,来自变换器层12的特征](../Images/9f2d3410922166561ebdadfd4981e797.png)'
- en: Feature classification[](#feature-classification "Permalink to this heading")
id: totrans-49
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 特征分类[](#feature-classification "跳转到此标题")
- en: Once the acoustic features are extracted, the next step is to classify them
into a set of categories.
id: totrans-50
prefs: []
type: TYPE_NORMAL
zh: 一旦提取了声学特征,下一步就是将它们分类到一组类别中。
- en: The Wav2Vec2 model provides a method to perform feature extraction and classification
in one step.
id: totrans-51
prefs: []
type: TYPE_NORMAL
zh: Wav2Vec2模型提供了一种在一步中执行特征提取和分类的方法。
- en: '[PRE12]'
id: totrans-52
prefs: []
type: TYPE_PRE
zh: '[PRE12]'
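A sketch of the one-step path, under the same assumptions as above:

```python
import torch

with torch.inference_mode():
    emission, _ = model(waveform)

# `emission` holds per-frame logits of shape (batch, time_frames, num_labels)
print(emission.shape)
```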
- en: The output is in the form of logits, not probabilities.
id: totrans-53
prefs: []
type: TYPE_NORMAL
zh: 输出以logits的形式呈现,而不是概率的形式。
- en: Let’s visualize this.
id: totrans-54
prefs: []
type: TYPE_NORMAL
zh: 让我们可视化这个过程。
- en: '[PRE13]'
id: totrans-55
prefs: []
type: TYPE_PRE
zh: '[PRE13]'
- en: '![Classification result](../Images/ce8601d728900194dc8cb21fbd524cf7.png)'
id: totrans-56
prefs: []
type: TYPE_IMG
zh: '![分类结果](../Images/ce8601d728900194dc8cb21fbd524cf7.png)'
- en: '[PRE14]'
id: totrans-57
prefs: []
type: TYPE_PRE
zh: '[PRE14]'
- en: We can see that there are strong indications for certain labels across the
timeline.
id: totrans-58
prefs: []
type: TYPE_NORMAL
zh: 我们可以看到在时间线上有对某些标签的强烈指示。
- en: Generating transcripts[](#generating-transcripts "Permalink to this heading")
id: totrans-59
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 生成转录[](#generating-transcripts "跳转到此标题")
- en: From the sequence of label probabilities, now we want to generate transcripts.
The process to generate hypotheses is often called “decoding”.
id: totrans-60
prefs: []
type: TYPE_NORMAL
zh: 从标签概率序列中,现在我们想生成转录。生成假设的过程通常称为“解码”。
- en: Decoding is more elaborate than simple classification, because decoding at a
certain time step can be affected by surrounding observations.
id: totrans-61
prefs: []
type: TYPE_NORMAL
zh: 解码比简单分类更复杂,因为在某个时间步骤的解码可能会受到周围观察的影响。
- en: For example, take words like `night` and `knight`. Even if their prior probability
distributions are different (in typical conversations, `night` would occur way more
often than `knight`), to accurately generate transcripts with `knight`, such as
`a knight with a sword`, the decoding process has to postpone the final decision
until it sees enough context.
id: totrans-62
prefs: []
type: TYPE_NORMAL
zh: 例如,拿一个词像`night`和`knight`。即使它们的先验概率分布不同(在典型对话中,`night`会比`knight`发生得更频繁),为了准确生成带有`knight`的转录,比如`a
knight with a sword`,解码过程必须推迟最终决定,直到看到足够的上下文。
- en: There are many decoding techniques proposed, and they require external resources,
such as word dictionary and language models.
id: totrans-63
prefs: []
type: TYPE_NORMAL
zh: 有许多提出的解码技术,它们需要外部资源,如单词词典和语言模型。
- en: In this tutorial, for the sake of simplicity, we will perform greedy decoding,
which does not depend on such external components, and simply picks the best
hypothesis at each time step. Therefore, the context information is not used,
and only one transcript can be generated.
id: totrans-64
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,为了简单起见,我们将执行贪婪解码,它不依赖于外部组件,并且只在每个时间步骤选择最佳假设。因此,上下文信息未被使用,只能生成一个转录。
- en: We start by defining the greedy decoding algorithm.
id: totrans-65
prefs: []
type: TYPE_NORMAL
zh: 我们首先定义贪婪解码算法。
- en: '[PRE15]'
id: totrans-66
prefs: []
type: TYPE_PRE
zh: '[PRE15]'
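A sketch of a greedy CTC decoder consistent with the description above (argmax per frame, merge repeats, drop blanks); the tutorial's own definition may differ in details:

```python
import torch

class GreedyCTCDecoder(torch.nn.Module):
    def __init__(self, labels, blank=0):
        super().__init__()
        self.labels = labels
        self.blank = blank

    def forward(self, emission: torch.Tensor) -> str:
        """emission: (time_frames, num_labels) logits; returns the transcript."""
        indices = torch.argmax(emission, dim=-1)      # best label per frame
        indices = torch.unique_consecutive(indices)   # merge repeated symbols
        indices = [i for i in indices if i != self.blank]  # drop CTC blanks
        return "".join([self.labels[i] for i in indices])
```

Usage is then two lines: `decoder = GreedyCTCDecoder(labels=bundle.get_labels())` followed by `transcript = decoder(emission[0])`.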
- en: Now create the decoder object and decode the transcript.
id: totrans-67
prefs: []
type: TYPE_NORMAL
zh: 现在创建解码器对象并解码转录。
- en: '[PRE16]'
id: totrans-68
prefs: []
type: TYPE_PRE
zh: '[PRE16]'
- en: Let’s check the result and listen again to the audio.
id: totrans-69
prefs: []
type: TYPE_NORMAL
zh: 让我们检查结果并再次听音频。
- en: '[PRE17]'
id: totrans-70
prefs: []
type: TYPE_PRE
zh: '[PRE17]'
- en: '[PRE18]'
id: totrans-71
prefs: []
type: TYPE_PRE
zh: '[PRE18]'
- en: null
id: totrans-72
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-73
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: The ASR model is fine-tuned using a loss function called Connectionist Temporal
Classification (CTC). The detail of CTC loss is explained [here](https://distill.pub/2017/ctc/).
In CTC a blank token (ϵ) is a special token which represents a repetition of the
previous symbol. In decoding, these are simply ignored.
id: totrans-74
prefs: []
type: TYPE_NORMAL
zh: ASR模型使用一种称为连接主义时间分类(CTC)的损失函数进行微调。CTC损失的详细信息在[这里](https://distill.pub/2017/ctc/)有解释。在CTC中,空白标记(ϵ)是一个特殊标记,表示前一个符号的重复。在解码中,这些标记被简单地忽略。
- en: Conclusion[](#conclusion "Permalink to this heading")
id: totrans-75
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 结论[](#conclusion "跳转到此标题")
- en: In this tutorial, we looked at how to use [`Wav2Vec2ASRBundle`](../generated/torchaudio.pipelines.Wav2Vec2ASRBundle.html#torchaudio.pipelines.Wav2Vec2ASRBundle
"torchaudio.pipelines.Wav2Vec2ASRBundle") to perform acoustic feature extraction
and speech recognition. Constructing a model and getting the emission is as short
as two lines.
id: totrans-76
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,我们看了如何使用[`Wav2Vec2ASRBundle`](../generated/torchaudio.pipelines.Wav2Vec2ASRBundle.html#torchaudio.pipelines.Wav2Vec2ASRBundle)执行声学特征提取和语音识别。构建模型并获取发射只需两行代码。
- en: '[PRE19]'
id: totrans-77
prefs: []
type: TYPE_PRE
zh: '[PRE19]'
- en: '**Total running time of the script:** ( 0 minutes 6.833 seconds)'
id: totrans-78
prefs: []
type: TYPE_NORMAL
zh: '**脚本的总运行时间:**(0分钟6.833秒)'
- en: '[`Download Python source code: speech_recognition_pipeline_tutorial.py`](../_downloads/a0b5016bbf34fce4ac5549f4075dd10f/speech_recognition_pipeline_tutorial.py)'
id: totrans-79
prefs: []
type: TYPE_NORMAL
zh: '[`下载Python源代码:speech_recognition_pipeline_tutorial.py`](../_downloads/a0b5016bbf34fce4ac5549f4075dd10f/speech_recognition_pipeline_tutorial.py)'
- en: '[`Download Jupyter notebook: speech_recognition_pipeline_tutorial.ipynb`](../_downloads/ca83af2ea8d7db05fb63211d515b7fde/speech_recognition_pipeline_tutorial.ipynb)'
id: totrans-80
prefs: []
type: TYPE_NORMAL
zh: '[`下载Jupyter笔记本:speech_recognition_pipeline_tutorial.ipynb`](../_downloads/ca83af2ea8d7db05fb63211d515b7fde/speech_recognition_pipeline_tutorial.ipynb)'
- en: '[Gallery generated by Sphinx-Gallery](https://sphinx-gallery.github.io)'
id: totrans-81
prefs: []
type: TYPE_NORMAL
zh: '[Sphinx-Gallery生成的图库](https://sphinx-gallery.github.io)'
- en: ASR Inference with CUDA CTC Decoder
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 使用CUDA CTC解码器进行ASR推理
- en: 原文:[https://pytorch.org/audio/stable/tutorials/asr_inference_with_cuda_ctc_decoder_tutorial.html](https://pytorch.org/audio/stable/tutorials/asr_inference_with_cuda_ctc_decoder_tutorial.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/audio/stable/tutorials/asr_inference_with_cuda_ctc_decoder_tutorial.html](https://pytorch.org/audio/stable/tutorials/asr_inference_with_cuda_ctc_decoder_tutorial.html)
- en: Note
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: Click [here](#sphx-glr-download-tutorials-asr-inference-with-cuda-ctc-decoder-tutorial-py)
to download the full example code
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 点击[这里](#sphx-glr-download-tutorials-asr-inference-with-cuda-ctc-decoder-tutorial-py)下载完整示例代码
- en: '**Author**: [Yuekai Zhang](mailto:yuekaiz%40nvidia.com)'
id: totrans-4
prefs: []
type: TYPE_NORMAL
zh: '**作者**:[Yuekai Zhang](mailto:yuekaiz%40nvidia.com)'
- en: This tutorial shows how to perform speech recognition inference using a CUDA-based
CTC beam search decoder. We demonstrate this on a pretrained [Zipformer](https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_ctc)
model from [Next-gen Kaldi](https://nadirapovey.com/next-gen-kaldi-what-is-it)
project.
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: 本教程展示了如何使用基于CUDA的CTC波束搜索解码器执行语音识别推理。我们在来自[Next-gen Kaldi](https://nadirapovey.com/next-gen-kaldi-what-is-it)项目的预训练[Zipformer](https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_ctc)模型上演示了这一点。
- en: Overview[](#overview "Permalink to this heading")
id: totrans-6
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 概述[](#overview "跳转到此标题")
- en: Beam search decoding works by iteratively expanding text hypotheses (beams)
with next possible characters, and maintaining only the hypotheses with the highest
scores at each time step.
id: totrans-7
prefs: []
type: TYPE_NORMAL
zh: 波束搜索解码通过迭代地扩展文本假设(波束)与下一个可能的字符,并在每个时间步仅保留得分最高的假设来工作。
- en: The underlying implementation uses CUDA to accelerate the whole decoding process.
id: totrans-8
prefs: []
type: TYPE_NORMAL
zh: 底层实现使用CUDA来加速整个解码过程。
- en: A mathematical formula for the decoder can be
id: totrans-9
prefs: []
type: TYPE_NORMAL
zh: 解码器的数学公式可以
- en: found in the [paper](https://arxiv.org/pdf/1408.2873.pdf), and a more detailed
algorithm can be found in this [blog](https://distill.pub/2017/ctc/).
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: 在[论文](https://arxiv.org/pdf/1408.2873.pdf)中找到,并且更详细的算法可以在这个[博客](https://distill.pub/2017/ctc/)中找到。
- en: Running ASR inference using a CUDA CTC Beam Search decoder requires the following
components
id: totrans-11
prefs: []
type: TYPE_NORMAL
zh: 使用CUDA CTC波束搜索解码器运行ASR推理需要以下组件
- en: 'Acoustic Model: model predicting modeling units (BPE in this tutorial) from
acoustic features'
id: totrans-12
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 声学模型:从声学特征预测建模单元(本教程中为BPE)的模型
- en: 'BPE Model: the byte-pair encoding (BPE) tokenizer file'
id: totrans-13
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: BPE模型:字节对编码(BPE)分词器文件
- en: Acoustic Model and Set Up[](#acoustic-model-and-set-up "Permalink to this heading")
id: totrans-14
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 声学模型和设置[](#acoustic-model-and-set-up "跳转到此标题")
- en: First we import the necessary utilities and fetch the data that we are working
with
id: totrans-15
prefs: []
type: TYPE_NORMAL
zh: 首先,我们导入必要的工具并获取我们要处理的数据
- en: '[PRE0]'
id: totrans-16
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
- en: '[PRE1]'
id: totrans-17
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
- en: '[PRE2]'
id: totrans-18
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
- en: We use the pretrained [Zipformer](https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-ctc-2022-12-01)
model that is trained on the [LibriSpeech dataset](http://www.openslr.org/12).
The model is jointly trained with CTC and Transducer loss functions. In this tutorial,
we only use CTC head of the model.
id: totrans-19
prefs: []
type: TYPE_NORMAL
zh: 我们使用预训练的[Zipformer](https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-ctc-2022-12-01)模型,该模型在[LibriSpeech数据集](http://www.openslr.org/12)上进行了训练。该模型同时使用CTC和Transducer损失函数进行训练。在本教程中,我们仅使用模型的CTC头部。
- en: '[PRE3]'
id: totrans-20
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
- en: '[PRE4]'
id: totrans-21
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
- en: We will load a sample from the LibriSpeech test-other dataset.
id: totrans-22
prefs: []
type: TYPE_NORMAL
zh: 我们将从LibriSpeech test-other数据集中加载一个样本。
- en: '[PRE5]'
id: totrans-23
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
- en: '[PRE6]'
id: totrans-24
prefs: []
type: TYPE_PRE
zh: '[PRE6]'
- en: null
id: totrans-25
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-26
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: The transcript corresponding to this audio file is
id: totrans-27
prefs: []
type: TYPE_NORMAL
zh: 与此音频文件对应的抄本是
- en: '[PRE7]'
id: totrans-28
prefs: []
type: TYPE_PRE
zh: '[PRE7]'
- en: Files and Data for Decoder[](#files-and-data-for-decoder "Permalink to this
heading")
id: totrans-29
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 解码器的文件和数据[](#files-and-data-for-decoder "跳转到此标题")
- en: Next, we load in our tokens from the BPE model, which is the tokenizer for
decoding.
prefs: []
type: TYPE_NORMAL
zh: 接下来,我们从BPE模型中加载我们的标记,这是用于解码的分词器。
- en: Tokens[](#tokens "Permalink to this heading")
id: totrans-31
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: 标记[](#tokens "跳转到此标题")
- en: The tokens are the possible symbols that the acoustic model can predict, including
the blank symbol in CTC. In this tutorial, it includes 500 BPE tokens. It can
either be passed in as a file, where each line consists of the tokens corresponding
to the same index, or as a list of tokens, each mapping to a unique index.
id: totrans-32
prefs: []
type: TYPE_NORMAL
zh: 标记是声学模型可以预测的可能符号,包括CTC中的空白符号。在本教程中,它包括500个BPE标记。它可以作为文件传入,其中每行包含与相同索引对应的标记,或作为标记列表传入,每个标记映射到一个唯一的索引。
- en: '[PRE8]'
id: totrans-33
prefs: []
type: TYPE_PRE
zh: '[PRE8]'
- en: '[PRE9]'
id: totrans-34
prefs: []
type: TYPE_PRE
zh: '[PRE9]'
- en: '[PRE10]'
id: totrans-35
prefs: []
type: TYPE_PRE
zh: '[PRE10]'
- en: Construct CUDA Decoder[](#construct-cuda-decoder "Permalink to this heading")
id: totrans-36
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 构建CUDA解码器[](#construct-cuda-decoder "跳转到此标题")
- en: In this tutorial, we will construct a CUDA beam search decoder. The decoder
can be constructed using the factory function [`cuda_ctc_decoder()`](../generated/torchaudio.models.decoder.cuda_ctc_decoder.html#torchaudio.models.decoder.cuda_ctc_decoder
"torchaudio.models.decoder.cuda_ctc_decoder").
id: totrans-37
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,我们将构建一个CUDA波束搜索解码器。可以使用工厂函数[`cuda_ctc_decoder()`](../generated/torchaudio.models.decoder.cuda_ctc_decoder.html#torchaudio.models.decoder.cuda_ctc_decoder
"torchaudio.models.decoder.cuda_ctc_decoder")来构建解码器。
- en: '[PRE11]'
id: totrans-38
prefs: []
type: TYPE_PRE
zh: '[PRE11]'
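A sketch of the construction call, assuming `tokens` was loaded from the BPE model earlier (either a path to the token file or a list of token strings):

```python
from torchaudio.models.decoder import cuda_ctc_decoder

cuda_decoder = cuda_ctc_decoder(
    tokens,                      # BPE tokens; the blank symbol is expected at index 0
    nbest=10,                    # number of hypotheses to return
    beam_size=10,                # beam width per decoding step
    blank_skip_threshold=0.95,   # prune frames dominated by the blank symbol
)
```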
- en: Run Inference[](#run-inference "Permalink to this heading")
id: totrans-39
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 运行推理[](#run-inference "跳转到此标题")
- en: Now that we have the data, acoustic model, and decoder, we can perform inference.
The output of the beam search decoder is of type [`CUCTCHypothesis`](../generated/torchaudio.models.decoder.CUCTCDecoder.html#torchaudio.models.decoder.CUCTCHypothesis
"torchaudio.models.decoder.CUCTCHypothesis"), consisting of the predicted token
IDs, words (symbols corresponding to the token IDs), and hypothesis scores. Recall
the transcript corresponding to the waveform is
id: totrans-40
prefs: []
type: TYPE_NORMAL
zh: 现在我们有了数据、声学模型和解码器,我们可以执行推理。波束搜索解码器的输出类型为[`CUCTCHypothesis`](../generated/torchaudio.models.decoder.CUCTCDecoder.html#torchaudio.models.decoder.CUCTCHypothesis
"torchaudio.models.decoder.CUCTCHypothesis"),包括预测的标记ID、单词(与标记ID对应的符号)和假设分数。回想一下与波形对应的抄本是
- en: '[PRE12]'
id: totrans-41
prefs: []
type: TYPE_PRE
zh: '[PRE12]'
- en: '[PRE13]'
id: totrans-42
prefs: []
type: TYPE_PRE
zh: '[PRE13]'
- en: '[PRE14]'
id: totrans-43
prefs: []
type: TYPE_PRE
zh: '[PRE14]'
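A hedged sketch of the decode call: the decoder consumes the CTC head's log-probabilities on a CUDA device, shape (batch, frames, num_tokens), plus int32 frame counts, and returns one list of hypotheses per utterance:

```python
import torch

# `log_prob` and `encoder_out_lens` are assumed to come from the acoustic model
results = cuda_decoder(log_prob, encoder_out_lens.to(torch.int32))

best = results[0][0]  # top hypothesis for the first utterance
# One plausible way to stitch BPE pieces back into text:
text = "".join(best.words).replace("▁", " ").strip()
print(best.score, text)
```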
- en: The CUDA CTC decoder gives the following result.
id: totrans-44
prefs: []
type: TYPE_NORMAL
zh: CUDA CTC解码器给出以下结果。
- en: '[PRE15]'
id: totrans-45
prefs: []
type: TYPE_PRE
zh: '[PRE15]'
- en: '[PRE16]'
id: totrans-46
prefs: []
type: TYPE_PRE
zh: '[PRE16]'
- en: Beam Search Decoder Parameters[](#beam-search-decoder-parameters "Permalink
to this heading")
id: totrans-47
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 波束搜索解码器参数[](#beam-search-decoder-parameters "跳转到此标题")
- en: In this section, we go a little bit more in depth about some different parameters
and tradeoffs. For the full list of customizable parameters, please refer to the
[`documentation`](../generated/torchaudio.models.decoder.cuda_ctc_decoder.html#torchaudio.models.decoder.cuda_ctc_decoder
"torchaudio.models.decoder.cuda_ctc_decoder").
id: totrans-48
prefs: []
type: TYPE_NORMAL
zh: 在本节中,我们将更深入地讨论一些不同参数和权衡。有关可自定义参数的完整列表,请参考[`文档`](../generated/torchaudio.models.decoder.cuda_ctc_decoder.html#torchaudio.models.decoder.cuda_ctc_decoder
"torchaudio.models.decoder.cuda_ctc_decoder")。
- en: Helper Function[](#helper-function "Permalink to this heading")
id: totrans-49
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: 辅助函数[](#helper-function "跳转到此标题")
- en: '[PRE17]'
id: totrans-50
prefs: []
type: TYPE_PRE
zh: '[PRE17]'
- en: nbest[](#nbest "Permalink to this heading")
id: totrans-51
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: nbest[](#nbest "跳转到此标题")
- en: This parameter indicates the number of best hypotheses to return. For instance,
by setting `nbest=10` when constructing the beam search decoder earlier, we can
now access the hypotheses with the top 10 scores.
id: totrans-52
prefs: []
type: TYPE_NORMAL
zh: 此参数表示要返回的最佳假设数量。例如,在之前构建波束搜索解码器时设置 `nbest=10`,现在我们可以访问得分前10名的假设。
- en: '[PRE18]'
id: totrans-53
prefs: []
type: TYPE_PRE
zh: '[PRE18]'
- en: '[PRE19]'
id: totrans-54
prefs: []
type: TYPE_PRE
zh: '[PRE19]'
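A sketch of walking the n-best list, under the same assumptions as above:

```python
for i, hyp in enumerate(results[0]):  # best hypothesis first
    text = "".join(hyp.words).replace("▁", " ").strip()
    print(f"{i}: score={hyp.score:.3f} {text}")
```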
- en: beam size[](#beam-size "Permalink to this heading")
id: totrans-55
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: 波束大小[](#beam-size "跳转到此标题")
- en: The `beam_size` parameter determines the maximum number of best hypotheses to
hold after each decoding step. Using larger beam sizes allows exploring a larger
range of possible hypotheses, which can produce hypotheses with higher scores, but
it does not provide additional gains beyond a certain point. We recommend setting
`beam_size=10` for the CUDA beam search decoder.
id: totrans-56
prefs: []
type: TYPE_NORMAL
zh: '`beam_size`参数确定每个解码步骤后保留的最佳假设数量上限。使用更大的波束大小可以探索更广泛的可能假设范围,这可以产生得分更高的假设,但在一定程度上不会提供额外的收益。我们建议为cuda波束搜索解码器设置`beam_size=10`。'
- en: In the example below, we see improvement in decoding quality as we increase
beam size from 1 to 3, but notice how using a beam size of 3 provides the same
output as beam size 10.
id: totrans-57
prefs: []
type: TYPE_NORMAL
zh: 在下面的示例中,我们可以看到随着波束大小从1增加到3,解码质量有所提高,但请注意,使用波束大小为3时提供与波束大小为10相同的输出。
- en: '[PRE20]'
id: totrans-58
prefs: []
type: TYPE_PRE
zh: '[PRE20]'
- en: '[PRE21]'
id: totrans-59
prefs: []
type: TYPE_PRE
zh: '[PRE21]'
- en: blank skip threshold[](#blank-skip-threshold "Permalink to this heading")
id: totrans-60
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: blank skip threshold[](#blank-skip-threshold "跳转到此标题")
- en: The `blank_skip_threshold` parameter is used to prune frames which have a large
blank probability. Pruning these frames with a good `blank_skip_threshold` can speed
up the decoding process significantly with no drop in accuracy. Per the CTC rule,
we keep at least one blank frame between two non-blank frames to avoid mistakenly
merging two consecutive identical symbols. We recommend setting `blank_skip_threshold=0.95`
for the CUDA beam search decoder.
id: totrans-61
prefs: []
type: TYPE_NORMAL
zh: '`blank_skip_threshold`参数用于修剪具有较大空白概率的帧。使用良好的`blank_skip_threshold`修剪这些帧可以大大加快解码过程,而不会降低准确性。根据CTC规则,我们应至少在两个非空白帧之间保留一个空白帧,以避免错误地合并两个连续相同的符号。我们建议为cuda波束搜索解码器设置`blank_skip_threshold=0.95`。'
- en: '[PRE22]'
id: totrans-62
prefs: []
type: TYPE_PRE
zh: '[PRE22]'
- en: '[PRE23]'
id: totrans-63
prefs: []
type: TYPE_PRE
zh: '[PRE23]'
- en: Benchmark with flashlight CPU decoder[](#benchmark-with-flashlight-cpu-decoder
"Permalink to this heading")
id: totrans-64
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 使用Flashlight CPU解码器进行基准测试[](#benchmark-with-flashlight-cpu-decoder "跳转到此标题")
- en: We benchmark the throughput and accuracy between CUDA decoder and CPU decoder
using librispeech test_other set. To reproduce below benchmark results, you may
refer [here](https://github.com/pytorch/audio/tree/main/examples/asr/librispeech_cuda_ctc_decoder).
id: totrans-65
prefs: []
type: TYPE_NORMAL
zh: 我们使用librispeech test_other数据集对CUDA解码器和CPU解码器之间的吞吐量和准确性进行基准测试。要重现下面的基准测试结果,您可以参考[这里](https://github.com/pytorch/audio/tree/main/examples/asr/librispeech_cuda_ctc_decoder)。
- en: '| Decoder | Setting | WER (%) | N-Best Oracle WER (%) | Decoding Time (seconds)
|'
id: totrans-66
prefs: []
type: TYPE_TB
zh: '| 解码器 | 设置 | WER (%) | N-Best Oracle WER (%) | 解码时间(秒) |'
- en: '| --- | --- | --- | --- | --- |'
id: totrans-67
prefs: []
type: TYPE_TB
zh: '| --- | --- | --- | --- | --- |'
- en: '| CUDA decoder | blank_skip_threshold 0.95 | 5.81 | 4.11 | 2.57 |'
id: totrans-68
prefs: []
type: TYPE_TB
zh: '| CUDA解码器 | blank_skip_threshold 0.95 | 5.81 | 4.11 | 2.57 |'
- en: '| CUDA decoder | blank_skip_threshold 1.0 (no frame-skip) | 5.81 | 4.09 | 6.24
|'
id: totrans-69
prefs: []
type: TYPE_TB
zh: '| CUDA解码器 | blank_skip_threshold 1.0 (无帧跳过) | 5.81 | 4.09 | 6.24 |'
- en: '| CPU decoder | beam_size_token 10 | 5.86 | 4.30 | 28.61 |'
id: totrans-70
prefs: []
type: TYPE_TB
zh: '| CPU解码器 | beam_size_token 10 | 5.86 | 4.30 | 28.61 |'
- en: '| CPU decoder | beam_size_token 500 | 5.86 | 4.30 | 791.80 |'
id: totrans-71
prefs: []
type: TYPE_TB
zh: '| CPU解码器 | beam_size_token 500 | 5.86 | 4.30 | 791.80 |'
- en: From the above table, the CUDA decoder gives a slight improvement in WER and
a significant increase in throughput.
id: totrans-72
prefs: []
type: TYPE_NORMAL
zh: 从上表中可以看出,CUDA解码器在WER方面略有改善,并且吞吐量显著增加。
- en: '**Total running time of the script:** ( 0 minutes 8.752 seconds)'
id: totrans-73
prefs: []
type: TYPE_NORMAL
zh: '**脚本的总运行时间:** ( 0 分钟 8.752 秒)'
- en: '[`Download Python source code: asr_inference_with_cuda_ctc_decoder_tutorial.py`](../_downloads/3956cf493d21711e687e9610c91f9cd1/asr_inference_with_cuda_ctc_decoder_tutorial.py)'
id: totrans-74
prefs: []
type: TYPE_NORMAL
zh: '[`下载Python源代码: asr_inference_with_cuda_ctc_decoder_tutorial.py`](../_downloads/3956cf493d21711e687e9610c91f9cd1/asr_inference_with_cuda_ctc_decoder_tutorial.py)'
- en: '[`Download Jupyter notebook: asr_inference_with_cuda_ctc_decoder_tutorial.ipynb`](../_downloads/96982138e59c541534342222a3f5c69e/asr_inference_with_cuda_ctc_decoder_tutorial.ipynb)'
id: totrans-75
prefs: []
type: TYPE_NORMAL
zh: '[`下载Jupyter笔记本: asr_inference_with_cuda_ctc_decoder_tutorial.ipynb`](../_downloads/96982138e59c541534342222a3f5c69e/asr_inference_with_cuda_ctc_decoder_tutorial.ipynb)'
- en: '[Gallery generated by Sphinx-Gallery](https://sphinx-gallery.github.io)'
id: totrans-76
prefs: []
type: TYPE_NORMAL
zh: '[Sphinx-Gallery生成的图库](https://sphinx-gallery.github.io)'
- en: Online ASR with Emformer RNN-T
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 使用Emformer RNN-T进行在线ASR
- en: 原文:[https://pytorch.org/audio/stable/tutorials/online_asr_tutorial.html](https://pytorch.org/audio/stable/tutorials/online_asr_tutorial.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/audio/stable/tutorials/online_asr_tutorial.html](https://pytorch.org/audio/stable/tutorials/online_asr_tutorial.html)
- en: Note
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: Click [here](#sphx-glr-download-tutorials-online-asr-tutorial-py) to download
the full example code
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 点击[这里](#sphx-glr-download-tutorials-online-asr-tutorial-py)下载完整示例代码
- en: '**Author**: [Jeff Hwang](mailto:jeffhwang%40meta.com), [Moto Hira](mailto:moto%40meta.com)'
id: totrans-4
prefs: []
type: TYPE_NORMAL
zh: '**作者**:[Jeff Hwang](mailto:jeffhwang%40meta.com), [Moto Hira](mailto:moto%40meta.com)'
- en: This tutorial shows how to use Emformer RNN-T and streaming API to perform online
speech recognition.
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: 本教程展示了如何使用Emformer RNN-T和流式API执行在线语音识别。
- en: Note
id: totrans-6
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: This tutorial requires FFmpeg libraries and SentencePiece.
id: totrans-7
prefs: []
type: TYPE_NORMAL
zh: 本教程需要使用FFmpeg库和SentencePiece。
- en: Please refer to [Optional Dependencies](../installation.html#optional-dependencies)
for the detail.
id: totrans-8
prefs: []
type: TYPE_NORMAL
zh: 有关详细信息,请参阅[可选依赖项](../installation.html#optional-dependencies)。
- en: 1\. Overview[](#overview "Permalink to this heading")
id: totrans-9
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 1\. 概述[](#overview "跳转到此标题的永久链接")
- en: Performing online speech recognition is composed of the following steps
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: 在线语音识别的执行由以下步骤组成
- en: 'Build the inference pipeline. Emformer RNN-T is composed of three components:
feature extractor, decoder and token processor.'
id: totrans-11
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 构建推理管道。Emformer RNN-T由三个组件组成:特征提取器、解码器和标记处理器。
- en: Format the waveform into chunks of expected sizes.
id: totrans-12
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 将波形格式化为预期大小的块。
- en: Pass data through the pipeline.
id: totrans-13
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 通过管道传递数据。
- en: 2\. Preparation[](#preparation "Permalink to this heading")
id: totrans-14
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 2\. 准备[](#preparation "跳转到此标题的永久链接")
- en: '[PRE0]'
id: totrans-15
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
- en: '[PRE1]'
id: totrans-16
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
- en: 3\. Construct the pipeline[](#construct-the-pipeline "Permalink to this heading")
id: totrans-17
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 3\. 构建管道[](#construct-the-pipeline "跳转到此标题的永久链接")
- en: Pre-trained model weights and related pipeline components are bundled as [`torchaudio.pipelines.RNNTBundle`](../generated/torchaudio.pipelines.RNNTBundle.html#torchaudio.pipelines.RNNTBundle
"torchaudio.pipelines.RNNTBundle").
id: totrans-18
prefs: []
type: TYPE_NORMAL
zh: 预训练模型权重和相关管道组件被捆绑为[`torchaudio.pipelines.RNNTBundle`](../generated/torchaudio.pipelines.RNNTBundle.html#torchaudio.pipelines.RNNTBundle)。
- en: We use [`torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH`](../generated/torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH.html#torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH
"torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH"), which is an Emformer RNN-T
model trained on the LibriSpeech dataset.
id: totrans-19
prefs: []
type: TYPE_NORMAL
zh: 我们使用[`torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH`](../generated/torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH.html#torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH),这是在LibriSpeech数据集上训练的Emformer
RNN-T模型。
- en: '[PRE2]'
id: totrans-20
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
- en: '[PRE3]'
id: totrans-21
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
- en: Streaming inference works on input data with overlap. Emformer RNN-T model treats
the newest portion of the input data as the “right context” — a preview of future
context. In each inference call, the model expects the main segment to start from
this right context from the previous inference call. The following figure illustrates
this.
id: totrans-22
prefs: []
type: TYPE_NORMAL
zh: 流式推理适用于具有重叠的输入数据。Emformer RNN-T模型将输入数据的最新部分视为“右上下文” —— 未来上下文的预览。在每次推理调用中,模型期望主段从上一次推理调用的右上下文开始。以下图示说明了这一点。
- en: '![https://download.pytorch.org/torchaudio/tutorial-assets/emformer_rnnt_context.png](../Images/0e1c9a1ab0a1725ac44a8f5ae79784d9.png)'
id: totrans-23
prefs: []
type: TYPE_IMG
zh: '![https://download.pytorch.org/torchaudio/tutorial-assets/emformer_rnnt_context.png](../Images/0e1c9a1ab0a1725ac44a8f5ae79784d9.png)'
- en: The size of the main segment and right context, along with the expected sample
rate, can be retrieved from the bundle.
id: totrans-24
prefs: []
type: TYPE_NORMAL
zh: 主段和右上下文的大小,以及预期的采样率可以从bundle中检索。
- en: '[PRE4]'
id: totrans-25
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
- en: '[PRE5]'
id: totrans-26
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
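A sketch of pulling the components and geometry out of the bundle (the real values are printed by the elided code above):

```python
import torchaudio

bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH

feature_extractor = bundle.get_streaming_feature_extractor()
decoder = bundle.get_decoder()                 # RNNTBeamSearch
token_processor = bundle.get_token_processor()

sample_rate = bundle.sample_rate
# Segment and right-context sizes are in feature frames; multiply by
# hop_length to get the corresponding number of audio samples.
segment_length = bundle.segment_length * bundle.hop_length
context_length = bundle.right_context_length * bundle.hop_length
```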
- en: 4\. Configure the audio stream[](#configure-the-audio-stream "Permalink to this
heading")
id: totrans-27
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 4\. 配置音频流[](#configure-the-audio-stream "跳转到此标题的永久链接")
- en: Next, we configure the input audio stream using [`torchaudio.io.StreamReader`](../generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader
"torchaudio.io.StreamReader").
id: totrans-28
prefs: []
type: TYPE_NORMAL
zh: 接下来,我们使用[`torchaudio.io.StreamReader`](../generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader)配置输入音频流。
- en: For the detail of this API, please refer to the [StreamReader Basic Usage](./streamreader_basic_tutorial.html).
id: totrans-29
prefs: []
type: TYPE_NORMAL
zh: 有关此API的详细信息,请参阅[StreamReader基本用法](./streamreader_basic_tutorial.html)。
- en: The following audio file was originally published by LibriVox project, and it
is in the public domain.
id: totrans-30
prefs: []
type: TYPE_NORMAL
zh: 以下音频文件最初由LibriVox项目发布,属于公共领域。
- en: '[https://librivox.org/great-pirate-stories-by-joseph-lewis-french/](https://librivox.org/great-pirate-stories-by-joseph-lewis-french/)'
id: totrans-31
prefs: []
type: TYPE_NORMAL
zh: '[https://librivox.org/great-pirate-stories-by-joseph-lewis-french/](https://librivox.org/great-pirate-stories-by-joseph-lewis-french/)'
- en: It was re-uploaded for the sake of the tutorial.
id: totrans-32
prefs: []
type: TYPE_NORMAL
zh: 出于教程目的,它被重新上传。
- en: '[PRE6]'
id: totrans-33
prefs: []
type: TYPE_PRE
zh: '[PRE6]'
- en: '[PRE7]'
id: totrans-34
prefs: []
type: TYPE_PRE
zh: '[PRE7]'
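A sketch of the stream configuration, assuming the bundle-derived sizes above and a hypothetical source URL:

```python
from torchaudio.io import StreamReader

src = "https://example.com/great_pirate_stories.mp3"  # hypothetical re-upload URL

streamer = StreamReader(src)
streamer.add_basic_audio_stream(
    frames_per_chunk=segment_length,   # samples per main segment
    sample_rate=bundle.sample_rate,    # resample on the fly to the expected rate
)
```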
- en: As previously explained, Emformer RNN-T model expects input data with overlaps;
however, Streamer iterates the source media without overlap, so we make a helper
structure that caches a part of input data from Streamer as right context and
then appends it to the next input data from Streamer.
id: totrans-35
prefs: []
type: TYPE_NORMAL
zh: 如前所述,Emformer RNN-T模型期望具有重叠的输入数据;然而,Streamer在没有重叠的情况下迭代源媒体,因此我们制作了一个辅助结构,从Streamer缓存一部分输入数据作为右上下文,然后将其附加到来自Streamer的下一个输入数据。
- en: The following figure illustrates this.
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: 以下图示说明了这一点。
- en: '![https://download.pytorch.org/torchaudio/tutorial-assets/emformer_rnnt_streamer_context.png](../Images/a57362a983bfc8977c146b9cec1fbdc5.png)'
id: totrans-37
prefs: []
type: TYPE_IMG
zh: '![https://download.pytorch.org/torchaudio/tutorial-assets/emformer_rnnt_streamer_context.png](../Images/a57362a983bfc8977c146b9cec1fbdc5.png)'
- en: '[PRE8]'
id: totrans-38
prefs: []
type: TYPE_PRE
zh: '[PRE8]'
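A sketch of such a cacher, consistent with the figure above (assuming 1-D sample chunks and zero-initialized context):

```python
import torch

class ContextCacher:
    def __init__(self, segment_length: int, context_length: int):
        self.segment_length = segment_length
        self.context_length = context_length
        self.context = torch.zeros([context_length])

    def __call__(self, chunk: torch.Tensor) -> torch.Tensor:
        # Pad the last, possibly short, chunk up to the segment size
        if chunk.size(0) < self.segment_length:
            chunk = torch.nn.functional.pad(chunk, (0, self.segment_length - chunk.size(0)))
        # Prepend the cached tail of the previous chunk, then cache the new tail
        chunk_with_context = torch.cat((self.context, chunk))
        self.context = chunk[-self.context_length:]
        return chunk_with_context
```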
- en: 5\. Run stream inference[](#run-stream-inference "Permalink to this heading")
id: totrans-39
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 5\. 运行流推理[](#run-stream-inference "跳转到此标题的永久链接")
- en: Finally, we run the recognition.
id: totrans-40
prefs: []
type: TYPE_NORMAL
zh: 最后,我们运行识别。
- en: First, we initialize the stream iterator, context cacher, and state and hypothesis
that are used by decoder to carry over the decoding state between inference calls.
id: totrans-41
prefs: []
type: TYPE_NORMAL
zh: 首先,我们初始化流迭代器、上下文缓存器以及解码器使用的状态和假设,用于在推理调用之间传递解码状态。
- en: '[PRE9]'
id: totrans-42
prefs: []
type: TYPE_PRE
zh: '[PRE9]'
- en: Next, we run the inference.
id: totrans-43
prefs: []
type: TYPE_NORMAL
zh: 接下来,我们运行推理。
- en: For the sake of better display, we create a helper function which processes
the source stream for a given number of iterations, and call it repeatedly.
id: totrans-44
prefs: []
type: TYPE_NORMAL
zh: 为了更好地显示,我们创建了一个辅助函数,该函数处理源流指定的次数,并重复调用它。
- en: '[PRE10]'
id: totrans-45
prefs: []
type: TYPE_PRE
zh: '[PRE10]'
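A condensed sketch of the loop that the helper wraps, assuming the `streamer`, `ContextCacher`, and pipeline components from the sketches above:

```python
state, hypothesis = None, None
cacher = ContextCacher(segment_length, context_length)

for (chunk,) in streamer.stream():
    segment = cacher(chunk[:, 0])                     # mono samples, 1-D
    features, length = feature_extractor(segment)
    hypos, state = decoder.infer(
        features, length, 10, state=state, hypothesis=hypothesis
    )
    hypothesis = hypos[0]
    print(token_processor(hypothesis[0]), end="", flush=True)
```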
- en: '[PRE11]'
id: totrans-46
prefs: []
type: TYPE_PRE
zh: '[PRE11]'
- en: '![MelSpectrogram Feature](../Images/6f88cad1fa15680732704d2ab1568895.png)'
id: totrans-47
prefs: []
type: TYPE_IMG
zh: '![MelSpectrogram特征](../Images/6f88cad1fa15680732704d2ab1568895.png)'
- en: '[PRE12]'
id: totrans-48
prefs: []
type: TYPE_PRE
zh: '[PRE12]'
- en: null
id: totrans-49
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-50
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: '[PRE13]'
id: totrans-51
prefs: []
type: TYPE_PRE
zh: '[PRE13]'
- en: '![MelSpectrogram Feature](../Images/63ea9ff950b6828668774e9e16e2da72.png)'
id: totrans-52
prefs: []
type: TYPE_IMG
zh: '![Mel频谱特征](../Images/63ea9ff950b6828668774e9e16e2da72.png)'
- en: '[PRE14]'
id: totrans-53
prefs: []
type: TYPE_PRE
zh: '[PRE14]'
- en: null
id: totrans-54
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-55
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: '[PRE15]'
id: totrans-56
prefs: []
type: TYPE_PRE
zh: '[PRE15]'
- en: '![MelSpectrogram Feature](../Images/9fd0eaf340cc4769da822a728893c8d0.png)'
id: totrans-57
prefs: []
type: TYPE_IMG
zh: '![Mel频谱特征](../Images/9fd0eaf340cc4769da822a728893c8d0.png)'
- en: '[PRE16]'
id: totrans-58
prefs: []
type: TYPE_PRE
zh: '[PRE16]'
- en: null
id: totrans-59
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-60
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: '[PRE17]'
id: totrans-61
prefs: []
type: TYPE_PRE
zh: '[PRE17]'
- en: '![MelSpectrogram Feature](../Images/27361e962edf9ff4e1dc7a554b09d885.png)'
id: totrans-62
prefs: []
type: TYPE_IMG
zh: '![Mel频谱特征](../Images/27361e962edf9ff4e1dc7a554b09d885.png)'
- en: '[PRE18]'
id: totrans-63
prefs: []
type: TYPE_PRE
zh: '[PRE18]'
- en: null
id: totrans-64
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-65
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: '[PRE19]'
id: totrans-66
prefs: []
type: TYPE_PRE
zh: '[PRE19]'
- en: '![MelSpectrogram Feature](../Images/78b4f08b9d73ca155002dca9b67d5139.png)'
id: totrans-67
prefs: []
type: TYPE_IMG
zh: '![Mel频谱特征](../Images/78b4f08b9d73ca155002dca9b67d5139.png)'
- en: '[PRE20]'
id: totrans-68
prefs: []
type: TYPE_PRE
zh: '[PRE20]'
- en: null
id: totrans-69
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-70
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: '[PRE21]'
id: totrans-71
prefs: []
type: TYPE_PRE
zh: '[PRE21]'
- en: '![MelSpectrogram Feature](../Images/8e43113644bb019dfc4bb4603e5bc696.png)'
id: totrans-72
prefs: []
type: TYPE_IMG
zh: '![Mel频谱特征](../Images/8e43113644bb019dfc4bb4603e5bc696.png)'
- en: '[PRE22]'
id: totrans-73
prefs: []
type: TYPE_PRE
zh: '[PRE22]'
- en: null
id: totrans-74
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-75
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: '[PRE23]'
id: totrans-76
prefs: []
type: TYPE_PRE
zh: '[PRE23]'
- en: '![MelSpectrogram Feature](../Images/74f496d6db06d496150b2e6b919a7fea.png)'
id: totrans-77
prefs: []
type: TYPE_IMG
zh: '![Mel频谱特征](../Images/74f496d6db06d496150b2e6b919a7fea.png)'
- en: '[PRE24]'
id: totrans-78
prefs: []
type: TYPE_PRE
zh: '[PRE24]'
- en: null
id: totrans-79
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-80
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: '[PRE25]'
id: totrans-81
prefs: []
type: TYPE_PRE
zh: '[PRE25]'
- en: '![MelSpectrogram Feature](../Images/1d8004d0bd1aaa132e299f5e7b3f4d65.png)'
id: totrans-82
prefs: []
type: TYPE_IMG
zh: '![Mel频谱特征](../Images/1d8004d0bd1aaa132e299f5e7b3f4d65.png)'
- en: '[PRE26]'
id: totrans-83
prefs: []
type: TYPE_PRE
zh: '[PRE26]'
- en: null
id: totrans-84
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-85
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: '[PRE27]'
id: totrans-86
prefs: []
type: TYPE_PRE
zh: '[PRE27]'
- en: '![MelSpectrogram Feature](../Images/078602e6329acdc28d9f151361d84fa4.png)'
id: totrans-87
prefs: []
type: TYPE_IMG
zh: '![Mel频谱特征](../Images/078602e6329acdc28d9f151361d84fa4.png)'
- en: '[PRE28]'
id: totrans-88
prefs: []
type: TYPE_PRE
zh: '[PRE28]'
- en: null
id: totrans-89
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-90
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: '[PRE29]'
id: totrans-91
prefs: []
type: TYPE_PRE
zh: '[PRE29]'
- en: '![MelSpectrogram Feature](../Images/09c62d29a7ebfdca810fb7715b4d6deb.png)'
id: totrans-92
prefs: []
type: TYPE_IMG
zh: '![Mel频谱特征](../Images/09c62d29a7ebfdca810fb7715b4d6deb.png)'
- en: '[PRE30]'
id: totrans-93
prefs: []
type: TYPE_PRE
zh: '[PRE30]'
- en: null
id: totrans-94
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-95
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: '[PRE31]'
id: totrans-96
prefs: []
type: TYPE_PRE
zh: '[PRE31]'
- en: '![MelSpectrogram Feature](../Images/bd6f77d39b92dab706c4579cee78d49b.png)'
id: totrans-97
prefs: []
type: TYPE_IMG
zh: '![Mel频谱特征](../Images/bd6f77d39b92dab706c4579cee78d49b.png)'
- en: '[PRE32]'
id: totrans-98
prefs: []
type: TYPE_PRE
zh: '[PRE32]'
- en: null
id: totrans-99
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-100
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: '[PRE33]'
id: totrans-101
prefs: []
type: TYPE_PRE
zh: '[PRE33]'
- en: '![MelSpectrogram Feature](../Images/1d08a0f2dfb8662795d4a456d55369b9.png)'
id: totrans-102
prefs: []
type: TYPE_IMG
zh: '![Mel频谱特征](../Images/1d08a0f2dfb8662795d4a456d55369b9.png)'
- en: '[PRE34]'
id: totrans-103
prefs: []
type: TYPE_PRE
zh: '[PRE34]'
- en: null
id: totrans-104
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-105
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: '[PRE35]'
id: totrans-106
prefs: []
type: TYPE_PRE
zh: '[PRE35]'
- en: '![MelSpectrogram Feature](../Images/b5ffe860eeae95b44bae565c68a36a14.png)'
id: totrans-107
prefs: []
type: TYPE_IMG
zh: '![Mel频谱特征](../Images/b5ffe860eeae95b44bae565c68a36a14.png)'
- en: '[PRE36]'
id: totrans-108
prefs: []
type: TYPE_PRE
zh: '[PRE36]'
- en: null
id: totrans-109
prefs: []
type: TYPE_NORMAL
- en: Your browser does not support the audio element.
id: totrans-110
prefs: []
type: TYPE_NORMAL
zh: 您的浏览器不支持音频元素。
- en: 'Tag: [`torchaudio.io`](../io.html#module-torchaudio.io "torchaudio.io")'
id: totrans-111
prefs: []
type: TYPE_NORMAL
zh: 标签:[`torchaudio.io`](../io.html#module-torchaudio.io "torchaudio.io")
- en: '**Total running time of the script:** ( 1 minutes 34.955 seconds)'
id: totrans-112
prefs: []
type: TYPE_NORMAL
zh: '**脚本的总运行时间:**(1分钟34.955秒)'
- en: '[`Download Python source code: online_asr_tutorial.py`](../_downloads/f9f593098569966df0b815e29c13dd20/online_asr_tutorial.py)'
id: totrans-113
prefs: []
type: TYPE_NORMAL
zh: '[`下载Python源代码:online_asr_tutorial.py`](../_downloads/f9f593098569966df0b815e29c13dd20/online_asr_tutorial.py)'
- en: '[`Download Jupyter notebook: online_asr_tutorial.ipynb`](../_downloads/bd34dff0656a1aa627d444a8d1a5957f/online_asr_tutorial.ipynb)'
id: totrans-114
prefs: []
type: TYPE_NORMAL
zh: '[`下载Jupyter笔记本:online_asr_tutorial.ipynb`](../_downloads/bd34dff0656a1aa627d444a8d1a5957f/online_asr_tutorial.ipynb)'
- en: '[Gallery generated by Sphinx-Gallery](https://sphinx-gallery.github.io)'
id: totrans-115
prefs: []
type: TYPE_NORMAL
zh: '[Sphinx-Gallery生成的图库](https://sphinx-gallery.github.io)'
- en: Device ASR with Emformer RNN-T
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 使用Emformer RNN-T的设备ASR
- en: 原文:[https://pytorch.org/audio/stable/tutorials/device_asr.html](https://pytorch.org/audio/stable/tutorials/device_asr.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/audio/stable/tutorials/device_asr.html](https://pytorch.org/audio/stable/tutorials/device_asr.html)
- en: Note
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: Click [here](#sphx-glr-download-tutorials-device-asr-py) to download the full
example code
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 点击[这里](#sphx-glr-download-tutorials-device-asr-py)下载完整示例代码
- en: '**Author**: [Moto Hira](mailto:moto%40meta.com), [Jeff Hwang](mailto:jeffhwang%40meta.com).'
id: totrans-4
prefs: []
type: TYPE_NORMAL
zh: '**作者**:[Moto Hira](mailto:moto%40meta.com), [Jeff Hwang](mailto:jeffhwang%40meta.com)。'
- en: This tutorial shows how to use Emformer RNN-T and streaming API to perform speech
recognition on a streaming device input, i.e. microphone on laptop.
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: 本教程展示了如何使用Emformer RNN-T和流式API在流式设备输入上执行语音识别,即笔记本电脑上的麦克风。
- en: Note
id: totrans-6
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: This tutorial requires FFmpeg libraries. Please refer to [FFmpeg dependency](../installation.html#ffmpeg-dependency)
for the detail.
id: totrans-7
prefs: []
type: TYPE_NORMAL
zh: 本教程需要FFmpeg库。请参考[FFmpeg依赖](../installation.html#ffmpeg-dependency)获取详细信息。
- en: Note
id: totrans-8
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: This tutorial was tested on MacBook Pro and Dynabook with Windows 10.
id: totrans-9
prefs: []
type: TYPE_NORMAL
zh: 本教程在MacBook Pro和安装了Windows 10的Dynabook上进行了测试。
- en: This tutorial does NOT work on Google Colab because the server running this
tutorial does not have a microphone that you can talk to.
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: 本教程在Google Colab上不起作用,因为运行本教程的服务器没有可以与之交谈的麦克风。
- en: 1\. Overview[](#overview "Permalink to this heading")
id: totrans-11
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 1\. 概述[](#overview "跳转到此标题")
- en: We use streaming API to fetch audio from audio device (microphone) chunk by
chunk, then run inference using Emformer RNN-T.
id: totrans-12
prefs: []
type: TYPE_NORMAL
zh: 我们使用流式API逐块从音频设备(麦克风)获取音频,然后使用Emformer RNN-T进行推理。
- en: For the basic usage of the streaming API and Emformer RNN-T please refer to
[StreamReader Basic Usage](./streamreader_basic_tutorial.html) and [Online ASR
with Emformer RNN-T](./online_asr_tutorial.html).
id: totrans-13
prefs: []
type: TYPE_NORMAL
zh: 有关流式API和Emformer RNN-T的基本用法,请参考[StreamReader基本用法](./streamreader_basic_tutorial.html)和[使用Emformer
RNN-T进行在线ASR](./online_asr_tutorial.html)。
- en: 2\. Checking the supported devices[](#checking-the-supported-devices "Permalink
to this heading")
id: totrans-14
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 2\. 检查支持的设备[](#checking-the-supported-devices "跳转到此标题")
- en: Firstly, we need to check the devices that Streaming API can access, and figure
out the arguments (`src` and `format`) we need to pass to [`StreamReader()`](../generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader
"torchaudio.io.StreamReader") class.
id: totrans-15
prefs: []
type: TYPE_NORMAL
zh: 首先,我们需要检查流式API可以访问的设备,并找出我们需要传递给[`StreamReader()`](../generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader
"torchaudio.io.StreamReader")类的参数(`src`和`format`)。
- en: We use `ffmpeg` command for this. `ffmpeg` abstracts away the difference of
underlying hardware implementations, but the expected value for `format` varies
across OS and each `format` defines different syntax for `src`.
id: totrans-16
prefs: []
type: TYPE_NORMAL
zh: 我们使用`ffmpeg`命令来实现。`ffmpeg`抽象了底层硬件实现的差异,但`format`的预期值在不同操作系统上有所不同,每个`format`定义了不同的`src`语法。
- en: The details of supported `format` values and `src` syntax can be found in [https://ffmpeg.org/ffmpeg-devices.html](https://ffmpeg.org/ffmpeg-devices.html).
id: totrans-17
prefs: []
type: TYPE_NORMAL
zh: 有关支持的`format`值和`src`语法的详细信息,请参考[https://ffmpeg.org/ffmpeg-devices.html](https://ffmpeg.org/ffmpeg-devices.html)。
- en: For macOS, the following command will list the available devices.
id: totrans-18
prefs: []
type: TYPE_NORMAL
zh: 对于macOS,以下命令将列出可用设备。
- en: '[PRE0]'
id: totrans-19
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
- en: We will use the following values for Streaming API.
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: 我们将为流式API使用以下值。
- en: '[PRE1]'
id: totrans-21
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
- en: For Windows, `dshow` device should work.
id: totrans-22
prefs: []
type: TYPE_NORMAL
zh: 对于Windows,`dshow`设备应该可以工作。
- en: '[PRE2]'
id: totrans-23
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
- en: In the above case, the following value can be used to stream from microphone.
id: totrans-24
prefs: []
type: TYPE_NORMAL
zh: 在上述情况下,可以使用以下值从麦克风进行流式传输。
- en: '[PRE3]'
id: totrans-25
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
- en: 3\. Data acquisition[](#data-acquisition "Permalink to this heading")
id: totrans-26
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 3\. 数据采集[](#data-acquisition "跳转到此标题")
- en: Streaming audio from microphone input requires properly timing data acquisition.
Failing to do so may introduce discontinuities in the data stream.
id: totrans-27
prefs: []
type: TYPE_NORMAL
zh: 从麦克风输入流式音频需要正确计时数据采集。如果未能这样做,可能会导致数据流中出现不连续性。
- en: For this reason, we will run the data acquisition in a subprocess.
id: totrans-28
prefs: []
type: TYPE_NORMAL
zh: 因此,我们将在子进程中运行数据采集。
- en: Firstly, we create a helper function that encapsulates the whole process executed
in the subprocess.
id: totrans-29
prefs: []
type: TYPE_NORMAL
zh: 首先,我们创建一个封装在子进程中执行的整个过程的辅助函数。
- en: This function initializes the streaming API, acquires data then puts it in a
queue, which the main process is watching.
id: totrans-30
prefs: []
type: TYPE_NORMAL
zh: 此函数初始化流式API,获取数据然后将其放入队列,主进程正在监视该队列。
- en: '[PRE4]'
id: totrans-31
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
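A sketch of such a helper, assuming `format`/`src` values discovered in section 2:

```python
def stream(q, format, src, segment_length, sample_rate):
    """Producer run in a subprocess: read fixed-size chunks from the
    microphone and hand them to the main process through a queue."""
    from torchaudio.io import StreamReader

    streamer = StreamReader(src, format=format)
    streamer.add_basic_audio_stream(
        frames_per_chunk=segment_length,
        sample_rate=sample_rate,
    )
    # timeout/backoff keep the C++ layer retrying when the device is not ready
    for (chunk,) in streamer.stream(timeout=-1, backoff=1.0):
        q.put(chunk)
```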
- en: The notable difference from non-device streaming is that we provide `timeout`
and `backoff` parameters to the `stream` method.
id: totrans-32
prefs: []
type: TYPE_NORMAL
zh: 与非设备流式的显着区别在于,我们为`stream`方法提供了`timeout`和`backoff`参数。
- en: When acquiring data, if the rate of acquisition requests is higher than that
at which the hardware can prepare the data, then the underlying implementation
reports a special error code, and expects client code to retry.
id: totrans-33
prefs: []
type: TYPE_NORMAL
zh: 在获取数据时,如果获取请求的速率高于硬件准备数据的速率,则底层实现会报告特殊的错误代码,并期望客户端代码重试。
- en: Precise timing is the key for smooth streaming. Reporting this error from the
low-level implementation all the way back to the Python layer before retrying adds
undesired overhead. For this reason, the retry behavior is implemented in the C++
layer, and the `timeout` and `backoff` parameters allow client code to control the
behavior.
id: totrans-34
prefs: []
type: TYPE_NORMAL
zh: 精确的时序是流畅流媒体的关键。从低级实现报告此错误一直返回到Python层,在重试之前会增加不必要的开销。因此,重试行为是在C++层实现的,`timeout`和`backoff`参数允许客户端代码控制行为。
- en: For the detail of `timeout` and `backoff` parameters, please refer to the documentation
of `stream()` method.
id: totrans-35
prefs: []
type: TYPE_NORMAL
zh: 有关`timeout`和`backoff`参数的详细信息,请参考`stream()`方法的文档。
- en: Note
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: The proper value of `backoff` depends on the system configuration. One way to
see if the `backoff` value is appropriate is to save the series of acquired chunks
as continuous audio and listen to it. If the `backoff` value is too large, then the
data stream is discontinuous. The resulting audio sounds sped up. If the `backoff`
value is too small or zero, the audio stream is fine, but the data acquisition process
enters a busy-waiting state, which increases the CPU consumption.
id: totrans-37
prefs: []
type: TYPE_NORMAL
zh: '`backoff`的适当值取决于系统配置。检查`backoff`值是否合适的一种方法是将获取的一系列块保存为连续音频并进行听取。如果`backoff`值太大,则数据流是不连续的。生成的音频听起来加快了。如果`backoff`值太小或为零,则音频流正常,但数据采集过程进入忙等待状态,这会增加CPU消耗。'
- en: 4\. Building inference pipeline[](#building-inference-pipeline "Permalink to
this heading")
id: totrans-38
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 4\. 构建推理流程[](#building-inference-pipeline "跳转到此标题")
- en: The next step is to create components required for inference.
id: totrans-39
prefs: []
type: TYPE_NORMAL
zh: 接下来的步骤是创建推理所需的组件。
- en: This is the same process as [Online ASR with Emformer RNN-T](./online_asr_tutorial.html).
id: totrans-40
prefs: []
type: TYPE_NORMAL
zh: 这与[使用Emformer RNN-T进行在线ASR](./online_asr_tutorial.html)是相同的流程。
- en: '[PRE5]'
id: totrans-41
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
- en: '[PRE6]'
id: totrans-42
prefs: []
type: TYPE_PRE
zh: '[PRE6]'
- en: 5\. The main process[](#the-main-process "Permalink to this heading")
id: totrans-43
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 5\. 主进程[](#the-main-process "跳转到此标题")
- en: 'The execution flow of the main process is as follows:'
id: totrans-44
prefs: []
type: TYPE_NORMAL
zh: 主进程的执行流程如下:
- en: Initialize the inference pipeline.
id: totrans-45
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 初始化推理流程。
- en: Launch data acquisition subprocess.
id: totrans-46
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 启动数据获取子进程。
- en: Run inference.
id: totrans-47
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 运行推理。
- en: Clean up.
id: totrans-48
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 清理。
- en: Note
id: totrans-49
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: As the data acquisition subprocess will be launched with the “spawn” method,
all the code in the global scope is executed in the subprocess as well.
id: totrans-50
prefs: []
type: TYPE_NORMAL
zh: 由于数据获取子进程将使用“spawn”方法启动,全局范围的所有代码也将在子进程中执行。
- en: We want to instantiate the pipeline only in the main process, so we put the
instantiation in a function and invoke it within the `__name__ == "__main__"` guard.
id: totrans-51
prefs: []
type: TYPE_NORMAL
zh: 我们希望只在主进程中实例化流程,因此我们将它们放在一个函数中,并在`__name__ == "__main__"`保护内调用它。
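Putting the four steps together, the skeleton of the main process could look like the sketch below. The `build_pipeline` helper and the `infer` method are hypothetical placeholders, and `stream` is the acquisition function sketched earlier; the tutorial's actual code is in the following block.

```python
# Hypothetical skeleton of the main process (illustrative only).
import torch.multiprocessing as mp


def main(device="avfoundation", src=":1", segment_length=2048, sample_rate=16000):
    # 1. Initialize the inference pipeline (main process only).
    pipeline = build_pipeline()  # hypothetical helper, e.g. the bundle setup above

    # 2. Launch the data acquisition subprocess with the "spawn" method.
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=stream, args=(q, device, src, segment_length, sample_rate))
    p.start()

    # 3. Run inference on chunks pulled from the queue.
    try:
        while True:
            chunk = q.get()
            pipeline.infer(chunk)  # hypothetical method emitting transcripts
    finally:
        # 4. Clean up.
        p.terminate()
        p.join()


if __name__ == "__main__":
    main()
```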
- en: '[PRE7]'
id: totrans-52
prefs: []
type: TYPE_PRE
zh: '[PRE7]'
- en: '[PRE8]'
id: totrans-53
prefs: []
type: TYPE_PRE
zh: '[PRE8]'
- en: 'Tag: [`torchaudio.io`](../io.html#module-torchaudio.io "torchaudio.io")'
id: totrans-54
prefs: []
type: TYPE_NORMAL
zh: 标签:[`torchaudio.io`](../io.html#module-torchaudio.io "torchaudio.io")
- en: '**Total running time of the script:** ( 0 minutes 0.000 seconds)'
id: totrans-55
prefs: []
type: TYPE_NORMAL
zh: '**脚本的总运行时间:**(0分钟0.000秒)'
- en: '[`Download Python source code: device_asr.py`](../_downloads/8009eae2a3a1a322f175ecc138597775/device_asr.py)'
id: totrans-56
prefs: []
type: TYPE_NORMAL
zh: '[`下载Python源代码:device_asr.py`](../_downloads/8009eae2a3a1a322f175ecc138597775/device_asr.py)'
- en: '[`Download Jupyter notebook: device_asr.ipynb`](../_downloads/c8265c298ed19ff44b504d5c3aa72563/device_asr.ipynb)'
id: totrans-57
prefs: []
type: TYPE_NORMAL
zh: '[`下载Jupyter笔记本:device_asr.ipynb`](../_downloads/c8265c298ed19ff44b504d5c3aa72563/device_asr.ipynb)'
- en: '[Gallery generated by Sphinx-Gallery](https://sphinx-gallery.github.io)'
id: totrans-58
prefs: []
type: TYPE_NORMAL
zh: '[Sphinx-Gallery生成的画廊](https://sphinx-gallery.github.io)'
- en: Device AV-ASR with Emformer RNN-T
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 使用Emformer RNN-T的设备AV-ASR
- en: 原文:[https://pytorch.org/audio/stable/tutorials/device_avsr.html](https://pytorch.org/audio/stable/tutorials/device_avsr.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/audio/stable/tutorials/device_avsr.html](https://pytorch.org/audio/stable/tutorials/device_avsr.html)
- en: Note
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: Click [here](#sphx-glr-download-tutorials-device-avsr-py) to download the full
example code
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 点击[这里](#sphx-glr-download-tutorials-device-avsr-py)下载完整示例代码
- en: '**Author**: [Pingchuan Ma](mailto:pingchuanma%40meta.com), [Moto Hira](mailto:moto%40meta.com).'
id: totrans-4
prefs: []
type: TYPE_NORMAL
zh: '**作者**:[Pingchuan Ma](mailto:pingchuanma%40meta.com), [Moto Hira](mailto:moto%40meta.com)。'
- en: This tutorial shows how to run on-device audio-visual speech recognition (AV-ASR,
or AVSR) with TorchAudio on a streaming device input, i.e., the microphone on a laptop.
AV-ASR is the task of transcribing text from audio and visual streams, which has
recently attracted a lot of research attention due to its robustness against noise.
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: 本教程展示了如何在流设备输入上(即笔记本电脑上的麦克风)使用TorchAudio运行设备上的音频-视觉语音识别(AV-ASR或AVSR)。AV-ASR是从音频和视觉流中转录文本的任务,最近因其对噪声的稳健性而引起了许多研究的关注。
- en: Note
id: totrans-6
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: This tutorial requires ffmpeg, sentencepiece, mediapipe, opencv-python and scikit-image
libraries.
id: totrans-7
prefs: []
type: TYPE_NORMAL
zh: 此教程需要ffmpeg、sentencepiece、mediapipe、opencv-python和scikit-image库。
- en: There are multiple ways to install the FFmpeg libraries. If you are using the
Anaconda Python distribution, `conda install -c conda-forge 'ffmpeg<7'` will install
compatible FFmpeg libraries.
id: totrans-8
prefs: []
type: TYPE_NORMAL
zh: 有多种安装ffmpeg库的方法。如果您使用Anaconda Python发行版,`conda install -c conda-forge 'ffmpeg<7'`将安装兼容的FFmpeg库。
- en: You can run `pip install sentencepiece mediapipe opencv-python scikit-image`
to install the other libraries mentioned.
id: totrans-9
prefs: []
type: TYPE_NORMAL
zh: 您可以运行`pip install sentencepiece mediapipe opencv-python scikit-image`来安装其他提到的库。
- en: Note
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: To run this tutorial, please make sure you are in the tutorial folder.
id: totrans-11
prefs: []
type: TYPE_NORMAL
zh: 要运行此教程,请确保您在教程文件夹中。
- en: Note
id: totrans-12
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: We tested this tutorial with torchaudio version 2.0.2 on a MacBook Pro (M1 Pro).
id: totrans-13
prefs: []
type: TYPE_NORMAL
zh: 我们在Macbook Pro(M1 Pro)上测试了torchaudio版本2.0.2上的教程。
- en: '[PRE0]'
id: totrans-14
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
- en: Overview[](#overview "Permalink to this heading")
id: totrans-15
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 概述[](#overview "跳转到此标题")
- en: The real-time AV-ASR system is presented as follows, which consists of three
components, a data collection module, a pre-processing module and an end-to-end
model. The data collection module is hardware, such as a microphone and camera.
Its role is to collect information from the real world. Once the information is
collected, the pre-processing module locates and crops out the face. Next, we feed
the raw audio stream and the pre-processed video stream into our end-to-end model
for inference.
id: totrans-16
prefs: []
type: TYPE_NORMAL
zh: 实时AV-ASR系统如下所示,由三个组件组成,即数据收集模块、预处理模块和端到端模型。数据收集模块是硬件,如麦克风和摄像头。它的作用是从现实世界收集信息。一旦信息被收集,预处理模块会定位和裁剪出脸部。接下来,我们将原始音频流和预处理的视频流馈送到我们的端到端模型进行推断。
- en: '![https://download.pytorch.org/torchaudio/doc-assets/avsr/overview.png](../Images/757b2c4226d175a3a1b0d10e928d909c.png)'
id: totrans-17
prefs: []
type: TYPE_IMG
zh: '![https://download.pytorch.org/torchaudio/doc-assets/avsr/overview.png](../Images/757b2c4226d175a3a1b0d10e928d909c.png)'
- en: 1\. Data acquisition[](#data-acquisition "Permalink to this heading")
id: totrans-18
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 1\. 数据采集[](#data-acquisition "跳转到此标题")
- en: Firstly, we define the function to collect videos from the microphone and camera.
To be specific, we use the [`StreamReader`](../generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader
"torchaudio.io.StreamReader") class for the purpose of data collection, which
supports capturing audio/video from the microphone and camera. For the detailed
usage of this class, please refer to the [tutorial](./streamreader_basic_tutorial.html).
id: totrans-19
prefs: []
type: TYPE_NORMAL
zh: 首先,我们定义了从麦克风和摄像头收集视频的函数。具体来说,我们使用[`StreamReader`](../generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader
"torchaudio.io.StreamReader")类来进行数据收集,该类支持从麦克风和摄像头捕获音频/视频。有关此类的详细用法,请参考[教程](./streamreader_basic_tutorial.html)。
- en: '[PRE1]'
id: totrans-20
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
- en: 2\. Pre-processing[](#pre-processing "Permalink to this heading")
id: totrans-21
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 2\. 预处理[](#pre-processing "跳转到此标题")
- en: Before feeding the raw stream into our model, each video sequence has to undergo
a specific pre-processing procedure. This involves three critical steps. The first
step is to perform face detection. Following that, each individual frame is aligned
to a reference frame, commonly known as the mean face, in order to normalize
rotation and size differences across frames. The final step of the pre-processing
module is to crop the face region from the aligned face image.
id: totrans-22
prefs: []
type: TYPE_NORMAL
zh: 在将原始流馈送到我们的模型之前,每个视频序列都必须经过特定的预处理过程。这涉及三个关键步骤。第一步是进行人脸检测。随后,将每个单独的帧对齐到一个参考帧,通常称为平均脸,以规范化帧之间的旋转和大小差异。预处理模块中的最后一步是从对齐的人脸图像中裁剪出脸部区域。
- en: '| ![https://download.pytorch.org/torchaudio/doc-assets/avsr/original.gif](../Images/b9142268a9c0666c9697c22b10755a18.png)
| ![https://download.pytorch.org/torchaudio/doc-assets/avsr/detected.gif](../Images/b44fd7d78a200f7ef203259295e21a8a.png)
| ![https://download.pytorch.org/torchaudio/doc-assets/avsr/transformed.gif](../Images/7029d284337ec7c2222d6b4344ac49d0.png)
| ![https://download.pytorch.org/torchaudio/doc-assets/avsr/cropped.gif](../Images/5aa4bb57e0b31b6d34ac3b4766e5503f.png)
|'
id: totrans-23
prefs: []
type: TYPE_TB
zh: '| ![https://download.pytorch.org/torchaudio/doc-assets/avsr/original.gif](../Images/b9142268a9c0666c9697c22b10755a18.png)
| ![https://download.pytorch.org/torchaudio/doc-assets/avsr/detected.gif](../Images/b44fd7d78a200f7ef203259295e21a8a.png)
| ![https://download.pytorch.org/torchaudio/doc-assets/avsr/transformed.gif](../Images/7029d284337ec7c2222d6b4344ac49d0.png)
| ![https://download.pytorch.org/torchaudio/doc-assets/avsr/cropped.gif](../Images/5aa4bb57e0b31b6d34ac3b4766e5503f.png)
|'
- en: '|'
id: totrans-24
prefs: []
type: TYPE_NORMAL
zh: '|'
- en: Original
id: totrans-25
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 原始
- en: '|'
id: totrans-26
prefs: []
type: TYPE_NORMAL
zh: '|'
- en: Detected
id: totrans-27
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 检测
- en: '|'
id: totrans-28
prefs: []
type: TYPE_NORMAL
zh: '|'
- en: Transformed
id: totrans-29
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 转换
- en: '|'
id: totrans-30
prefs: []
type: TYPE_NORMAL
zh: '|'
- en: Cropped
id: totrans-31
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 裁剪
- en: '|'
id: totrans-32
prefs: []
type: TYPE_NORMAL
zh: '|'
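For illustration only, the alignment and cropping steps can be sketched with scikit-image as below; the landmark detector, the `mean_face` reference landmarks, and the crop size are assumptions, and the tutorial's actual implementation is in the following code block.

```python
# Sketch of steps 2-3: align detected landmarks to a reference ("mean face")
# and crop the face region. Step 1 (detection) is assumed to provide
# `landmarks`; the tutorial uses mediapipe for that part.
from skimage import transform


def align_and_crop(frame, landmarks, mean_face, size=96):
    # Estimate a similarity transform that maps the detected landmarks
    # onto the reference landmarks (normalizes rotation and scale).
    tform = transform.SimilarityTransform()
    tform.estimate(landmarks, mean_face)
    warped = transform.warp(frame, tform.inverse, output_shape=frame.shape[:2])
    # Center-crop the face region from the aligned image.
    h, w = warped.shape[:2]
    y0, x0 = (h - size) // 2, (w - size) // 2
    return warped[y0:y0 + size, x0:x0 + size]
```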
- en: '[PRE2]'
id: totrans-33
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
- en: 3\. Building inference pipeline[](#building-inference-pipeline "Permalink to
this heading")
id: totrans-34
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 3\. 构建推断管道[](#building-inference-pipeline "跳转到此标题")
- en: The next step is to create the components required for the pipeline.
id: totrans-35
prefs: []
type: TYPE_NORMAL
zh: 下一步是创建管道所需的组件。
- en: We use convolution-based front-ends to extract features from both the raw
audio and video streams. These features are then passed through a two-layer MLP
for fusion. For our transducer model, we leverage the TorchAudio library, which
incorporates an encoder (Emformer), a predictor, and a joint network. The architecture
of the proposed AV-ASR model is illustrated as follows.
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: 我们使用基于卷积的前端从原始音频和视频流中提取特征。然后,这些特征通过两层MLP进行融合。对于我们的转录器模型,我们利用了TorchAudio库,该库包含一个编码器(Emformer)、一个预测器和一个联合网络。所提出的AV-ASR模型的架构如下所示。
- en: '![https://download.pytorch.org/torchaudio/doc-assets/avsr/architecture.png](../Images/ed7f525d50ee520d70b7e9c6f6b7fd66.png)'
id: totrans-37
prefs: []
type: TYPE_IMG
zh: '![https://download.pytorch.org/torchaudio/doc-assets/avsr/architecture.png](../Images/ed7f525d50ee520d70b7e9c6f6b7fd66.png)'
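To make the fusion step concrete, here is a hedged sketch of a two-layer MLP fusing per-modality features; the dimensions and layer choices are illustrative assumptions, and the actual model construction is in the code block that follows.

```python
# Sketch: fuse audio and video features with a two-layer MLP.
import torch
from torch import nn


class FusionMLP(nn.Module):
    def __init__(self, audio_dim=512, video_dim=512, hidden_dim=1024, out_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + video_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, audio_feats, video_feats):
        # Both inputs are (batch, time, dim); concatenate along features.
        return self.mlp(torch.cat([audio_feats, video_feats], dim=-1))
```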
- en: '[PRE3]'
id: totrans-38
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
- en: 4\. The main process[](#the-main-process "Permalink to this heading")
id: totrans-39
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 4\. 主进程[](#the-main-process "跳转到此标题")
- en: 'The execution flow of the main process is as follows:'
id: totrans-40
prefs: []
type: TYPE_NORMAL
zh: 主进程的执行流程如下:
- en: Initialize the inference pipeline.
id: totrans-41
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 初始化推断流程。
- en: Launch data acquisition subprocess.
id: totrans-42
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 启动数据采集子进程。
- en: Run inference.
id: totrans-43
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 运行推断。
- en: Clean up.
id: totrans-44
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 清理。
- en: '[PRE4]'
id: totrans-45
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
- en: '[PRE5]'
id: totrans-46
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
- en: 'Tag: [`torchaudio.io`](../io.html#module-torchaudio.io "torchaudio.io")'
id: totrans-47
prefs: []
type: TYPE_NORMAL
zh: 标签:[`torchaudio.io`](../io.html#module-torchaudio.io "torchaudio.io")
- en: '**Total running time of the script:** ( 0 minutes 0.000 seconds)'
id: totrans-48
prefs: []
type: TYPE_NORMAL
zh: '**脚本的总运行时间:**(0分钟0.000秒)'
- en: '[`Download Python source code: device_avsr.py`](../_downloads/e10abb57121274b0bbaca74dbbd1fbc4/device_avsr.py)'
id: totrans-49
prefs: []
type: TYPE_NORMAL
zh: '[`下载Python源代码:device_avsr.py`](../_downloads/e10abb57121274b0bbaca74dbbd1fbc4/device_avsr.py)'
- en: '[`Download Jupyter notebook: device_avsr.ipynb`](../_downloads/eb72a6f2273304a15352dfcf3b824b42/device_avsr.ipynb)'
id: totrans-50
prefs: []
type: TYPE_NORMAL
zh: '[`下载Jupyter笔记本:device_avsr.ipynb`](../_downloads/eb72a6f2273304a15352dfcf3b824b42/device_avsr.ipynb)'
- en: '[Gallery generated by Sphinx-Gallery](https://sphinx-gallery.github.io)'
id: totrans-51
prefs: []
type: TYPE_NORMAL
zh: '[Sphinx-Gallery生成的图库](https://sphinx-gallery.github.io)'
- en: Training Recipes
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 训练食谱
- en: Python API Reference
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: Python API 参考文档
- en: torchaudio.models.decoder
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: torchaudio.models.decoder
- en: 原文:[https://pytorch.org/audio/stable/models.decoder.html](https://pytorch.org/audio/stable/models.decoder.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/audio/stable/models.decoder.html](https://pytorch.org/audio/stable/models.decoder.html)
- en: '## CTC Decoder[](#ctc-decoder "Permalink to this heading")'
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: '## CTC解码器[](#ctc-decoder "跳转到此标题")'
- en: '| [`CTCDecoder`](generated/torchaudio.models.decoder.CTCDecoder.html#torchaudio.models.decoder.CTCDecoder
"torchaudio.models.decoder.CTCDecoder") | CTC beam search decoder from *Flashlight*
[[Kahn *et al.*, 2022](references.html#id35 "Jacob Kahn, Vineel Pratap, Tatiana
Likhomanenko, Qiantong Xu, Awni Hannun, Jeff Cai, Paden Tomasello, Ann Lee, Edouard
Grave, Gilad Avidov, and others. Flashlight: enabling innovation in tools for
machine learning. arXiv preprint arXiv:2201.12465, 2022.")]. |'
id: totrans-3
prefs: []
type: TYPE_TB
zh: '| [`CTCDecoder`](generated/torchaudio.models.decoder.CTCDecoder.html#torchaudio.models.decoder.CTCDecoder
"torchaudio.models.decoder.CTCDecoder") | 来自 *Flashlight* 的CTC波束搜索解码器 [[Kahn *et
al.*, 2022](references.html#id35 "Jacob Kahn, Vineel Pratap, Tatiana Likhomanenko,
Qiantong Xu, Awni Hannun, Jeff Cai, Paden Tomasello, Ann Lee, Edouard Grave, Gilad
Avidov, and others. Flashlight: enabling innovation in tools for machine learning.
arXiv preprint arXiv:2201.12465, 2022.")]。 |'
- en: '| [`ctc_decoder`](generated/torchaudio.models.decoder.ctc_decoder.html#torchaudio.models.decoder.ctc_decoder
"torchaudio.models.decoder.ctc_decoder") | Builds an instance of [`CTCDecoder`](generated/torchaudio.models.decoder.CTCDecoder.html#torchaudio.models.decoder.CTCDecoder
"torchaudio.models.decoder.CTCDecoder"). |'
id: totrans-4
prefs: []
type: TYPE_TB
zh: '| [`ctc_decoder`](generated/torchaudio.models.decoder.ctc_decoder.html#torchaudio.models.decoder.ctc_decoder
"torchaudio.models.decoder.ctc_decoder") | 构建 [`CTCDecoder`](generated/torchaudio.models.decoder.CTCDecoder.html#torchaudio.models.decoder.CTCDecoder
"torchaudio.models.decoder.CTCDecoder") 的实例。 |'
- en: '| [`download_pretrained_files`](generated/torchaudio.models.decoder.download_pretrained_files.html#torchaudio.models.decoder.download_pretrained_files
"torchaudio.models.decoder.download_pretrained_files") | Retrieves pretrained
data files used for [`ctc_decoder()`](generated/torchaudio.models.decoder.ctc_decoder.html#torchaudio.models.decoder.ctc_decoder
"torchaudio.models.decoder.ctc_decoder"). |'
id: totrans-5
prefs: []
type: TYPE_TB
zh: '| [`download_pretrained_files`](generated/torchaudio.models.decoder.download_pretrained_files.html#torchaudio.models.decoder.download_pretrained_files
"torchaudio.models.decoder.download_pretrained_files") | 获取用于 [`ctc_decoder()`](generated/torchaudio.models.decoder.ctc_decoder.html#torchaudio.models.decoder.ctc_decoder
"torchaudio.models.decoder.ctc_decoder") 的预训练数据文件。 |'
- en: Tutorials using CTC Decoder
id: totrans-6
prefs: []
type: TYPE_NORMAL
zh: 使用CTC解码器的教程
- en: '![ASR Inference with CTC Decoder](../Images/260e63239576cae8ee00cfcba8e4889e.png)'
id: totrans-7
prefs: []
type: TYPE_IMG
zh: '![使用CTC解码器的ASR推理](../Images/260e63239576cae8ee00cfcba8e4889e.png)'
- en: '[ASR Inference with CTC Decoder](tutorials/asr_inference_with_ctc_decoder_tutorial.html#sphx-glr-tutorials-asr-inference-with-ctc-decoder-tutorial-py)'
id: totrans-8
prefs: []
type: TYPE_NORMAL
zh: '[使用CTC解码器的ASR推理](tutorials/asr_inference_with_ctc_decoder_tutorial.html#sphx-glr-tutorials-asr-inference-with-ctc-decoder-tutorial-py)'
- en: ASR Inference with CTC Decoder
id: totrans-9
prefs: []
type: TYPE_NORMAL
zh: 使用CTC解码器的ASR推理
- en: CUDA CTC Decoder[](#cuda-ctc-decoder "Permalink to this heading")
id: totrans-10
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: CUDA CTC解码器[](#cuda-ctc-decoder "跳转到此标题")
- en: '| [`CUCTCDecoder`](generated/torchaudio.models.decoder.CUCTCDecoder.html#torchaudio.models.decoder.CUCTCDecoder
"torchaudio.models.decoder.CUCTCDecoder") | CUDA CTC beam search decoder. |'
id: totrans-11
prefs: []
type: TYPE_TB
zh: '| [`CUCTCDecoder`](generated/torchaudio.models.decoder.CUCTCDecoder.html#torchaudio.models.decoder.CUCTCDecoder
"torchaudio.models.decoder.CUCTCDecoder") | CUDA CTC波束搜索解码器。 |'
- en: '| [`cuda_ctc_decoder`](generated/torchaudio.models.decoder.cuda_ctc_decoder.html#torchaudio.models.decoder.cuda_ctc_decoder
"torchaudio.models.decoder.cuda_ctc_decoder") | Builds an instance of [`CUCTCDecoder`](generated/torchaudio.models.decoder.CUCTCDecoder.html#torchaudio.models.decoder.CUCTCDecoder
"torchaudio.models.decoder.CUCTCDecoder"). |'
id: totrans-12
prefs: []
type: TYPE_TB
zh: '| [`cuda_ctc_decoder`](generated/torchaudio.models.decoder.cuda_ctc_decoder.html#torchaudio.models.decoder.cuda_ctc_decoder
"torchaudio.models.decoder.cuda_ctc_decoder") | 构建 [`CUCTCDecoder`](generated/torchaudio.models.decoder.CUCTCDecoder.html#torchaudio.models.decoder.CUCTCDecoder
"torchaudio.models.decoder.CUCTCDecoder") 的实例。 |'
- en: Tutorials using CUDA CTC Decoder
id: totrans-13
prefs: []
type: TYPE_NORMAL
zh: 使用CUDA CTC解码器的教程
- en: '![ASR Inference with CUDA CTC Decoder](../Images/9d0a043104707d980656cfaf03fdd1a1.png)'
id: totrans-14
prefs: []
type: TYPE_IMG
zh: '![使用CUDA CTC解码器的ASR推理](../Images/9d0a043104707d980656cfaf03fdd1a1.png)'
- en: '[ASR Inference with CUDA CTC Decoder](tutorials/asr_inference_with_cuda_ctc_decoder_tutorial.html#sphx-glr-tutorials-asr-inference-with-cuda-ctc-decoder-tutorial-py)'
id: totrans-15
prefs: []
type: TYPE_NORMAL
zh: '[使用CUDA CTC解码器的ASR推理](tutorials/asr_inference_with_cuda_ctc_decoder_tutorial.html#sphx-glr-tutorials-asr-inference-with-cuda-ctc-decoder-tutorial-py)'
- en: ASR Inference with CUDA CTC Decoder
id: totrans-16
prefs: []
type: TYPE_NORMAL
zh: 使用CUDA CTC解码器的ASR推理
- en: torio.utils
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: torio.utils
- en: 原文:[https://pytorch.org/audio/stable/torio.utils.html](https://pytorch.org/audio/stable/torio.utils.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/audio/stable/torio.utils.html](https://pytorch.org/audio/stable/torio.utils.html)
- en: '`torio.utils` module contains utility functions to query and configure the
global state of third party libraries.'
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: '`torio.utils` 模块包含用于查询和配置第三方库全局状态的实用函数。'
- en: '| [`ffmpeg_utils`](generated/torio.utils.ffmpeg_utils.html#module-torio.utils.ffmpeg_utils
"torio.utils.ffmpeg_utils") | Module to change the configuration of FFmpeg libraries
(such as libavformat). |'
id: totrans-3
prefs: []
type: TYPE_TB
zh: '| [`ffmpeg_utils`](generated/torio.utils.ffmpeg_utils.html#module-torio.utils.ffmpeg_utils
"torio.utils.ffmpeg_utils") | 用于更改 FFmpeg 库(如 libavformat)配置的模块。 '
- en: Python Prototype API Reference
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: Python 原型 API 参考