- en: torchaudio.pipelines id: totrans-0 prefs: - PREF_H1 type: TYPE_NORMAL zh: torchaudio.pipelines - en: 原文:[https://pytorch.org/audio/stable/pipelines.html](https://pytorch.org/audio/stable/pipelines.html) id: totrans-1 prefs: - PREF_BQ type: TYPE_NORMAL zh: 原文:[https://pytorch.org/audio/stable/pipelines.html](https://pytorch.org/audio/stable/pipelines.html) - en: The `torchaudio.pipelines` module packages pre-trained models with support functions and meta-data into simple APIs tailored to perform specific tasks. id: totrans-2 prefs: [] type: TYPE_NORMAL zh: '`torchaudio.pipelines`模块将预训练模型与支持函数和元数据打包成简单的API,以执行特定任务。' - en: When using pre-trained models to perform a task, in addition to instantiating the model with pre-trained weights, the client code also needs to build pipelines for feature extraction and post-processing in the same way they were done during the training. This requires carrying over the information used during training, such as the type of transforms and their parameters (for example, the sampling rate and the number of FFT bins). id: totrans-3 prefs: [] type: TYPE_NORMAL zh: 当使用预训练模型执行任务时,除了使用预训练权重实例化模型外,客户端代码还需要以与训练期间相同的方式构建特征提取和后处理流水线。这需要将训练期间使用的信息传递过去,比如变换的类型和参数(例如,采样率和FFT频点数量)。 - en: To tie this information to a pre-trained model and make it easily accessible, the `torchaudio.pipelines` module uses the concept of a Bundle class, which defines a set of APIs to instantiate pipelines, and the interface of the pipelines. id: totrans-4 prefs: [] type: TYPE_NORMAL zh: 为了将这些信息与预训练模型绑定并轻松访问,`torchaudio.pipelines`模块使用Bundle类的概念,该类定义了一组API来实例化流水线和流水线的接口。 - en: The following figure illustrates this. id: totrans-5 prefs: [] type: TYPE_NORMAL zh: 以下图示说明了这一点。 - en: '![https://download.pytorch.org/torchaudio/doc-assets/pipelines-intro.png](../Images/7dc27a33a67f5b02c554368a2500bcb8.png)' id: totrans-6 prefs: [] type: TYPE_IMG zh: '![https://download.pytorch.org/torchaudio/doc-assets/pipelines-intro.png](../Images/7dc27a33a67f5b02c554368a2500bcb8.png)' - en: A pre-trained model and associated pipelines are expressed as an instance of `Bundle`. Different instances of the same `Bundle` share the interface, but their implementations are not constrained to be of the same type. For example, [`SourceSeparationBundle`](generated/torchaudio.pipelines.SourceSeparationBundle.html#torchaudio.pipelines.SourceSeparationBundle "torchaudio.pipelines.SourceSeparationBundle") defines the interface for performing source separation, but its instance [`CONVTASNET_BASE_LIBRI2MIX`](generated/torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX.html#torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX "torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX") instantiates a model of [`ConvTasNet`](generated/torchaudio.models.ConvTasNet.html#torchaudio.models.ConvTasNet "torchaudio.models.ConvTasNet") while [`HDEMUCS_HIGH_MUSDB`](generated/torchaudio.pipelines.HDEMUCS_HIGH_MUSDB.html#torchaudio.pipelines.HDEMUCS_HIGH_MUSDB "torchaudio.pipelines.HDEMUCS_HIGH_MUSDB") instantiates a model of [`HDemucs`](generated/torchaudio.models.HDemucs.html#torchaudio.models.HDemucs "torchaudio.models.HDemucs"). Still, because they share the same interface, the usage is the same.
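Because the interface is shared, client code written against a `Bundle` stays the same when the underlying model changes. A minimal sketch of the source-separation case described above, assuming a local mixture file (the file name and the printed shape are illustrative, not part of the library documentation):

```python
import torch
import torchaudio

# Pick one concrete instance of SourceSeparationBundle; swapping in HDEMUCS_HIGH_MUSDB
# would leave the surrounding code unchanged (only sample rate and source count differ).
bundle = torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX
model = bundle.get_model()  # ConvTasNet instantiated with pre-trained weights

waveform, sample_rate = torchaudio.load("mixture.wav")  # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    # ConvTasNet expects (batch, channel, time); the output stacks the separated sources.
    sources = model(waveform.unsqueeze(0))
print(sources.shape)  # e.g. (1, 2, num_frames) for the two speakers of Libri2Mix
```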
id: totrans-7 prefs: [] type: TYPE_NORMAL zh: 预训练模型和相关流水线被表示为`Bundle`的实例。相同`Bundle`的不同实例共享接口,但它们的实现不受限于相同类型。例如,[`SourceSeparationBundle`](generated/torchaudio.pipelines.SourceSeparationBundle.html#torchaudio.pipelines.SourceSeparationBundle "torchaudio.pipelines.SourceSeparationBundle")定义了执行源分离的接口,但其实例[`CONVTASNET_BASE_LIBRI2MIX`](generated/torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX.html#torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX "torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX")实例化了一个[`ConvTasNet`](generated/torchaudio.models.ConvTasNet.html#torchaudio.models.ConvTasNet "torchaudio.models.ConvTasNet")模型,而[`HDEMUCS_HIGH_MUSDB`](generated/torchaudio.pipelines.HDEMUCS_HIGH_MUSDB.html#torchaudio.pipelines.HDEMUCS_HIGH_MUSDB "torchaudio.pipelines.HDEMUCS_HIGH_MUSDB")实例化了一个[`HDemucs`](generated/torchaudio.models.HDemucs.html#torchaudio.models.HDemucs "torchaudio.models.HDemucs")模型。尽管如此,因为它们共享相同的接口,使用方式是相同的。 - en: Note id: totrans-8 prefs: [] type: TYPE_NORMAL zh: 注意 - en: Under the hood, the implementations of `Bundle` use components from other `torchaudio` modules, such as [`torchaudio.models`](models.html#module-torchaudio.models "torchaudio.models") and [`torchaudio.transforms`](transforms.html#module-torchaudio.transforms "torchaudio.transforms"), or even third party libraries like [SentencePiece](https://github.com/google/sentencepiece) and [DeepPhonemizer](https://github.com/as-ideas/DeepPhonemizer). But this implementation detail is abstracted away from library users. id: totrans-9 prefs: [] type: TYPE_NORMAL zh: 在底层,`Bundle`的实现使用了来自其他`torchaudio`模块的组件,比如[`torchaudio.models`](models.html#module-torchaudio.models "torchaudio.models")和[`torchaudio.transforms`](transforms.html#module-torchaudio.transforms "torchaudio.transforms"),甚至第三方库如[SentencePiece](https://github.com/google/sentencepiece)和[DeepPhonemizer](https://github.com/as-ideas/DeepPhonemizer)。但这些实现细节对库用户是屏蔽的。 - en: '## RNN-T Streaming/Non-Streaming ASR[](#rnn-t-streaming-non-streaming-asr "Permalink to this heading")' id: totrans-10 prefs: [] type: TYPE_NORMAL zh: '## RNN-T流式/非流式ASR[](#rnn-t-streaming-non-streaming-asr "Permalink to this heading")' - en: Interface[](#interface "Permalink to this heading") id: totrans-11 prefs: - PREF_H3 type: TYPE_NORMAL zh: 接口[](#interface "Permalink to this heading") - en: '`RNNTBundle` defines ASR pipelines and consists of three steps: feature extraction, inference, and de-tokenization.' id: totrans-12 prefs: [] type: TYPE_NORMAL zh: '`RNNTBundle`定义了ASR流水线,包括三个步骤:特征提取、推理和去标记化。' - en: '![https://download.pytorch.org/torchaudio/doc-assets/pipelines-rnntbundle.png](../Images/d53f88ebd8f526f56982a4de4848dcaf.png)' id: totrans-13 prefs: [] type: TYPE_IMG zh: '![https://download.pytorch.org/torchaudio/doc-assets/pipelines-rnntbundle.png](../Images/d53f88ebd8f526f56982a4de4848dcaf.png)' - en: '| [`RNNTBundle`](generated/torchaudio.pipelines.RNNTBundle.html#torchaudio.pipelines.RNNTBundle "torchaudio.pipelines.RNNTBundle") | Dataclass that bundles components for performing automatic speech recognition (ASR, speech-to-text) inference with an RNN-T model.
|' id: totrans-14 prefs: [] type: TYPE_TB zh: '| [`RNNTBundle`](generated/torchaudio.pipelines.RNNTBundle.html#torchaudio.pipelines.RNNTBundle "torchaudio.pipelines.RNNTBundle") | 用于执行自动语音识别(ASR,语音转文本)推理的RNN-T模型的组件捆绑数据类。 |' - en: '| [`RNNTBundle.FeatureExtractor`](generated/torchaudio.pipelines.RNNTBundle.FeatureExtractor.html#torchaudio.pipelines.RNNTBundle.FeatureExtractor "torchaudio.pipelines.RNNTBundle.FeatureExtractor") | Interface of the feature extraction part of RNN-T pipeline |' id: totrans-15 prefs: [] type: TYPE_TB zh: '| [`RNNTBundle.FeatureExtractor`](generated/torchaudio.pipelines.RNNTBundle.FeatureExtractor.html#torchaudio.pipelines.RNNTBundle.FeatureExtractor "torchaudio.pipelines.RNNTBundle.FeatureExtractor") | RNN-T流水线中特征提取部分的接口 |' - en: '| [`RNNTBundle.TokenProcessor`](generated/torchaudio.pipelines.RNNTBundle.TokenProcessor.html#torchaudio.pipelines.RNNTBundle.TokenProcessor "torchaudio.pipelines.RNNTBundle.TokenProcessor") | Interface of the token processor part of RNN-T pipeline |' id: totrans-16 prefs: [] type: TYPE_TB zh: '| [`RNNTBundle.TokenProcessor`](generated/torchaudio.pipelines.RNNTBundle.TokenProcessor.html#torchaudio.pipelines.RNNTBundle.TokenProcessor "torchaudio.pipelines.RNNTBundle.TokenProcessor") | RNN-T流水线中标记处理器部分的接口 |' - en: Tutorials using `RNNTBundle` id: totrans-17 prefs: [] type: TYPE_NORMAL zh: 使用`RNNTBundle`的教程 - en: '![Online ASR with Emformer RNN-T](../Images/200081d049505bef5c1ce8e3c321134d.png)' id: totrans-18 prefs: [] type: TYPE_IMG zh: '![在线 ASR 与 Emformer RNN-T](../Images/200081d049505bef5c1ce8e3c321134d.png)' - en: '[Online ASR with Emformer RNN-T](tutorials/online_asr_tutorial.html#sphx-glr-tutorials-online-asr-tutorial-py)' id: totrans-19 prefs: [] type: TYPE_NORMAL zh: '[在线 ASR 与 Emformer RNN-T](tutorials/online_asr_tutorial.html#sphx-glr-tutorials-online-asr-tutorial-py)' - en: Online ASR with Emformer RNN-T![Device ASR with Emformer RNN-T](../Images/62ca7f96e6d3a3011aa85c2a9228f03f.png) id: totrans-20 prefs: [] type: TYPE_NORMAL zh: 在线 ASR 与 Emformer RNN-T![设备 ASR 与 Emformer RNN-T](../Images/62ca7f96e6d3a3011aa85c2a9228f03f.png) - en: '[Device ASR with Emformer RNN-T](tutorials/device_asr.html#sphx-glr-tutorials-device-asr-py)' id: totrans-21 prefs: [] type: TYPE_NORMAL zh: '[设备 ASR 与 Emformer RNN-T](tutorials/device_asr.html#sphx-glr-tutorials-device-asr-py)' - en: Device ASR with Emformer RNN-T id: totrans-22 prefs: [] type: TYPE_NORMAL zh: 设备 ASR 与 Emformer RNN-T - en: Pretrained Models[](#pretrained-models "Permalink to this heading") id: totrans-23 prefs: - PREF_H3 type: TYPE_NORMAL zh: 预训练模型[](#pretrained-models "跳转到此标题") - en: '| [`EMFORMER_RNNT_BASE_LIBRISPEECH`](generated/torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH.html#torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH "torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH") | ASR pipeline based on Emformer-RNNT, pretrained on *LibriSpeech* dataset [[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")], capable of performing both streaming and non-streaming inference. 
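As a rough illustration of the three-step RNN-T interface described above (feature extraction, inference, de-tokenization), a non-streaming sketch using this bundle could look as follows; the audio path is a placeholder, and the streaming variant with `get_streaming_feature_extractor()` is covered in the tutorials linked above:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH
feature_extractor = bundle.get_feature_extractor()  # waveform -> acoustic features
decoder = bundle.get_decoder()                       # RNN-T model wrapped in beam search
token_processor = bundle.get_token_processor()       # token IDs -> text

waveform, sample_rate = torchaudio.load("speech.wav")  # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    features, length = feature_extractor(waveform.squeeze())
    hypotheses = decoder(features, length, 10)  # keep the 10 best hypotheses
print(token_processor(hypotheses[0][0]))         # de-tokenize the best hypothesis
```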
|' id: totrans-24 prefs: [] type: TYPE_TB zh: '| [`EMFORMER_RNNT_BASE_LIBRISPEECH`](generated/torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH.html#torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH "torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH") | 基于 Emformer-RNNT 的 ASR 流水线,在 *LibriSpeech* 数据集上进行预训练[[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")], 能够执行流式和非流式推理。 |' - en: wav2vec 2.0 / HuBERT / WavLM - SSL[](#wav2vec-2-0-hubert-wavlm-ssl "Permalink to this heading") id: totrans-25 prefs: - PREF_H2 type: TYPE_NORMAL zh: wav2vec 2.0 / HuBERT / WavLM - SSL[](#wav2vec-2-0-hubert-wavlm-ssl "跳转到此标题") - en: Interface[](#id2 "Permalink to this heading") id: totrans-26 prefs: - PREF_H3 type: TYPE_NORMAL zh: 界面[](#id2 "跳转到此标题") - en: '`Wav2Vec2Bundle` instantiates models that generate acoustic features that can be used for downstream inference and fine-tuning.' id: totrans-27 prefs: [] type: TYPE_NORMAL zh: '`Wav2Vec2Bundle` 实例化生成声学特征的模型,可用于下游推理和微调。' - en: '![https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2bundle.png](../Images/7a92fa41c1718aa05693226b9462514d.png)' id: totrans-28 prefs: [] type: TYPE_IMG zh: '![https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2bundle.png](../Images/7a92fa41c1718aa05693226b9462514d.png)' - en: '| [`Wav2Vec2Bundle`](generated/torchaudio.pipelines.Wav2Vec2Bundle.html#torchaudio.pipelines.Wav2Vec2Bundle "torchaudio.pipelines.Wav2Vec2Bundle") | Data class that bundles associated information to use pretrained [`Wav2Vec2Model`](generated/torchaudio.models.Wav2Vec2Model.html#torchaudio.models.Wav2Vec2Model "torchaudio.models.Wav2Vec2Model"). |' id: totrans-29 prefs: [] type: TYPE_TB zh: '| [`Wav2Vec2Bundle`](generated/torchaudio.pipelines.Wav2Vec2Bundle.html#torchaudio.pipelines.Wav2Vec2Bundle "torchaudio.pipelines.Wav2Vec2Bundle") | 数据类,捆绑相关信息以使用预训练的 [`Wav2Vec2Model`](generated/torchaudio.models.Wav2Vec2Model.html#torchaudio.models.Wav2Vec2Model "torchaudio.models.Wav2Vec2Model")。 |' - en: Pretrained Models[](#id3 "Permalink to this heading") id: totrans-30 prefs: - PREF_H3 type: TYPE_NORMAL zh: 预训练模型[](#id3 "跳转到此标题") - en: '| [`WAV2VEC2_BASE`](generated/torchaudio.pipelines.WAV2VEC2_BASE.html#torchaudio.pipelines.WAV2VEC2_BASE "torchaudio.pipelines.WAV2VEC2_BASE") | Wav2vec 2.0 model ("base" architecture), pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), not fine-tuned. |' id: totrans-31 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_BASE`](generated/torchaudio.pipelines.WAV2VEC2_BASE.html#torchaudio.pipelines.WAV2VEC2_BASE "torchaudio.pipelines.WAV2VEC2_BASE") | Wav2vec 2.0 模型(“基础”架构),在 *LibriSpeech* 数据集的 960 小时未标记音频上进行预训练[[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. 
In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")]("train-clean-100"、"train-clean-360" 和 "train-other-500" 的组合),未进行微调。 |' - en: '| [`WAV2VEC2_LARGE`](generated/torchaudio.pipelines.WAV2VEC2_LARGE.html#torchaudio.pipelines.WAV2VEC2_LARGE "torchaudio.pipelines.WAV2VEC2_LARGE") | Wav2vec 2.0 model ("large" architecture), pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), not fine-tuned. |' id: totrans-32 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_LARGE`](generated/torchaudio.pipelines.WAV2VEC2_LARGE.html#torchaudio.pipelines.WAV2VEC2_LARGE "torchaudio.pipelines.WAV2VEC2_LARGE") | Wav2vec 2.0 模型(“大”架构),在 *LibriSpeech* 数据集的 960 小时未标记音频上进行预训练[[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")]("train-clean-100"、"train-clean-360" 和 "train-other-500" 的组合),未进行微调。 |' - en: '| [`WAV2VEC2_LARGE_LV60K`](generated/torchaudio.pipelines.WAV2VEC2_LARGE_LV60K.html#torchaudio.pipelines.WAV2VEC2_LARGE_LV60K "torchaudio.pipelines.WAV2VEC2_LARGE_LV60K") | Wav2vec 2.0 model ("large-lv60k" architecture), pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset [[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")], not fine-tuned. |' id: totrans-33 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_LARGE_LV60K`](generated/torchaudio.pipelines.WAV2VEC2_LARGE_LV60K.html#torchaudio.pipelines.WAV2VEC2_LARGE_LV60K "torchaudio.pipelines.WAV2VEC2_LARGE_LV60K") | Wav2vec 2.0 模型(“large-lv60k”架构),在 *Libri-Light* 数据集的 60,000 小时未标记音频上进行预训练,未进行微调。 |' - en: '| [`WAV2VEC2_XLSR53`](generated/torchaudio.pipelines.WAV2VEC2_XLSR53.html#torchaudio.pipelines.WAV2VEC2_XLSR53 "torchaudio.pipelines.WAV2VEC2_XLSR53") | Wav2vec 2.0 model ("base" architecture), pre-trained on 56,000 hours of unlabeled audio from multiple datasets ( *Multilingual LibriSpeech* [[Pratap *et al.*, 2020](references.html#id11 "Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. Mls: a large-scale multilingual dataset for speech research. Interspeech 2020, Oct 2020\. URL: http://dx.doi.org/10.21437/Interspeech.2020-2826, doi:10.21437/interspeech.2020-2826.")], *CommonVoice* [[Ardila *et al.*, 2020](references.html#id10 "Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 
Common voice: a massively-multilingual speech corpus. 2020\. arXiv:1912.06670.")] and *BABEL* [[Gales *et al.*, 2014](references.html#id9 "Mark John Francis Gales, Kate Knill, Anton Ragni, and Shakti Prasad Rath. Speech recognition and keyword spotting for low-resource languages: babel project research at cued. In SLTU. 2014.")]), not fine-tuned. |' id: totrans-34 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_XLSR53`](generated/torchaudio.pipelines.WAV2VEC2_XLSR53.html#torchaudio.pipelines.WAV2VEC2_XLSR53 "torchaudio.pipelines.WAV2VEC2_XLSR53") | Wav2vec 2.0 模型(“基础”架构),在多个数据集的 56,000 小时未标记音频上进行预训练(*多语言 LibriSpeech*,*CommonVoice* 和 *BABEL*),未进行微调。 |' - en: '| [`WAV2VEC2_XLSR_300M`](generated/torchaudio.pipelines.WAV2VEC2_XLSR_300M.html#torchaudio.pipelines.WAV2VEC2_XLSR_300M "torchaudio.pipelines.WAV2VEC2_XLSR_300M") | XLS-R model with 300 million parameters, pre-trained on 436,000 hours of unlabeled audio from multiple datasets ( *Multilingual LibriSpeech* [[Pratap *et al.*, 2020](references.html#id11 "Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. Mls: a large-scale multilingual dataset for speech research. Interspeech 2020, Oct 2020\. URL: http://dx.doi.org/10.21437/Interspeech.2020-2826, doi:10.21437/interspeech.2020-2826.")], *CommonVoice* [[Ardila *et al.*, 2020](references.html#id10 "Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common voice: a massively-multilingual speech corpus. 2020\. arXiv:1912.06670.")], *VoxLingua107* [[Valk and Alumäe, 2021](references.html#id61 "Jörgen Valk and Tanel Alumäe. Voxlingua107: a dataset for spoken language recognition. In 2021 IEEE Spoken Language Technology Workshop (SLT), 652–658\. IEEE, 2021.")], *BABEL* [[Gales *et al.*, 2014](references.html#id9 "Mark John Francis Gales, Kate Knill, Anton Ragni, and Shakti Prasad Rath. Speech recognition and keyword spotting for low-resource languages: babel project research at cued. In SLTU. 2014.")], and *VoxPopuli* [[Wang *et al.*, 2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. CoRR, 2021\. URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390.")]) in 128 languages, not fine-tuned. |' id: totrans-35 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_XLSR_300M`](generated/torchaudio.pipelines.WAV2VEC2_XLSR_300M.html#torchaudio.pipelines.WAV2VEC2_XLSR_300M "torchaudio.pipelines.WAV2VEC2_XLSR_300M") | XLS-R 模型,具有 3 亿个参数,在多个数据集的 436,000 小时未标记音频上进行预训练(*多语言 LibriSpeech*,*CommonVoice*,*VoxLingua107*,*BABEL* 和 *VoxPopuli*)涵盖 128 种语言,未进行微调。 |' - en: '| [`WAV2VEC2_XLSR_1B`](generated/torchaudio.pipelines.WAV2VEC2_XLSR_1B.html#torchaudio.pipelines.WAV2VEC2_XLSR_1B "torchaudio.pipelines.WAV2VEC2_XLSR_1B") | XLS-R model with 1 billion parameters, pre-trained on 436,000 hours of unlabeled audio from multiple datasets ( *Multilingual LibriSpeech* [[Pratap *et al.*, 2020](references.html#id11 "Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. Mls: a large-scale multilingual dataset for speech research. Interspeech 2020, Oct 2020\. 
URL: http://dx.doi.org/10.21437/Interspeech.2020-2826, doi:10.21437/interspeech.2020-2826.")], *CommonVoice* [[Ardila *et al.*, 2020](references.html#id10 "Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common voice: a massively-multilingual speech corpus. 2020\. arXiv:1912.06670.")], *VoxLingua107* [[Valk and Alumäe, 2021](references.html#id61 "Jörgen Valk and Tanel Alumäe. Voxlingua107: a dataset for spoken language recognition. In 2021 IEEE Spoken Language Technology Workshop (SLT), 652–658\. IEEE, 2021.")], *BABEL* [[Gales *et al.*, 2014](references.html#id9 "Mark John Francis Gales, Kate Knill, Anton Ragni, and Shakti Prasad Rath. Speech recognition and keyword spotting for low-resource languages: babel project research at cued. In SLTU. 2014.")], and *VoxPopuli* [[Wang *et al.*, 2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. CoRR, 2021\. URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390.")]) in 128 languages, not fine-tuned. |' id: totrans-36 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_XLSR_1B`](generated/torchaudio.pipelines.WAV2VEC2_XLSR_1B.html#torchaudio.pipelines.WAV2VEC2_XLSR_1B "torchaudio.pipelines.WAV2VEC2_XLSR_1B") | XLS-R 模型,具有 10 亿个参数,在多个数据集的 436,000 小时未标记音频上进行了预训练(*多语言 LibriSpeech*,*CommonVoice*,*VoxLingua107*,*BABEL* 和 *VoxPopuli*)共 128 种语言,未进行微调。|' - en: '| [`WAV2VEC2_XLSR_2B`](generated/torchaudio.pipelines.WAV2VEC2_XLSR_2B.html#torchaudio.pipelines.WAV2VEC2_XLSR_2B "torchaudio.pipelines.WAV2VEC2_XLSR_2B") | XLS-R model with 2 billion parameters, pre-trained on 436,000 hours of unlabeled audio from multiple datasets ( *Multilingual LibriSpeech* [[Pratap *et al.*, 2020](references.html#id11 "Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. Mls: a large-scale multilingual dataset for speech research. Interspeech 2020, Oct 2020\. URL: http://dx.doi.org/10.21437/Interspeech.2020-2826, doi:10.21437/interspeech.2020-2826.")], *CommonVoice* [[Ardila *et al.*, 2020](references.html#id10 "Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common voice: a massively-multilingual speech corpus. 2020\. arXiv:1912.06670.")], *VoxLingua107* [[Valk and Alumäe, 2021](references.html#id61 "Jörgen Valk and Tanel Alumäe. Voxlingua107: a dataset for spoken language recognition. In 2021 IEEE Spoken Language Technology Workshop (SLT), 652–658\. IEEE, 2021.")], *BABEL* [[Gales *et al.*, 2014](references.html#id9 "Mark John Francis Gales, Kate Knill, Anton Ragni, and Shakti Prasad Rath. Speech recognition and keyword spotting for low-resource languages: babel project research at cued. In SLTU. 2014.")], and *VoxPopuli* [[Wang *et al.*, 2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. CoRR, 2021\. URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390.")]) in 128 languages, not fine-tuned. 
|' id: totrans-37 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_XLSR_2B`](generated/torchaudio.pipelines.WAV2VEC2_XLSR_2B.html#torchaudio.pipelines.WAV2VEC2_XLSR_2B "torchaudio.pipelines.WAV2VEC2_XLSR_2B") | XLS-R 模型,具有 20 亿个参数,在多个数据集的 436,000 小时未标记音频上进行了预训练(*多语言 LibriSpeech*,*CommonVoice*,*VoxLingua107*,*BABEL* 和 *VoxPopuli*)共 128 种语言,未进行微调。|' - en: '| [`HUBERT_BASE`](generated/torchaudio.pipelines.HUBERT_BASE.html#torchaudio.pipelines.HUBERT_BASE "torchaudio.pipelines.HUBERT_BASE") | HuBERT model ("base" architecture), pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), not fine-tuned. |' id: totrans-38 prefs: [] type: TYPE_TB zh: '| [`HUBERT_BASE`](generated/torchaudio.pipelines.HUBERT_BASE.html#torchaudio.pipelines.HUBERT_BASE "torchaudio.pipelines.HUBERT_BASE") | HuBERT模型(“基础”架构),在*LibriSpeech*数据集的960小时未标记音频上进行预训练[[Panayotov等人,2015年](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210. 2015. doi:10.1109/ICASSP.2015.7178964.")](包括“train-clean-100”,“train-clean-360”和“train-other-500”),未进行微调。|' - en: '| [`HUBERT_LARGE`](generated/torchaudio.pipelines.HUBERT_LARGE.html#torchaudio.pipelines.HUBERT_LARGE "torchaudio.pipelines.HUBERT_LARGE") | HuBERT model ("large" architecture), pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset [[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")], not fine-tuned. |' id: totrans-39 prefs: [] type: TYPE_TB zh: '| [`HUBERT_LARGE`](generated/torchaudio.pipelines.HUBERT_LARGE.html#torchaudio.pipelines.HUBERT_LARGE "torchaudio.pipelines.HUBERT_LARGE") | HuBERT模型(“大”架构),在*Libri-Light*数据集的60,000小时未标记音频上进行预训练[[Kahn等人,2020年](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673. 2020. \url https://github.com/facebookresearch/libri-light.")],未进行微调。|' - en: '| [`HUBERT_XLARGE`](generated/torchaudio.pipelines.HUBERT_XLARGE.html#torchaudio.pipelines.HUBERT_XLARGE "torchaudio.pipelines.HUBERT_XLARGE") | HuBERT model ("extra large" architecture), pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset [[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. 
Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")], not fine-tuned. |' id: totrans-40 prefs: [] type: TYPE_TB zh: '| [`HUBERT_XLARGE`](generated/torchaudio.pipelines.HUBERT_XLARGE.html#torchaudio.pipelines.HUBERT_XLARGE "torchaudio.pipelines.HUBERT_XLARGE") | HuBERT模型(“超大”架构),在*Libri-Light*数据集的60,000小时未标记音频上进行预训练[[Kahn等人,2020年](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673. 2020. \url https://github.com/facebookresearch/libri-light.")],未进行微调。|' - en: '| [`WAVLM_BASE`](generated/torchaudio.pipelines.WAVLM_BASE.html#torchaudio.pipelines.WAVLM_BASE "torchaudio.pipelines.WAVLM_BASE") | WavLM Base model ("base" architecture), pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")], not fine-tuned. |' id: totrans-41 prefs: [] type: TYPE_TB zh: '| [`WAVLM_BASE`](generated/torchaudio.pipelines.WAVLM_BASE.html#torchaudio.pipelines.WAVLM_BASE "torchaudio.pipelines.WAVLM_BASE") | WavLM基础模型(“基础”架构),在*LibriSpeech*数据集的960小时未标记音频上进行预训练[[Panayotov等人,2015年](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210. 2015. doi:10.1109/ICASSP.2015.7178964.")],未进行微调。|' - en: '| [`WAVLM_BASE_PLUS`](generated/torchaudio.pipelines.WAVLM_BASE_PLUS.html#torchaudio.pipelines.WAVLM_BASE_PLUS "torchaudio.pipelines.WAVLM_BASE_PLUS") | WavLM Base+ model ("base" architecture), pre-trained on 60,000 hours of Libri-Light dataset [[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")], 10,000 hours of GigaSpeech [[Chen *et al.*, 2021](references.html#id56 "Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, and Zhiyong Yan. Gigaspeech: an evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. In Proc. Interspeech 2021\. 
2021.")], and 24,000 hours of *VoxPopuli* [[Wang *et al.*, 2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. CoRR, 2021\. URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390.")], not fine-tuned. |' id: totrans-42 prefs: [] type: TYPE_TB zh: '| [`WAVLM_BASE_PLUS`](generated/torchaudio.pipelines.WAVLM_BASE_PLUS.html#torchaudio.pipelines.WAVLM_BASE_PLUS "torchaudio.pipelines.WAVLM_BASE_PLUS") | WavLM 基础+ 模型("base" 架构),在 60,000 小时的 Libri-Light 数据集上进行了预训练[[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")],10,000 小时的 GigaSpeech[[Chen *et al.*, 2021](references.html#id56 "Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, and Zhiyong Yan. Gigaspeech: an evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. In Proc. Interspeech 2021\. 2021.")],以及 24,000 小时的 *VoxPopuli*[[Wang *et al.*, 2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. CoRR, 2021\. URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390.")],未进行微调。|' - en: '| [`WAVLM_LARGE`](generated/torchaudio.pipelines.WAVLM_LARGE.html#torchaudio.pipelines.WAVLM_LARGE "torchaudio.pipelines.WAVLM_LARGE") | WavLM Large model ("large" architecture), pre-trained on 60,000 hours of Libri-Light dataset [[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")], 10,000 hours of GigaSpeech [[Chen *et al.*, 2021](references.html#id56 "Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, and Zhiyong Yan. Gigaspeech: an evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. In Proc. Interspeech 2021\. 2021.")], and 24,000 hours of *VoxPopuli* [[Wang *et al.*, 2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. 
Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. CoRR, 2021\. URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390.")], not fine-tuned. |' id: totrans-43 prefs: [] type: TYPE_TB zh: '| [`WAVLM_LARGE`](generated/torchaudio.pipelines.WAVLM_LARGE.html#torchaudio.pipelines.WAVLM_LARGE "torchaudio.pipelines.WAVLM_LARGE") | WavLM 大型模型("large" 架构),在 60,000 小时的 Libri-Light 数据集上进行了预训练[[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")],10,000 小时的 GigaSpeech[[Chen *et al.*, 2021](references.html#id56 "Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, and Zhiyong Yan. Gigaspeech: an evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. In Proc. Interspeech 2021\. 2021.")],以及 24,000 小时的 *VoxPopuli*[[Wang *et al.*, 2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. CoRR, 2021\. URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390.")],未进行微调。|' - en: wav2vec 2.0 / HuBERT - Fine-tuned ASR[](#wav2vec-2-0-hubert-fine-tuned-asr "Permalink to this heading") id: totrans-44 prefs: - PREF_H2 type: TYPE_NORMAL zh: wav2vec 2.0 / HuBERT - 微调 ASR[](#wav2vec-2-0-hubert-fine-tuned-asr "Permalink to this heading") - en: Interface[](#id35 "Permalink to this heading") id: totrans-45 prefs: - PREF_H3 type: TYPE_NORMAL zh: 接口[](#id35 "Permalink to this heading") - en: '`Wav2Vec2ASRBundle` instantiates models that generate probability distributions over pre-defined labels, which can be used for ASR.' id: totrans-46 prefs: [] type: TYPE_NORMAL zh: '`Wav2Vec2ASRBundle` 实例化了生成预定义标签上的概率分布的模型,可用于 ASR。' - en: '![https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2asrbundle.png](../Images/5f9b45dac675bb2cb840209162a85158.png)' id: totrans-47 prefs: [] type: TYPE_IMG zh: '![https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2asrbundle.png](../Images/5f9b45dac675bb2cb840209162a85158.png)' - en: '| [`Wav2Vec2ASRBundle`](generated/torchaudio.pipelines.Wav2Vec2ASRBundle.html#torchaudio.pipelines.Wav2Vec2ASRBundle "torchaudio.pipelines.Wav2Vec2ASRBundle") | Data class that bundles associated information to use pretrained [`Wav2Vec2Model`](generated/torchaudio.models.Wav2Vec2Model.html#torchaudio.models.Wav2Vec2Model "torchaudio.models.Wav2Vec2Model").
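A minimal sketch of how such a bundle can be combined with greedy CTC decoding; the audio path is a placeholder, and the tutorials listed below cover beam-search decoding with a lexicon and language model:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()  # CTC label set; index 0 is the blank token, "|" marks word boundaries

waveform, sample_rate = torchaudio.load("speech.wav")  # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    emission, _ = model(waveform)  # (batch, frame, num_labels) scores over the label set

indices = torch.unique_consecutive(emission[0].argmax(dim=-1))        # greedy pick, collapse repeats
transcript = "".join(labels[int(i)] for i in indices if int(i) != 0)  # drop CTC blanks
print(transcript.replace("|", " "))
```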
|' id: totrans-48 prefs: [] type: TYPE_TB zh: '| [`Wav2Vec2ASRBundle`](generated/torchaudio.pipelines.Wav2Vec2ASRBundle.html#torchaudio.pipelines.Wav2Vec2ASRBundle "torchaudio.pipelines.Wav2Vec2ASRBundle") | 数据类,捆绑了与预训练的 [`Wav2Vec2Model`](generated/torchaudio.models.Wav2Vec2Model.html#torchaudio.models.Wav2Vec2Model "torchaudio.models.Wav2Vec2Model") 相关的信息。 |' - en: Tutorials using `Wav2Vec2ASRBundle` id: totrans-49 prefs: [] type: TYPE_NORMAL zh: 使用 `Wav2Vec2ASRBundle` 的教程 - en: '![Speech Recognition with Wav2Vec2](../Images/a6aefab61852740b8a11d3cfd1ac6866.png)' id: totrans-50 prefs: [] type: TYPE_IMG zh: '![使用 Wav2Vec2 进行语音识别](../Images/a6aefab61852740b8a11d3cfd1ac6866.png)' - en: '[Speech Recognition with Wav2Vec2](tutorials/speech_recognition_pipeline_tutorial.html#sphx-glr-tutorials-speech-recognition-pipeline-tutorial-py)' id: totrans-51 prefs: [] type: TYPE_NORMAL zh: '[使用 Wav2Vec2 进行语音识别](tutorials/speech_recognition_pipeline_tutorial.html#sphx-glr-tutorials-speech-recognition-pipeline-tutorial-py)' - en: Speech Recognition with Wav2Vec2![ASR Inference with CTC Decoder](../Images/260e63239576cae8ee00cfcba8e4889e.png) id: totrans-52 prefs: [] type: TYPE_NORMAL zh: 使用 Wav2Vec2 进行语音识别![CTC 解码器进行 ASR 推断](../Images/260e63239576cae8ee00cfcba8e4889e.png) - en: '[ASR Inference with CTC Decoder](tutorials/asr_inference_with_ctc_decoder_tutorial.html#sphx-glr-tutorials-asr-inference-with-ctc-decoder-tutorial-py)' id: totrans-53 prefs: [] type: TYPE_NORMAL zh: '[CTC 解码器进行 ASR 推断](tutorials/asr_inference_with_ctc_decoder_tutorial.html#sphx-glr-tutorials-asr-inference-with-ctc-decoder-tutorial-py)' - en: ASR Inference with CTC Decoder![Forced Alignment with Wav2Vec2](../Images/6658c9fe256ea584e84432cc92cd4db9.png) id: totrans-54 prefs: [] type: TYPE_NORMAL zh: CTC 解码器进行 ASR 推断![使用 Wav2Vec2 进行强制对齐](../Images/6658c9fe256ea584e84432cc92cd4db9.png) - en: '[Forced Alignment with Wav2Vec2](tutorials/forced_alignment_tutorial.html#sphx-glr-tutorials-forced-alignment-tutorial-py)' id: totrans-55 prefs: [] type: TYPE_NORMAL zh: '[使用 Wav2Vec2 进行强制对齐](tutorials/forced_alignment_tutorial.html#sphx-glr-tutorials-forced-alignment-tutorial-py)' - en: Forced Alignment with Wav2Vec2 id: totrans-56 prefs: [] type: TYPE_NORMAL zh: 使用 Wav2Vec2 进行强制对齐 - en: Pretrained Models[](#id36 "Permalink to this heading") id: totrans-57 prefs: - PREF_H3 type: TYPE_NORMAL zh: 预训练模型[](#id36 "跳转到此标题的永久链接") - en: '| [`WAV2VEC2_ASR_BASE_10M`](generated/torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M.html#torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M "torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M") | Wav2vec 2.0 model ("base" architecture with an extra linear module), pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and fine-tuned for ASR on 10 minutes of transcribed audio from *Libri-Light* dataset [[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. 
In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")] ("train-10min" subset). |' id: totrans-58 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_ASR_BASE_10M`](generated/torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M.html#torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M "torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M") | Wav2vec 2.0 模型(带有额外线性模块的“基础”架构),在 *LibriSpeech* 数据集的 960 小时未标记音频上进行预训练[[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")](由 "train-clean-100"、"train-clean-360" 和 "train-other-500" 组成),并在 *Libri-Light* 数据集的 10 分钟转录音频上进行了 ASR 微调[[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")]("train-10min" 子集)。 |' - en: '| [`WAV2VEC2_ASR_BASE_100H`](generated/torchaudio.pipelines.WAV2VEC2_ASR_BASE_100H.html#torchaudio.pipelines.WAV2VEC2_ASR_BASE_100H "torchaudio.pipelines.WAV2VEC2_ASR_BASE_100H") | Wav2vec 2.0 model ("base" architecture with an extra linear module), pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and fine-tuned for ASR on 100 hours of transcribed audio from "train-clean-100" subset. |' id: totrans-59 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_ASR_BASE_100H`](generated/torchaudio.pipelines.WAV2VEC2_ASR_BASE_100H.html#torchaudio.pipelines.WAV2VEC2_ASR_BASE_100H "torchaudio.pipelines.WAV2VEC2_ASR_BASE_100H") | Wav2vec 2.0 模型(带有额外线性模块的“基础”架构),在 *LibriSpeech* 数据集的 960 小时未标记音频上进行预训练[[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")](由 "train-clean-100"、"train-clean-360" 和 "train-other-500" 组成),并在 "train-clean-100" 子集的 100 小时转录音频上进行了 ASR 微调。 |' - en: '| [`WAV2VEC2_ASR_BASE_960H`](generated/torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H.html#torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H "torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H") | Wav2vec 2.0 model ("base" architecture with an extra linear module), pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. 
In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and fine-tuned for ASR on the same audio with the corresponding transcripts. |' id: totrans-60 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_ASR_BASE_960H`](generated/torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H.html#torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H "torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H") | Wav2vec 2.0 模型("base" 架构,带有额外的线性模块),在 *LibriSpeech* 数据集的 960 小时未标记音频上进行预训练[[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")]("train-clean-100"、"train-clean-360" 和 "train-other-500" 的组合),并在相同音频上与相应的转录进行了 ASR 微调。 |' - en: '| [`WAV2VEC2_ASR_LARGE_10M`](generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_10M.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_10M "torchaudio.pipelines.WAV2VEC2_ASR_LARGE_10M") | Wav2vec 2.0 model ("large" architecture with an extra linear module), pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and fine-tuned for ASR on 10 minutes of transcribed audio from *Libri-Light* dataset [[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")] ("train-10min" subset). |' id: totrans-61 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_ASR_LARGE_10M`](generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_10M.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_10M "torchaudio.pipelines.WAV2VEC2_ASR_LARGE_10M") | Wav2vec 2.0 模型("large" 架构,带有额外的线性模块),在 *LibriSpeech* 数据集的 960 小时未标记音频上进行预训练[[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")]("train-clean-100"、"train-clean-360" 和 "train-other-500" 的组合),并在 *Libri-Light* 数据集的 10 分钟转录音频上进行了 ASR 微调[[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. 
\url https://github.com/facebookresearch/libri-light.")]("train-10min" 子集)。 |' - en: '| [`WAV2VEC2_ASR_LARGE_100H`](generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_100H.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_100H "torchaudio.pipelines.WAV2VEC2_ASR_LARGE_100H") | Wav2vec 2.0 model ("large" architecture with an extra linear module), pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and fine-tuned for ASR on 100 hours of transcribed audio from the same dataset ("train-clean-100" subset). |' id: totrans-62 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_ASR_LARGE_100H`](generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_100H.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_100H "torchaudio.pipelines.WAV2VEC2_ASR_LARGE_100H") | Wav2vec 2.0 模型("large" 架构,带有额外的线性模块),在 *LibriSpeech* 数据集的 960 小时未标记音频上进行预训练[[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")]("train-clean-100"、"train-clean-360" 和 "train-other-500" 的组合),并在相同数据集的 100 小时转录音频上进行了 ASR 微调("train-clean-100" 子集)。 |' - en: '| [`WAV2VEC2_ASR_LARGE_960H`](generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_960H.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_960H "torchaudio.pipelines.WAV2VEC2_ASR_LARGE_960H") | Wav2vec 2.0 model ("large" architecture with an extra linear module), pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and fine-tuned for ASR on the same audio with the corresponding transcripts. |' id: totrans-63 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_ASR_LARGE_960H`](generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_960H.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_960H "torchaudio.pipelines.WAV2VEC2_ASR_LARGE_960H") | Wav2vec 2.0 模型("large" 架构,带有额外的线性模块),在 *LibriSpeech* 数据集的 960 小时未标记音频上进行预训练[[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. 
doi:10.1109/ICASSP.2015.7178964.")]("train-clean-100"、"train-clean-360" 和 "train-other-500" 的组合),并在相同音频上与相应的转录进行了 ASR 微调。 |' - en: '| [`WAV2VEC2_ASR_LARGE_LV60K_10M`](generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_10M.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_10M "torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_10M") | Wav2vec 2.0 model ("large-lv60k" architecture with an extra linear module), pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset [[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")], and fine-tuned for ASR on 10 minutes of transcribed audio from the same dataset ("train-10min" subset). |' id: totrans-64 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_ASR_LARGE_LV60K_10M`](generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_10M.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_10M "torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_10M") | Wav2vec 2.0 模型("large-lv60k" 架构,带有额外的线性模块),在 *Libri-Light* 数据集的 60,000 小时未标记音频上进行预训练[[Kahn 等人,2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")],并在相同数据集的经过转录的音频上进行了 ASR 的微调("train-10min" 子集)。 |' - en: '| [`WAV2VEC2_ASR_LARGE_LV60K_100H`](generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_100H.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_100H "torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_100H") | Wav2vec 2.0 model ("large-lv60k" architecture with an extra linear module), pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset [[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")], and fine-tuned for ASR on 100 hours of transcribed audio from *LibriSpeech* dataset [[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")] ("train-clean-100" subset). 
|' id: totrans-65 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_ASR_LARGE_LV60K_100H`](generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_100H.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_100H "torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_100H") | Wav2vec 2.0 模型("large-lv60k" 架构,带有额外的线性模块),在 *Libri-Light* 数据集的 60,000 小时未标记音频上进行预训练[[Kahn 等人,2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")],并在 *LibriSpeech* 数据集的经过转录的音频上进行了 ASR 的微调,微调时长为 100 小时[[Panayotov 等人,2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")]("train-clean-100" 子集)。 |' - en: '| [`WAV2VEC2_ASR_LARGE_LV60K_960H`](generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_960H.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_960H "torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_960H") | Wav2vec 2.0 model ("large-lv60k" architecture with an extra linear module), pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* [[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")] dataset, and fine-tuned for ASR on 960 hours of transcribed audio from *LibriSpeech* dataset [[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"). |' id: totrans-66 prefs: [] type: TYPE_TB zh: '| [`WAV2VEC2_ASR_LARGE_LV60K_960H`](generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_960H.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_960H "torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_960H") | Wav2vec 2.0 模型("large-lv60k" 架构,带有额外的线性模块),在 *Libri-Light* 数据集的 60,000 小时未标记音频上进行预训练[[Kahn 等人,2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")] 数据集,并在 *LibriSpeech* 数据集的经过转录的音频上进行了 ASR 的微调,微调时长为 960 小时[[Panayotov 等人,2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 
Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")]("train-clean-100"、"train-clean-360" 和 "train-other-500" 的组合)。 |' - en: '| [`VOXPOPULI_ASR_BASE_10K_DE`](generated/torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_DE.html#torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_DE "torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_DE") | wav2vec 2.0 model ("base" architecture), pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [[Wang *et al.*, 2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. CoRR, 2021\. URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390.")] ("10k" subset, consisting of 23 languages), and fine-tuned for ASR on 282 hours of transcribed audio from "de" subset. |' id: totrans-67 prefs: [] type: TYPE_TB zh: '| [`VOXPOPULI_ASR_BASE_10K_DE`](generated/torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_DE.html#torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_DE "torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_DE") | wav2vec 2.0 模型(“基础”架构),在 *VoxPopuli* 数据集的 10k 小时未标记音频上进行预训练[[Wang 等人,2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, 和 Emmanuel Dupoux. Voxpopuli: 用于表示学习、半监督学习和解释的大规模多语言语音语料库。CoRR,2021。URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390。")](由 23 种语言组成的“10k”子集),并在来自“de”子集的 282 小时转录音频上进行了 ASR 微调。|' - en: '| [`VOXPOPULI_ASR_BASE_10K_EN`](generated/torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_EN.html#torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_EN "torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_EN") | wav2vec 2.0 model ("base" architecture), pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [[Wang *et al.*, 2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. CoRR, 2021\. URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390.")] ("10k" subset, consisting of 23 languages), and fine-tuned for ASR on 543 hours of transcribed audio from "en" subset. |' id: totrans-68 prefs: [] type: TYPE_TB zh: '| [`VOXPOPULI_ASR_BASE_10K_EN`](generated/torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_EN.html#torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_EN "torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_EN") | wav2vec 2.0 模型(“基础”架构),在 *VoxPopuli* 数据集的 10k 小时未标记音频上进行预训练[[Wang 等人,2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, 和 Emmanuel Dupoux. 
Voxpopuli: 用于表示学习、半监督学习和解释的大规模多语言语音语料库。CoRR,2021。URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390。")](由 23 种语言组成的“10k”子集),并在来自“en”子集的 543 小时转录音频上进行了 ASR 微调。|' - en: '| [`VOXPOPULI_ASR_BASE_10K_ES`](generated/torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_ES.html#torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_ES "torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_ES") | wav2vec 2.0 model ("base" architecture), pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [[Wang *et al.*, 2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. CoRR, 2021\. URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390.")] ("10k" subset, consisting of 23 languages), and fine-tuned for ASR on 166 hours of transcribed audio from "es" subset. |' id: totrans-69 prefs: [] type: TYPE_TB zh: '| [`VOXPOPULI_ASR_BASE_10K_ES`](generated/torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_ES.html#torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_ES "torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_ES") | wav2vec 2.0 模型(“基础”架构),在 *VoxPopuli* 数据集的 10k 小时未标记音频上进行预训练[[Wang 等人,2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, 和 Emmanuel Dupoux. Voxpopuli: 用于表示学习、半监督学习和解释的大规模多语言语音语料库。CoRR,2021。URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390。")](由 23 种语言组成的“10k”子集),并在来自“es”子集的 166 小时转录音频上进行了 ASR 微调。|' - en: '| [`VOXPOPULI_ASR_BASE_10K_FR`](generated/torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR.html#torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR "torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR") | wav2vec 2.0 model ("base" architecture), pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [[Wang *et al.*, 2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. CoRR, 2021\. URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390.")] ("10k" subset, consisting of 23 languages), and fine-tuned for ASR on 211 hours of transcribed audio from "fr" subset. |' id: totrans-70 prefs: [] type: TYPE_TB zh: '| [`VOXPOPULI_ASR_BASE_10K_FR`](generated/torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR.html#torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR "torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR") | wav2vec 2.0 模型(“基础”架构),在 *VoxPopuli* 数据集的 10k 小时未标记音频上进行预训练[[Wang 等人,2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, 和 Emmanuel Dupoux. 
Voxpopuli: 用于表示学习、半监督学习和解释的大规模多语言语音语料库。CoRR,2021。URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390。")](由 23 种语言组成的“10k”子集),并在来自“fr”子集的 211 小时转录音频上进行了 ASR 微调。|' - en: '| [`VOXPOPULI_ASR_BASE_10K_IT`](generated/torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_IT.html#torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_IT "torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_IT") | wav2vec 2.0 model ("base" architecture), pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [[Wang *et al.*, 2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. CoRR, 2021\. URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390.")] ("10k" subset, consisting of 23 languages), and fine-tuned for ASR on 91 hours of transcribed audio from "it" subset. |' id: totrans-71 prefs: [] type: TYPE_TB zh: '| [`VOXPOPULI_ASR_BASE_10K_IT`](generated/torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_IT.html#torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_IT "torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_IT") | wav2vec 2.0 模型(“base” 架构),在 *VoxPopuli* 数据集的 10,000 小时未标记音频上进行预训练[[Wang *et al.*, 2021](references.html#id5 "Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. CoRR, 2021\. URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390.")](由 23 种语言组成的“10k”子集),并在来自“it”子集的 91 小时转录音频上进行了 ASR 微调。 |' - en: '| [`HUBERT_ASR_LARGE`](generated/torchaudio.pipelines.HUBERT_ASR_LARGE.html#torchaudio.pipelines.HUBERT_ASR_LARGE "torchaudio.pipelines.HUBERT_ASR_LARGE") | HuBERT model ("large" architecture), pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset [[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")], and fine-tuned for ASR on 960 hours of transcribed audio from *LibriSpeech* dataset [[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"). |' id: totrans-72 prefs: [] type: TYPE_TB zh: '| [`HUBERT_ASR_LARGE`](generated/torchaudio.pipelines.HUBERT_ASR_LARGE.html#torchaudio.pipelines.HUBERT_ASR_LARGE "torchaudio.pipelines.HUBERT_ASR_LARGE") | HuBERT 模型(“large” 架构),在 *Libri-Light* 数据集的 60,000 小时未标记音频上进行预训练[[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. 
In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")], 并在来自 *LibriSpeech* 数据集的 960 小时转录音频上进行了 ASR 微调(由 "train-clean-100", "train-clean-360", 和 "train-other-500" 组成)。 |' - en: '| [`HUBERT_ASR_XLARGE`](generated/torchaudio.pipelines.HUBERT_ASR_XLARGE.html#torchaudio.pipelines.HUBERT_ASR_XLARGE "torchaudio.pipelines.HUBERT_ASR_XLARGE") | HuBERT model ("extra large" architecture), pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset [[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")], and fine-tuned for ASR on 960 hours of transcribed audio from *LibriSpeech* dataset [[Panayotov *et al.*, 2015](references.html#id13 "Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 5206-5210\. 2015\. doi:10.1109/ICASSP.2015.7178964.")] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"). |' id: totrans-73 prefs: [] type: TYPE_TB zh: '| [`HUBERT_ASR_XLARGE`](generated/torchaudio.pipelines.HUBERT_ASR_XLARGE.html#torchaudio.pipelines.HUBERT_ASR_XLARGE "torchaudio.pipelines.HUBERT_ASR_XLARGE") | HuBERT 模型(“extra large” 架构),在 *Libri-Light* 数据集的 60,000 小时未标记音频上进行预训练[[Kahn *et al.*, 2020](references.html#id12 "J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669-7673\. 2020\. \url https://github.com/facebookresearch/libri-light.")], 并在来自 *LibriSpeech* 数据集的 960 小时转录音频上进行了 ASR 微调(由 "train-clean-100", "train-clean-360", 和 "train-other-500" 组成)。 |' - en: wav2vec 2.0 / HuBERT - Forced Alignment[](#wav2vec-2-0-hubert-forced-alignment "Permalink to this heading") id: totrans-74 prefs: - PREF_H2 type: TYPE_NORMAL zh: wav2vec 2.0 / HuBERT - 强制对齐[](#wav2vec-2-0-hubert-forced-alignment "Permalink to this heading") - en: Interface[](#id59 "Permalink to this heading") id: totrans-75 prefs: - PREF_H3 type: TYPE_NORMAL zh: 接口[](#id59 "Permalink to this heading") - en: '`Wav2Vec2FABundle` bundles a pre-trained model and its associated dictionary. Additionally, it supports appending a `star` token dimension.'
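As a rough sketch of how this interface is used in practice (mirroring the pattern of the forced-alignment tutorials referenced below), the `MMS_FA` bundle listed under Pretrained Models below can drive all three stages. The audio path and transcript here are illustrative placeholders, not part of the original documentation:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.MMS_FA
model = bundle.get_model()          # acoustic model producing frame-wise emissions
tokenizer = bundle.get_tokenizer()  # maps words to token IDs from the bundled dictionary
aligner = bundle.get_aligner()      # aligns the emissions against the token sequence

# Placeholder audio file; resample to the rate the bundle expects (16 kHz).
waveform, sr = torchaudio.load("speech.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

# Placeholder transcript, given as a list of words.
transcript = "i had that curiosity beside me at this moment".split()

with torch.inference_mode():
    emission, _ = model(waveform)
    token_spans = aligner(emission[0], tokenizer(transcript))  # one list of TokenSpan per word
```

Each returned span carries frame indices; they can be mapped back to time positions using the ratio between the waveform length and the number of emission frames.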
id: totrans-76 prefs: [] type: TYPE_NORMAL zh: '`Wav2Vec2FABundle` 包含预训练模型及其相关字典。此外,它支持附加 `star` 标记维度。' - en: '![https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2fabundle.png](../Images/81159a1c90b6bf1cc96789ecb75c13f0.png)' id: totrans-77 prefs: [] type: TYPE_IMG zh: '![https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2fabundle.png](../Images/81159a1c90b6bf1cc96789ecb75c13f0.png)' - en: '| [`Wav2Vec2FABundle`](generated/torchaudio.pipelines.Wav2Vec2FABundle.html#torchaudio.pipelines.Wav2Vec2FABundle "torchaudio.pipelines.Wav2Vec2FABundle") | Data class that bundles associated information to use pretrained [`Wav2Vec2Model`](generated/torchaudio.models.Wav2Vec2Model.html#torchaudio.models.Wav2Vec2Model "torchaudio.models.Wav2Vec2Model") for forced alignment. |' id: totrans-78 prefs: [] type: TYPE_TB zh: '| [`Wav2Vec2FABundle`](generated/torchaudio.pipelines.Wav2Vec2FABundle.html#torchaudio.pipelines.Wav2Vec2FABundle "torchaudio.pipelines.Wav2Vec2FABundle") | 数据类,捆绑了与预训练的[`Wav2Vec2Model`](generated/torchaudio.models.Wav2Vec2Model.html#torchaudio.models.Wav2Vec2Model "torchaudio.models.Wav2Vec2Model")用于强制对齐的相关信息。 |' - en: '| [`Wav2Vec2FABundle.Tokenizer`](generated/torchaudio.pipelines.Wav2Vec2FABundle.Tokenizer.html#torchaudio.pipelines.Wav2Vec2FABundle.Tokenizer "torchaudio.pipelines.Wav2Vec2FABundle.Tokenizer") | Interface of the tokenizer |' id: totrans-79 prefs: [] type: TYPE_TB zh: '| [`Wav2Vec2FABundle.Tokenizer`](generated/torchaudio.pipelines.Wav2Vec2FABundle.Tokenizer.html#torchaudio.pipelines.Wav2Vec2FABundle.Tokenizer "torchaudio.pipelines.Wav2Vec2FABundle.Tokenizer") | 分词器的接口 |' - en: '| [`Wav2Vec2FABundle.Aligner`](generated/torchaudio.pipelines.Wav2Vec2FABundle.Aligner.html#torchaudio.pipelines.Wav2Vec2FABundle.Aligner "torchaudio.pipelines.Wav2Vec2FABundle.Aligner") | Interface of the aligner |' id: totrans-80 prefs: [] type: TYPE_TB zh: '| [`Wav2Vec2FABundle.Aligner`](generated/torchaudio.pipelines.Wav2Vec2FABundle.Aligner.html#torchaudio.pipelines.Wav2Vec2FABundle.Aligner "torchaudio.pipelines.Wav2Vec2FABundle.Aligner") | 对齐器的接口 |' - en: Tutorials using `Wav2Vec2FABundle` id: totrans-81 prefs: [] type: TYPE_NORMAL zh: 使用`Wav2Vec2FABundle`的教程 - en: '![CTC forced alignment API tutorial](../Images/644afa8c7cc662a8465d389ef96d587c.png)' id: totrans-82 prefs: [] type: TYPE_IMG zh: '![CTC强制对齐API教程](../Images/644afa8c7cc662a8465d389ef96d587c.png)' - en: '[CTC forced alignment API tutorial](tutorials/ctc_forced_alignment_api_tutorial.html#sphx-glr-tutorials-ctc-forced-alignment-api-tutorial-py)' id: totrans-83 prefs: [] type: TYPE_NORMAL zh: '[CTC强制对齐API教程](tutorials/ctc_forced_alignment_api_tutorial.html#sphx-glr-tutorials-ctc-forced-alignment-api-tutorial-py)' - en: CTC forced alignment API tutorial![Forced alignment for multilingual data](../Images/ca023cbba331b61f65d37937f8a25beb.png) id: totrans-84 prefs: [] type: TYPE_NORMAL zh: CTC强制对齐API教程![多语言数据的强制对齐](../Images/ca023cbba331b61f65d37937f8a25beb.png) - en: '[Forced alignment for multilingual data](tutorials/forced_alignment_for_multilingual_data_tutorial.html#sphx-glr-tutorials-forced-alignment-for-multilingual-data-tutorial-py)' id: totrans-85 prefs: [] type: TYPE_NORMAL zh: '[多语言数据的强制对齐](tutorials/forced_alignment_for_multilingual_data_tutorial.html#sphx-glr-tutorials-forced-alignment-for-multilingual-data-tutorial-py)' - en: Forced alignment for multilingual data![Forced Alignment with Wav2Vec2](../Images/6658c9fe256ea584e84432cc92cd4db9.png) id: totrans-86 prefs: [] type: TYPE_NORMAL 
zh: 多语言数据的强制对齐![使用Wav2Vec2进行强制对齐](../Images/6658c9fe256ea584e84432cc92cd4db9.png) - en: '[Forced Alignment with Wav2Vec2](tutorials/forced_alignment_tutorial.html#sphx-glr-tutorials-forced-alignment-tutorial-py)' id: totrans-87 prefs: [] type: TYPE_NORMAL zh: '[使用Wav2Vec2进行强制对齐](tutorials/forced_alignment_tutorial.html#sphx-glr-tutorials-forced-alignment-tutorial-py)' - en: Forced Alignment with Wav2Vec2 id: totrans-88 prefs: [] type: TYPE_NORMAL zh: 使用Wav2Vec2进行强制对齐 - en: Pretrained Models[](#pertrained-models "Permalink to this heading") id: totrans-89 prefs: - PREF_H3 type: TYPE_NORMAL zh: 预训练模型[](#pertrained-models "跳转到此标题的永久链接") - en: '| [`MMS_FA`](generated/torchaudio.pipelines.MMS_FA.html#torchaudio.pipelines.MMS_FA "torchaudio.pipelines.MMS_FA") | Trained on 31K hours of data in 1,130 languages from *Scaling Speech Technology to 1,000+ Languages* [[Pratap *et al.*, 2023](references.html#id71 "Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Scaling speech technology to 1,000+ languages. 2023\. arXiv:2305.13516.")]. |' id: totrans-90 prefs: [] type: TYPE_TB zh: '| [`MMS_FA`](generated/torchaudio.pipelines.MMS_FA.html#torchaudio.pipelines.MMS_FA "torchaudio.pipelines.MMS_FA") | 在来自*将语音技术扩展到1000多种语言*的1,130种语言的31,000小时数据上训练[[Pratap等人,2023](references.html#id71 "Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Scaling speech technology to 1,000+ languages. 2023\. arXiv:2305.13516.")] |' - en: '## Tacotron2 Text-To-Speech[](#tacotron2-text-to-speech "Permalink to this heading")' id: totrans-91 prefs: [] type: TYPE_NORMAL zh: '## Tacotron2文本到语音[](#tacotron2-text-to-speech "跳转到此标题的永久链接")' - en: '`Tacotron2TTSBundle` defines text-to-speech pipelines and consists of three steps: tokenization, spectrogram generation and vocoder. The spectrogram generation is based on the [`Tacotron2`](generated/torchaudio.models.Tacotron2.html#torchaudio.models.Tacotron2 "torchaudio.models.Tacotron2") model.' id: totrans-92 prefs: [] type: TYPE_NORMAL zh: '`Tacotron2TTSBundle`定义了文本到语音流水线,包括三个步骤:分词、频谱图生成和声码器。频谱图生成基于[`Tacotron2`](generated/torchaudio.models.Tacotron2.html#torchaudio.models.Tacotron2 "torchaudio.models.Tacotron2")模型。' - en: '![https://download.pytorch.org/torchaudio/doc-assets/pipelines-tacotron2bundle.png](../Images/97c575d1ba15c954a23c68df0d5b0471.png)' id: totrans-93 prefs: [] type: TYPE_IMG zh: '![https://download.pytorch.org/torchaudio/doc-assets/pipelines-tacotron2bundle.png](../Images/97c575d1ba15c954a23c68df0d5b0471.png)' - en: '`TextProcessor` can be rule-based tokenization in the case of characters, or it can be a neural-network-based G2P model that generates a sequence of phonemes from the input text.' id: totrans-94 prefs: [] type: TYPE_NORMAL zh: '`TextProcessor`可以是基于规则的字符分词,也可以是一个基于神经网络的G2P模型,从输入文本生成音素序列。' - en: Similarly, `Vocoder` can be an algorithm without learned parameters, like Griffin-Lim, or a neural-network-based model like Waveglow.
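To make the three stages concrete, a minimal sketch of chaining them with one of the pretrained bundles listed below might look as follows; the input text and output path are placeholders:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()  # text -> token IDs (phoneme-based for this bundle;
                                         # may download a G2P checkpoint on first use)
tacotron2 = bundle.get_tacotron2()       # token IDs -> mel spectrogram
vocoder = bundle.get_vocoder()           # mel spectrogram -> waveform

text = "Hello world, this is a test."    # placeholder input text
with torch.inference_mode():
    tokens, lengths = processor(text)
    spec, spec_lengths, _ = tacotron2.infer(tokens, lengths)
    waveforms, _ = vocoder(spec, spec_lengths)

# Save the first (and only) generated waveform at the vocoder's sample rate.
torchaudio.save("output.wav", waveforms[0:1].cpu(), sample_rate=vocoder.sample_rate)
```

Swapping in a character-based or Griffin-Lim bundle from the table below changes only which bundle is loaded; the calling pattern stays the same.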
id: totrans-95 prefs: [] type: TYPE_NORMAL zh: 同样,`Vocoder`可以是一个没有学习参数的算法,比如Griffin-Lim,也可以是一个基于神经网络的模型,比如Waveglow。 - en: Interface[](#id61 "Permalink to this heading") id: totrans-96 prefs: - PREF_H3 type: TYPE_NORMAL zh: 接口[](#id61 "跳转到此标题的永久链接") - en: '| [`Tacotron2TTSBundle`](generated/torchaudio.pipelines.Tacotron2TTSBundle.html#torchaudio.pipelines.Tacotron2TTSBundle "torchaudio.pipelines.Tacotron2TTSBundle") | Data class that bundles associated information to use pretrained Tacotron2 and vocoder. |' id: totrans-97 prefs: [] type: TYPE_TB zh: '| [`Tacotron2TTSBundle`](generated/torchaudio.pipelines.Tacotron2TTSBundle.html#torchaudio.pipelines.Tacotron2TTSBundle "torchaudio.pipelines.Tacotron2TTSBundle") | 数据类,捆绑了与预训练的Tacotron2和声码器相关信息。 |' - en: '| [`Tacotron2TTSBundle.TextProcessor`](generated/torchaudio.pipelines.Tacotron2TTSBundle.TextProcessor.html#torchaudio.pipelines.Tacotron2TTSBundle.TextProcessor "torchaudio.pipelines.Tacotron2TTSBundle.TextProcessor") | Interface of the text processing part of Tacotron2TTS pipeline |' id: totrans-98 prefs: [] type: TYPE_TB zh: '| [`Tacotron2TTSBundle.TextProcessor`](generated/torchaudio.pipelines.Tacotron2TTSBundle.TextProcessor.html#torchaudio.pipelines.Tacotron2TTSBundle.TextProcessor "torchaudio.pipelines.Tacotron2TTSBundle.TextProcessor") | Tacotron2TTS流水线文本处理部分的接口 |' - en: '| [`Tacotron2TTSBundle.Vocoder`](generated/torchaudio.pipelines.Tacotron2TTSBundle.Vocoder.html#torchaudio.pipelines.Tacotron2TTSBundle.Vocoder "torchaudio.pipelines.Tacotron2TTSBundle.Vocoder") | Interface of the vocoder part of Tacotron2TTS pipeline |' id: totrans-99 prefs: [] type: TYPE_TB zh: '| [`Tacotron2TTSBundle.Vocoder`](generated/torchaudio.pipelines.Tacotron2TTSBundle.Vocoder.html#torchaudio.pipelines.Tacotron2TTSBundle.Vocoder "torchaudio.pipelines.Tacotron2TTSBundle.Vocoder") | Tacotron2TTS流水线的声码器部分的接口 |' - en: Tutorials using `Tacotron2TTSBundle` id: totrans-100 prefs: [] type: TYPE_NORMAL zh: 使用`Tacotron2TTSBundle`的教程 - en: '![Text-to-Speech with Tacotron2](../Images/5a248f30c367f9fb17d182966714fd7d.png)' id: totrans-101 prefs: [] type: TYPE_IMG zh: '![使用Tacotron2进行文本到语音转换](../Images/5a248f30c367f9fb17d182966714fd7d.png)' - en: '[Text-to-Speech with Tacotron2](tutorials/tacotron2_pipeline_tutorial.html#sphx-glr-tutorials-tacotron2-pipeline-tutorial-py)' id: totrans-102 prefs: [] type: TYPE_NORMAL zh: '[使用Tacotron2进行文本到语音转换](tutorials/tacotron2_pipeline_tutorial.html#sphx-glr-tutorials-tacotron2-pipeline-tutorial-py)' - en: Text-to-Speech with Tacotron2 id: totrans-103 prefs: [] type: TYPE_NORMAL zh: 使用Tacotron2进行文本到语音转换 - en: Pretrained Models[](#id62 "Permalink to this heading") id: totrans-104 prefs: - PREF_H3 type: TYPE_NORMAL zh: 预训练模型[](#id62 "跳转到此标题") - en: '| [`TACOTRON2_WAVERNN_PHONE_LJSPEECH`](generated/torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH.html#torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH "torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH") | Phoneme-based TTS pipeline with [`Tacotron2`](generated/torchaudio.models.Tacotron2.html#torchaudio.models.Tacotron2 "torchaudio.models.Tacotron2") trained on *LJSpeech* [[Ito and Johnson, 2017](references.html#id7 "Keith Ito and Linda Johnson. The lj speech dataset. \url https://keithito.com/LJ-Speech-Dataset/, 2017.")] for 1,500 epochs, and [`WaveRNN`](generated/torchaudio.models.WaveRNN.html#torchaudio.models.WaveRNN "torchaudio.models.WaveRNN") vocoder trained on 8 bits depth waveform of *LJSpeech* [[Ito and Johnson, 2017](references.html#id7 "Keith Ito and Linda Johnson. 
The lj speech dataset. \url https://keithito.com/LJ-Speech-Dataset/, 2017.")] for 10,000 epochs. |' id: totrans-105 prefs: [] type: TYPE_TB zh: '| [`TACOTRON2_WAVERNN_PHONE_LJSPEECH`](generated/torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH.html#torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH "torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH") | 基于音素的TTS流水线,使用在*LJSpeech*上训练的[`Tacotron2`](generated/torchaudio.models.Tacotron2.html#torchaudio.models.Tacotron2 "torchaudio.models.Tacotron2"),训练了1,500个epoch,并使用在*LJSpeech*的8位深度波形上训练了10,000个epoch的[`WaveRNN`](generated/torchaudio.models.WaveRNN.html#torchaudio.models.WaveRNN "torchaudio.models.WaveRNN")声码器。 |' - en: '| [`TACOTRON2_WAVERNN_CHAR_LJSPEECH`](generated/torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.html#torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH "torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH") | Character-based TTS pipeline with [`Tacotron2`](generated/torchaudio.models.Tacotron2.html#torchaudio.models.Tacotron2 "torchaudio.models.Tacotron2") trained on *LJSpeech* [[Ito and Johnson, 2017](references.html#id7 "Keith Ito and Linda Johnson. The lj speech dataset. \url https://keithito.com/LJ-Speech-Dataset/, 2017.")] for 1,500 epochs and [`WaveRNN`](generated/torchaudio.models.WaveRNN.html#torchaudio.models.WaveRNN "torchaudio.models.WaveRNN") vocoder trained on 8 bits depth waveform of *LJSpeech* [[Ito and Johnson, 2017](references.html#id7 "Keith Ito and Linda Johnson. The lj speech dataset. \url https://keithito.com/LJ-Speech-Dataset/, 2017.")] for 10,000 epochs. |' id: totrans-106 prefs: [] type: TYPE_TB zh: '| [`TACOTRON2_WAVERNN_CHAR_LJSPEECH`](generated/torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.html#torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH "torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH") | 基于字符的TTS流水线,使用在*LJSpeech*上训练的[`Tacotron2`](generated/torchaudio.models.Tacotron2.html#torchaudio.models.Tacotron2 "torchaudio.models.Tacotron2"),训练了1,500个epoch,并使用在*LJSpeech*的8位深度波形上训练了10,000个epoch的[`WaveRNN`](generated/torchaudio.models.WaveRNN.html#torchaudio.models.WaveRNN "torchaudio.models.WaveRNN")声码器。 |' - en: '| [`TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH`](generated/torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH.html#torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH "torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH") | Phoneme-based TTS pipeline with [`Tacotron2`](generated/torchaudio.models.Tacotron2.html#torchaudio.models.Tacotron2 "torchaudio.models.Tacotron2") trained on *LJSpeech* [[Ito and Johnson, 2017](references.html#id7 "Keith Ito and Linda Johnson. The lj speech dataset. \url https://keithito.com/LJ-Speech-Dataset/, 2017.")] for 1,500 epochs and [`GriffinLim`](generated/torchaudio.transforms.GriffinLim.html#torchaudio.transforms.GriffinLim "torchaudio.transforms.GriffinLim") as vocoder.
|' id: totrans-107 prefs: [] type: TYPE_TB zh: '| [`TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH`](generated/torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH.html#torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH "torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH") | 基于音素的TTS流水线,使用在*LJSpeech*上训练的[`Tacotron2`](generated/torchaudio.models.Tacotron2.html#torchaudio.models.Tacotron2 "torchaudio.models.Tacotron2"),训练了1,500个epoch,并使用[`GriffinLim`](generated/torchaudio.transforms.GriffinLim.html#torchaudio.transforms.GriffinLim "torchaudio.transforms.GriffinLim")作为声码器。 |' - en: '| [`TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH`](generated/torchaudio.pipelines.TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH.html#torchaudio.pipelines.TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH "torchaudio.pipelines.TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH") | Character-based TTS pipeline with [`Tacotron2`](generated/torchaudio.models.Tacotron2.html#torchaudio.models.Tacotron2 "torchaudio.models.Tacotron2") trained on *LJSpeech* [[Ito and Johnson, 2017](references.html#id7 "Keith Ito and Linda Johnson. The lj speech dataset. \url https://keithito.com/LJ-Speech-Dataset/, 2017.")] for 1,500 epochs, and [`GriffinLim`](generated/torchaudio.transforms.GriffinLim.html#torchaudio.transforms.GriffinLim "torchaudio.transforms.GriffinLim") as vocoder. |' id: totrans-108 prefs: [] type: TYPE_TB zh: '| [`TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH`](generated/torchaudio.pipelines.TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH.html#torchaudio.pipelines.TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH "torchaudio.pipelines.TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH") | 基于字符的TTS流水线,使用在*LJSpeech*上训练的[`Tacotron2`](generated/torchaudio.models.Tacotron2.html#torchaudio.models.Tacotron2 "torchaudio.models.Tacotron2"),训练了1,500个epoch,并使用[`GriffinLim`](generated/torchaudio.transforms.GriffinLim.html#torchaudio.transforms.GriffinLim "torchaudio.transforms.GriffinLim")作为声码器。 |' - en: Source Separation[](#source-separation "Permalink to this heading") id: totrans-109 prefs: - PREF_H2 type: TYPE_NORMAL zh: 声源分离[](#source-separation "跳转到此标题") - en: Interface[](#id69 "Permalink to this heading") id: totrans-110 prefs: - PREF_H3 type: TYPE_NORMAL zh: 接口[](#id69 "跳转到此标题") - en: '`SourceSeparationBundle` instantiates source separation models which take single-channel audio and generate multi-channel audio.' id: totrans-111 prefs: [] type: TYPE_NORMAL zh: '`SourceSeparationBundle`实例化声源分离模型,该模型接收单声道音频并生成多声道音频。' - en: '![https://download.pytorch.org/torchaudio/doc-assets/pipelines-sourceseparationbundle.png](../Images/69b4503224dac9c3e845bd309a996829.png)' id: totrans-112 prefs: [] type: TYPE_IMG zh: '![https://download.pytorch.org/torchaudio/doc-assets/pipelines-sourceseparationbundle.png](../Images/69b4503224dac9c3e845bd309a996829.png)' - en: '| [`SourceSeparationBundle`](generated/torchaudio.pipelines.SourceSeparationBundle.html#torchaudio.pipelines.SourceSeparationBundle "torchaudio.pipelines.SourceSeparationBundle") | Dataclass that bundles components for performing source separation.
|' id: totrans-113 prefs: [] type: TYPE_TB zh: '| [`SourceSeparationBundle`](generated/torchaudio.pipelines.SourceSeparationBundle.html#torchaudio.pipelines.SourceSeparationBundle "torchaudio.pipelines.SourceSeparationBundle") | 用于执行源分离的组件的数据类。 |' - en: Tutorials using `SourceSeparationBundle` id: totrans-114 prefs: [] type: TYPE_NORMAL zh: 使用`SourceSeparationBundle`的教程 - en: '![Music Source Separation with Hybrid Demucs](../Images/f822c0c06abbbf25ee5b2b2573665977.png)' id: totrans-115 prefs: [] type: TYPE_IMG zh: '![使用混合Demucs进行音乐源分离](../Images/f822c0c06abbbf25ee5b2b2573665977.png)' - en: '[Music Source Separation with Hybrid Demucs](tutorials/hybrid_demucs_tutorial.html#sphx-glr-tutorials-hybrid-demucs-tutorial-py)' id: totrans-116 prefs: [] type: TYPE_NORMAL zh: '[使用混合Demucs进行音乐源分离](tutorials/hybrid_demucs_tutorial.html#sphx-glr-tutorials-hybrid-demucs-tutorial-py)' - en: Music Source Separation with Hybrid Demucs id: totrans-117 prefs: [] type: TYPE_NORMAL zh: 使用混合Demucs进行音乐源分离 - en: Pretrained Models[](#id70 "Permalink to this heading") id: totrans-118 prefs: - PREF_H3 type: TYPE_NORMAL zh: 预训练模型[](#id70 "跳转到此标题") - en: '| [`CONVTASNET_BASE_LIBRI2MIX`](generated/torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX.html#torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX "torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX") | Pre-trained Source Separation pipeline with *ConvTasNet* [[Luo and Mesgarani, 2019](references.html#id22 "Yi Luo and Nima Mesgarani. Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, Aug 2019\. URL: http://dx.doi.org/10.1109/TASLP.2019.2915167, doi:10.1109/taslp.2019.2915167.")] trained on *Libri2Mix dataset* [[Cosentino *et al.*, 2020](references.html#id37 "Joris Cosentino, Manuel Pariente, Samuele Cornell, Antoine Deleforge, and Emmanuel Vincent. Librimix: an open-source dataset for generalizable speech separation. 2020\. arXiv:2005.11262.")]. |' id: totrans-119 prefs: [] type: TYPE_TB zh: '| [`CONVTASNET_BASE_LIBRI2MIX`](generated/torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX.html#torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX "torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX") | 使用*ConvTasNet*预训练的源分离流水线[[Luo和Mesgarani,2019](references.html#id22 "Yi Luo和Nima Mesgarani。Conv-tasnet: 超越理想的时频幅度掩蔽进行语音分离。IEEE/ACM音频、语音和语言处理交易,27(8):1256–1266,2019年8月。URL: http://dx.doi.org/10.1109/TASLP.2019.2915167, doi:10.1109/taslp.2019.2915167。")],在*Libri2Mix数据集*上进行训练[[Cosentino等,2020](references.html#id37 "Joris Cosentino, Manuel Pariente, Samuele Cornell, Antoine Deleforge和Emmanuel Vincent。Librimix: 用于通用语音分离的开源数据集。2020年。arXiv:2005.11262。")]. |' - en: '| [`HDEMUCS_HIGH_MUSDB_PLUS`](generated/torchaudio.pipelines.HDEMUCS_HIGH_MUSDB_PLUS.html#torchaudio.pipelines.HDEMUCS_HIGH_MUSDB_PLUS "torchaudio.pipelines.HDEMUCS_HIGH_MUSDB_PLUS") | Pre-trained music source separation pipeline with *Hybrid Demucs* [[Défossez, 2021](references.html#id50 "Alexandre Défossez. Hybrid spectrogram and waveform source separation. In Proceedings of the ISMIR 2021 Workshop on Music Source Separation. 2021.")] trained on both training and test sets of MUSDB-HQ [[Rafii *et al.*, 2019](references.html#id47 "Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. MUSDB18-HQ - an uncompressed version of musdb18\. December 2019\. 
URL: https://doi.org/10.5281/zenodo.3338373, doi:10.5281/zenodo.3338373.")] and an additional 150 extra songs from an internal database that was specifically produced for Meta. |' id: totrans-120 prefs: [] type: TYPE_TB zh: '| [`HDEMUCS_HIGH_MUSDB_PLUS`](generated/torchaudio.pipelines.HDEMUCS_HIGH_MUSDB_PLUS.html#torchaudio.pipelines.HDEMUCS_HIGH_MUSDB_PLUS "torchaudio.pipelines.HDEMUCS_HIGH_MUSDB_PLUS") | 使用*Hybrid Demucs*预训练的音乐源分离流水线[[Défossez, 2021](references.html#id50 "Alexandre Défossez. 混合频谱图和波形源分离。在ISMIR 2021音乐源分离研讨会论文集中。2021年。")],在MUSDB-HQ的训练集和测试集以及专门为Meta制作的内部数据库中的额外150首歌曲上进行训练。|' - en: '| [`HDEMUCS_HIGH_MUSDB`](generated/torchaudio.pipelines.HDEMUCS_HIGH_MUSDB.html#torchaudio.pipelines.HDEMUCS_HIGH_MUSDB "torchaudio.pipelines.HDEMUCS_HIGH_MUSDB") | Pre-trained music source separation pipeline with *Hybrid Demucs* [[Défossez, 2021](references.html#id50 "Alexandre Défossez. Hybrid spectrogram and waveform source separation. In Proceedings of the ISMIR 2021 Workshop on Music Source Separation. 2021.")] trained on the training set of MUSDB-HQ [[Rafii *et al.*, 2019](references.html#id47 "Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. MUSDB18-HQ - an uncompressed version of musdb18\. December 2019\. URL: https://doi.org/10.5281/zenodo.3338373, doi:10.5281/zenodo.3338373.")]. |' id: totrans-121 prefs: [] type: TYPE_TB zh: '| [`HDEMUCS_HIGH_MUSDB`](generated/torchaudio.pipelines.HDEMUCS_HIGH_MUSDB.html#torchaudio.pipelines.HDEMUCS_HIGH_MUSDB "torchaudio.pipelines.HDEMUCS_HIGH_MUSDB") | 使用*Hybrid Demucs*预训练的音乐源分离流水线[[Défossez, 2021](references.html#id50 "Alexandre Défossez. 混合频谱图和波形源分离。在ISMIR 2021音乐源分离研讨会论文集中。2021年。")],在MUSDB-HQ的训练集上进行训练[[Rafii等,2019](references.html#id47 "Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis和Rachel Bittner。MUSDB18-HQ - musdb18的未压缩版本。2019年12月。URL: https://doi.org/10.5281/zenodo.3338373, doi:10.5281/zenodo.3338373。")]. |' - en: Squim Objective[](#squim-objective "Permalink to this heading") id: totrans-122 prefs: - PREF_H2 type: TYPE_NORMAL zh: Squim目标[](#squim-objective "跳转到此标题") - en: Interface[](#id77 "Permalink to this heading") id: totrans-123 prefs: - PREF_H3 type: TYPE_NORMAL zh: 接口[](#id77 "跳转到此标题") - en: '[`SquimObjectiveBundle`](generated/torchaudio.pipelines.SquimObjectiveBundle.html#torchaudio.pipelines.SquimObjectiveBundle "torchaudio.pipelines.SquimObjectiveBundle") defines speech quality and intelligibility measurement (SQUIM) pipeline that can predict **objective** metric scores given the input waveform.' id: totrans-124 prefs: [] type: TYPE_NORMAL zh: '[`SquimObjectiveBundle`](generated/torchaudio.pipelines.SquimObjectiveBundle.html#torchaudio.pipelines.SquimObjectiveBundle "torchaudio.pipelines.SquimObjectiveBundle")定义了语音质量和可懂度测量(SQUIM)流水线,可以根据输入波形预测**客观**度量分数。' - en: '| [`SquimObjectiveBundle`](generated/torchaudio.pipelines.SquimObjectiveBundle.html#torchaudio.pipelines.SquimObjectiveBundle "torchaudio.pipelines.SquimObjectiveBundle") | Data class that bundles associated information to use pretrained [`SquimObjective`](generated/torchaudio.models.SquimObjective.html#torchaudio.models.SquimObjective "torchaudio.models.SquimObjective") model.
|' id: totrans-125 prefs: [] type: TYPE_TB zh: '| [`SquimObjectiveBundle`](generated/torchaudio.pipelines.SquimObjectiveBundle.html#torchaudio.pipelines.SquimObjectiveBundle "torchaudio.pipelines.SquimObjectiveBundle") | 封装了与预训练[`SquimObjective`](generated/torchaudio.models.SquimObjective.html#torchaudio.models.SquimObjective "torchaudio.models.SquimObjective")模型使用相关信息的数据类。 |' - en: Pretrained Models[](#id78 "Permalink to this heading") id: totrans-126 prefs: - PREF_H3 type: TYPE_NORMAL zh: 预训练模型[](#id78 "跳转到此标题") - en: '| [`SQUIM_OBJECTIVE`](generated/torchaudio.pipelines.SQUIM_OBJECTIVE.html#torchaudio.pipelines.SQUIM_OBJECTIVE "torchaudio.pipelines.SQUIM_OBJECTIVE") | SquimObjective pipeline trained using approach described in [[Kumar *et al.*, 2023](references.html#id69 "Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, and Buye Xu. Torchaudio-squim: reference-less speech quality and intelligibility measures in torchaudio. arXiv preprint arXiv:2304.01448, 2023.")] on the *DNS 2020 Dataset* [[Reddy *et al.*, 2020](references.html#id65 "Chandan KA Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, and others. The interspeech 2020 deep noise suppression challenge: datasets, subjective testing framework, and challenge results. arXiv preprint arXiv:2005.13981, 2020.")]. |' id: totrans-127 prefs: [] type: TYPE_TB zh: '| [`SQUIM_OBJECTIVE`](generated/torchaudio.pipelines.SQUIM_OBJECTIVE.html#torchaudio.pipelines.SQUIM_OBJECTIVE "torchaudio.pipelines.SQUIM_OBJECTIVE") | 使用[[Kumar等人,2023年](references.html#id69 "Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, and Buye Xu. Torchaudio-squim: reference-less speech quality and intelligibility measures in torchaudio. arXiv preprint arXiv:2304.01448, 2023.")]中描述的方法训练的SquimObjective管道,基于*DNS 2020数据集*[[Reddy等人,2020年](references.html#id65 "Chandan KA Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, and others. The interspeech 2020 deep noise suppression challenge: datasets, subjective testing framework, and challenge results. arXiv preprint arXiv:2005.13981, 2020.")]。 |' - en: Squim Subjective[](#squim-subjective "Permalink to this heading") id: totrans-128 prefs: - PREF_H2 type: TYPE_NORMAL zh: Squim Subjective[](#squim-subjective "跳转到此标题的永久链接") - en: Interface[](#id81 "Permalink to this heading") id: totrans-129 prefs: - PREF_H3 type: TYPE_NORMAL zh: 接口[](#id81 "跳转到此标题的永久链接") - en: '[`SquimSubjectiveBundle`](generated/torchaudio.pipelines.SquimSubjectiveBundle.html#torchaudio.pipelines.SquimSubjectiveBundle "torchaudio.pipelines.SquimSubjectiveBundle") defines speech quality and intelligibility measurement (SQUIM) pipeline that can predict **subjective** metric scores given the input waveform.' 
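A minimal sketch of this subjective pipeline, assuming two placeholder recordings (the speech to be rated and a clean, non-matching reference), could look like this:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.SQUIM_SUBJECTIVE
model = bundle.get_model()

# Placeholder files; both waveforms are resampled to the rate the bundle expects (16 kHz).
test_wav, sr = torchaudio.load("degraded_speech.wav")
ref_wav, ref_sr = torchaudio.load("clean_reference.wav")  # non-matching clean reference speech
test_wav = torchaudio.functional.resample(test_wav, sr, bundle.sample_rate)
ref_wav = torchaudio.functional.resample(ref_wav, ref_sr, bundle.sample_rate)

with torch.inference_mode():
    mos = model(test_wav, ref_wav)  # predicted Mean Opinion Score for the test waveform
```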
id: totrans-130 prefs: [] type: TYPE_NORMAL zh: '[`SquimSubjectiveBundle`](generated/torchaudio.pipelines.SquimSubjectiveBundle.html#torchaudio.pipelines.SquimSubjectiveBundle "torchaudio.pipelines.SquimSubjectiveBundle")定义了可以根据输入波形预测**主观**度量分数的语音质量和可懂度测量(SQUIM)管道。' - en: '| [`SquimSubjectiveBundle`](generated/torchaudio.pipelines.SquimSubjectiveBundle.html#torchaudio.pipelines.SquimSubjectiveBundle "torchaudio.pipelines.SquimSubjectiveBundle") | Data class that bundles associated information to use pretrained [`SquimSubjective`](generated/torchaudio.models.SquimSubjective.html#torchaudio.models.SquimSubjective "torchaudio.models.SquimSubjective") model. |' id: totrans-131 prefs: [] type: TYPE_TB zh: '| [`SquimSubjectiveBundle`](generated/torchaudio.pipelines.SquimSubjectiveBundle.html#torchaudio.pipelines.SquimSubjectiveBundle "torchaudio.pipelines.SquimSubjectiveBundle") | 数据类,捆绑了相关信息以使用预训练的[`SquimSubjective`](generated/torchaudio.models.SquimSubjective.html#torchaudio.models.SquimSubjective "torchaudio.models.SquimSubjective")模型。 |' - en: Pretrained Models[](#id82 "Permalink to this heading") id: totrans-132 prefs: - PREF_H3 type: TYPE_NORMAL zh: 预训练模型[](#id82 "跳转到此标题的永久链接") - en: '| [`SQUIM_SUBJECTIVE`](generated/torchaudio.pipelines.SQUIM_SUBJECTIVE.html#torchaudio.pipelines.SQUIM_SUBJECTIVE "torchaudio.pipelines.SQUIM_SUBJECTIVE") | SquimSubjective pipeline trained as described in [[Manocha and Kumar, 2022](references.html#id66 "Pranay Manocha and Anurag Kumar. Speech quality assessment through mos using non-matching references. arXiv preprint arXiv:2206.12285, 2022.")] and [[Kumar *et al.*, 2023](references.html#id69 "Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, and Buye Xu. Torchaudio-squim: reference-less speech quality and intelligibility measures in torchaudio. arXiv preprint arXiv:2304.01448, 2023.")] on the *BVCC* [[Cooper and Yamagishi, 2021](references.html#id67 "Erica Cooper and Junichi Yamagishi. How do voices from past speech synthesis challenges compare today? arXiv preprint arXiv:2105.02373, 2021.")] and *DAPS* [[Mysore, 2014](references.html#id68 "Gautham J Mysore. Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech?—a dataset, insights, and challenges. IEEE Signal Processing Letters, 22(8):1006–1010, 2014.")] datasets. |' id: totrans-133 prefs: [] type: TYPE_TB zh: '| [`SQUIM_SUBJECTIVE`](generated/torchaudio.pipelines.SQUIM_SUBJECTIVE.html#torchaudio.pipelines.SQUIM_SUBJECTIVE "torchaudio.pipelines.SQUIM_SUBJECTIVE") | 如[[Manocha和Kumar,2022年](references.html#id66 "Pranay Manocha and Anurag Kumar. Speech quality assessment through mos using non-matching references. arXiv preprint arXiv:2206.12285, 2022.")]和[[Kumar等人,2023年](references.html#id69 "Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, and Buye Xu. Torchaudio-squim: reference-less speech quality and intelligibility measures in torchaudio. arXiv preprint arXiv:2304.01448, 2023.")]中描述的方法训练的SquimSubjective管道,基于*BVCC*[[Cooper和Yamagishi,2021年](references.html#id67 "Erica Cooper and Junichi Yamagishi. How do voices from past speech synthesis challenges compare today? arXiv preprint arXiv:2105.02373, 2021.")]和*DAPS*[[Mysore,2014年](references.html#id68 "Gautham J Mysore. Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech?—a dataset, insights, and challenges. 
IEEE Signal Processing Letters, 22(8):1006–1010, 2014.")]数据集。 |'
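For comparison with the subjective sketch above, the objective pipeline from the Squim Objective section takes only the waveform under test; a minimal sketch with a placeholder file path:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.SQUIM_OBJECTIVE
model = bundle.get_model()

waveform, sr = torchaudio.load("degraded_speech.wav")  # placeholder path
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    stoi, pesq, si_sdr = model(waveform)  # reference-free estimates of STOI, PESQ, and Si-SDR
```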