# torchaudio.models

> Original: [https://pytorch.org/audio/stable/models.html](https://pytorch.org/audio/stable/models.html)

The `torchaudio.models` subpackage contains definitions of models for addressing common audio tasks.

Note

For models with pre-trained parameters, please refer to the [`torchaudio.pipelines`](pipelines.html#module-torchaudio.pipelines "torchaudio.pipelines") module.

Model definitions are responsible for constructing computation graphs and executing them.

Some models have complex structures and variations. For such models, factory functions are provided; the sketch below contrasts the two ways of building a model.
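As a minimal sketch of both construction styles, the snippet below instantiates a `Conformer` directly from its class and a Conv-TasNet through the `conv_tasnet_base()` factory function. The hyperparameter values and tensor shapes are illustrative assumptions, not canonical settings; check the linked class pages for the exact signatures in your torchaudio version.

```python
import torch
import torchaudio

# Direct construction from a class definition: a Conformer encoder over
# 80-dimensional features (the hyperparameter values are illustrative).
conformer = torchaudio.models.Conformer(
    input_dim=80,
    num_heads=4,
    ffn_dim=128,
    num_layers=4,
    depthwise_conv_kernel_size=31,
)

features = torch.rand(2, 100, 80)   # (batch, frames, input_dim)
lengths = torch.tensor([100, 60])   # valid frames in each utterance

output, output_lengths = conformer(features, lengths)

# Construction through a factory function: conv_tasnet_base() assembles a
# Conv-TasNet variant with a standard configuration in a single call.
separator = torchaudio.models.conv_tasnet_base(num_sources=2)
```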
| Model | Description |
| --- | --- |
| [`Conformer`](generated/torchaudio.models.Conformer.html#torchaudio.models.Conformer "torchaudio.models.Conformer") | Conformer architecture introduced in *Conformer: Convolution-augmented Transformer for Speech Recognition* [[Gulati *et al.*, 2020](references.html#id21 "Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Conformer: convolution-augmented transformer for speech recognition. 2020. arXiv:2005.08100.")]. |
| [`ConvTasNet`](generated/torchaudio.models.ConvTasNet.html#torchaudio.models.ConvTasNet "torchaudio.models.ConvTasNet") | Conv-TasNet architecture introduced in *Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation* [[Luo and Mesgarani, 2019](references.html#id22 "Yi Luo and Nima Mesgarani. Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, Aug 2019. URL: http://dx.doi.org/10.1109/TASLP.2019.2915167, doi:10.1109/taslp.2019.2915167.")]; a usage sketch follows this table. |
| [`DeepSpeech`](generated/torchaudio.models.DeepSpeech.html#torchaudio.models.DeepSpeech "torchaudio.models.DeepSpeech") | DeepSpeech architecture introduced in *Deep Speech: Scaling up end-to-end speech recognition* [[Hannun *et al.*, 2014](references.html#id17 "Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: scaling up end-to-end speech recognition. 2014. arXiv:1412.5567.")]. |
| [`Emformer`](generated/torchaudio.models.Emformer.html#torchaudio.models.Emformer "torchaudio.models.Emformer") | Emformer architecture introduced in *Emformer: Efficient Memory Transformer Based Acoustic Model for Low Latency Streaming Speech Recognition* [[Shi *et al.*, 2021](references.html#id30 "Yangyang Shi, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian Chan, Frank Zhang, Duc Le, and Mike Seltzer. Emformer: efficient memory transformer based acoustic model for low latency streaming speech recognition. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6783-6787. 2021.")]. |
| [`HDemucs`](generated/torchaudio.models.HDemucs.html#torchaudio.models.HDemucs "torchaudio.models.HDemucs") | Hybrid Demucs model from *Hybrid Spectrogram and Waveform Source Separation* [[Défossez, 2021](references.html#id50 "Alexandre Défossez. Hybrid spectrogram and waveform source separation. In Proceedings of the ISMIR 2021 Workshop on Music Source Separation. 2021.")]. |
| [`HuBERTPretrainModel`](generated/torchaudio.models.HuBERTPretrainModel.html#torchaudio.models.HuBERTPretrainModel "torchaudio.models.HuBERTPretrainModel") | HuBERT model used for pretraining in *HuBERT* [[Hsu *et al.*, 2021](references.html#id16 "Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: self-supervised speech representation learning by masked prediction of hidden units. 2021. arXiv:2106.07447.")]. |
| [`RNNT`](generated/torchaudio.models.RNNT.html#torchaudio.models.RNNT "torchaudio.models.RNNT") | Recurrent neural network transducer (RNN-T) model. |
| [`RNNTBeamSearch`](generated/torchaudio.models.RNNTBeamSearch.html#torchaudio.models.RNNTBeamSearch "torchaudio.models.RNNTBeamSearch") | Beam search decoder for RNN-T model. |
| [`SquimObjective`](generated/torchaudio.models.SquimObjective.html#torchaudio.models.SquimObjective "torchaudio.models.SquimObjective") | Speech Quality and Intelligibility Measures (SQUIM) model that predicts **objective** metric scores for speech enhancement (e.g., STOI, PESQ, and SI-SDR); a scoring sketch follows this table. |
| [`SquimSubjective`](generated/torchaudio.models.SquimSubjective.html#torchaudio.models.SquimSubjective "torchaudio.models.SquimSubjective") | Speech Quality and Intelligibility Measures (SQUIM) model that predicts **subjective** metric scores for speech enhancement (e.g., Mean Opinion Score (MOS)). |
| [`Tacotron2`](generated/torchaudio.models.Tacotron2.html#torchaudio.models.Tacotron2 "torchaudio.models.Tacotron2") | Tacotron2 model from *Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions* [[Shen *et al.*, 2018](references.html#id27 "Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, and others. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783. IEEE, 2018.")] based on the implementation from [Nvidia Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples/). |
| [`Wav2Letter`](generated/torchaudio.models.Wav2Letter.html#torchaudio.models.Wav2Letter "torchaudio.models.Wav2Letter") | Wav2Letter model architecture from *Wav2Letter: an End-to-End ConvNet-based Speech Recognition System* [[Collobert *et al.*, 2016](references.html#id19 "Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. Wav2letter: an end-to-end convnet-based speech recognition system. 2016. arXiv:1609.03193.")]. |
| [`Wav2Vec2Model`](generated/torchaudio.models.Wav2Vec2Model.html#torchaudio.models.Wav2Vec2Model "torchaudio.models.Wav2Vec2Model") | Acoustic model used in *wav2vec 2.0* [[Baevski *et al.*, 2020](references.html#id15 "Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. Wav2vec 2.0: a framework for self-supervised learning of speech representations. 2020. arXiv:2006.11477.")]. |
| [`WaveRNN`](generated/torchaudio.models.WaveRNN.html#torchaudio.models.WaveRNN "torchaudio.models.WaveRNN") | WaveRNN model from *Efficient Neural Audio Synthesis* [[Kalchbrenner *et al.*, 2018](references.html#id3 "Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. CoRR, 2018. URL: http://arxiv.org/abs/1802.08435, arXiv:1802.08435.")] based on the implementation from [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN). |