Unverified commit 2b28ff75 authored by KP, committed by GitHub

Merge branch 'develop' into add_seeinthedark_module

......@@ -30,3 +30,9 @@
- --show-source
- --statistics
files: \.py$
- repo: https://github.com/asottile/reorder_python_imports
rev: v2.4.0
hooks:
- id: reorder-python-imports
exclude: (?=third_party).*(\.py)$
......@@ -4,7 +4,7 @@ English | [简体中文](README_ch.md)
<img src="./docs/imgs/paddlehub_logo.jpg" align="middle">
<p align="center">
<div align="center">
<h3> <a href=#QuickStart> QuickStart </a> | <a href="https://paddlehub.readthedocs.io/en/release-v2.1"> Tutorial </a> | <a href="https://www.paddlepaddle.org.cn/hublist"> Models List </a> | <a href="https://www.paddlepaddle.org.cn/hub"> Demos </a> </h3>
<h3> <a href=#QuickStart> QuickStart </a> | <a href="https://paddlehub.readthedocs.io/en/release-v2.1"> Tutorial </a> | <a href="./modules"> Models List </a> | <a href="https://www.paddlepaddle.org.cn/hub"> Demos </a> </h3>
</div>
------------------------------------------------------------------------------------------
......@@ -28,7 +28,7 @@ English | [简体中文](README_ch.md)
## Introduction and Features
- **PaddleHub** aims to provide developers with rich, high-quality, and directly usable pre-trained models.
- **Abundant Pre-trained Models**: 300+ pre-trained models cover the 5 major categories, including Image, Text, Audio, Video, and Industrial application. All of them are free for download and offline usage.
- **Abundant Pre-trained Models**: 360+ pre-trained models cover the 5 major categories, including Image, Text, Audio, Video, and Industrial application. All of them are free for download and offline usage.
- **No Need for Deep Learning Background**: you can use AI models quickly and enjoy the dividends of the artificial intelligence era.
- **Quick Model Prediction**: model prediction can be realized through a few lines of scripts to quickly experience the model effect.
- **Model As Service**: deploy a deep learning model as an API service with a single command.
......@@ -36,6 +36,7 @@ English | [简体中文](README_ch.md)
- **Cross-platform**: support Linux, Windows, MacOS and other operating systems.
### Recent updates
- **2021.12.22:** The v2.2.0 version is released. [1] More than 100 new models released, including dialog, speech, segmentation, OCR, text processing, GANs, and many other categories. The total number of pre-trained models reaches [**【360】**](https://www.paddlepaddle.org.cn/hublist). [2] Add an [indexed file](./modules/README.md) with useful information on the pre-trained models supported by PaddleHub. [3] Refactor the READMEs of pre-trained models.
- **2021.05.12:** Add an open-domain dialogue system, i.e., [plato-mini](https://www.paddlepaddle.org.cn/hubdetail?name=plato-mini&en_category=TextGeneration), making it easy to build a WeChat chatbot with the help of Wechaty. [See Demo](https://github.com/KPatr1ck/paddlehub-wechaty-demo)
- **2021.04.27:** The v2.1.0 version is released. [1] Add support for five new models, including two high-precision semantic segmentation models based on the VOC dataset and three audio classification models. [2] Enhance the transfer learning capabilities for image semantic segmentation, text semantic matching and audio classification on related datasets. [3] Add export APIs for two model formats, i.e., ONNX and PaddleInference. [4] Add support for [BentoML](https://github.com/bentoml/BentoML/), a cloud-native framework for serving deployment. Users can easily serve pre-trained models from PaddleHub by following the [tutorial notebooks](https://github.com/PaddlePaddle/PaddleHub/blob/release/v2.1/demo/serving/bentoml/cloud-native-model-serving-with-bentoml.ipynb). Also see the announcement and [release note](https://github.com/bentoml/BentoML/releases/tag/v0.12.1) from BentoML. (Many thanks to @[parano](https://github.com/parano) @[cqvu](https://github.com/cqvu) @[deehrlic](https://github.com/deehrlic) for contributing this feature to PaddleHub.) [5] The total number of pre-trained models reaches **【300】**.
- **2021.02.18:** The v2.0.0 version is released, making model development and debugging easier and the finetune task more flexible and easy to use. Transfer learning for visual tasks is fully upgraded, supporting tasks such as image classification, image colorization, and style transfer; Transformer models such as BERT, ERNIE, and RoBERTa are upgraded to dynamic graphs, with Fine-tune support for text classification and sequence labeling; the Serving capability is optimized to support multi-card prediction and automatic load balancing, greatly improving performance; the new automatic data augmentation capability Auto Augment efficiently searches for data-augmentation strategy combinations suited to a dataset. 61 word embedding models were added, including 51 Chinese and 10 English models; 4 image segmentation models, 2 depth models, 7 image generation models, and 3 text generation models were also added, bringing the total number of pre-trained models to **【274】**.
......@@ -44,8 +45,8 @@ English | [简体中文](README_ch.md)
## Visualization Demo [[More]](./docs/docs_en/visualization.md)
### **Computer Vision (161 models)**
## Visualization Demo [[More]](./docs/docs_en/visualization.md) [[ModelList]](./modules)
### **[Computer Vision (212 models)](./modules#Image)**
<div align="center">
<img src="./docs/imgs/Readme_Related/Image_all.gif" width = "530" height = "400" />
</div>
......@@ -53,7 +54,7 @@ English | [简体中文](README_ch.md)
- Many thanks to CopyRight@[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR), [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection), [PaddleGAN](https://github.com/PaddlePaddle/PaddleGAN), [AnimeGAN](https://github.com/TachibanaYoshino/AnimeGANv2), [openpose](https://github.com/CMU-Perceptual-Computing-Lab/openpose), [PaddleSeg](https://github.com/PaddlePaddle/PaddleSeg), [Zhengxia Zou](https://github.com/jiupinjia/SkyAR), [PaddleClas](https://github.com/PaddlePaddle/PaddleClas) for the pre-trained models, you can try to train your models with them.
### **Natural Language Processing (129 models)**
### **[Natural Language Processing (130 models)](./modules#Text)**
<div align="center">
<img src="./docs/imgs/Readme_Related/Text_all.gif" width = "640" height = "240" />
</div>
......@@ -62,9 +63,37 @@ English | [简体中文](README_ch.md)
### Speech (3 models)
### [Speech (15 models)](./modules#Audio)
- ASR speech recognition, with multiple algorithms available.
- The speech recognition effect is as follows:
<div align="center">
<table>
<thead>
<tr>
<th width=250> Input Audio </th>
<th width=550> Recognition Result </th>
</tr>
</thead>
<tbody>
<tr>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav" rel="nofollow">
<img align="center" src="./docs/imgs/Readme_Related/audio_icon.png" width=250 ></a><br>
</td>
<td >I knocked at the door on the ancient side of the building.</td>
</tr>
<tr>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav" rel="nofollow">
<img align="center" src="./docs/imgs/Readme_Related/audio_icon.png" width=250></a><br>
</td>
<td>我认为跑步最重要的就是给我带来了身体健康。</td>
</tr>
</tbody>
</table>
</div>
- TTS speech synthesis, with multiple algorithms available.
- Many thanks to CopyRight@[Parakeet](https://github.com/PaddlePaddle/Parakeet) for the pre-trained models, you can try to train your models with Parakeet.
- Input: `Life was like a box of chocolates, you never know what you're gonna get.`
- The synthesis effect is as follows:
<div align="center">
......@@ -95,7 +124,9 @@ English | [简体中文](README_ch.md)
</table>
</div>
### Video (8 models)
- Many thanks to CopyRight@[PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech) for the pre-trained models, you can try to train your models with PaddleSpeech.
### [Video (8 models)](./modules#Video)
- Short-video classification trained on large-scale video datasets, supporting prediction of 3000+ tag types for short-form videos.
- Many thanks to CopyRight@[PaddleVideo](https://github.com/PaddlePaddle/PaddleVideo) for the pre-trained model, you can try to train your models with PaddleVideo.
- `Example: given a short video of swimming, the algorithm outputs "swimming"`
......@@ -231,3 +262,4 @@ We welcome you to contribute code to PaddleHub, and thank you for your feedback.
* Many thanks to [zl1271](https://github.com/zl1271) for fixing serving docs typo
* Many thanks to [AK391](https://github.com/AK391) for adding the webdemo of UGATIT and deoldify models in Hugging Face spaces
* Many thanks to [itegel](https://github.com/itegel) for fixing quick start docs typo
* Many thanks to [AK391](https://github.com/AK391) for adding the webdemo of Photo2Cartoon model in Hugging Face spaces
......@@ -4,7 +4,7 @@
<img src="./docs/imgs/paddlehub_logo.jpg" align="middle">
<p align="center">
<div align="center">
<h3> <a href=#QuickStart> QuickStart </a> | <a href="https://paddlehub.readthedocs.io/zh_CN/release-v2.1//"> Tutorial </a> | <a href="https://www.paddlepaddle.org.cn/hublist"> Model Search </a> | <a href="https://www.paddlepaddle.org.cn/hub"> Demos </a>
<h3> <a href=#QuickStart> QuickStart </a> | <a href="https://paddlehub.readthedocs.io/zh_CN/release-v2.1//"> Tutorial </a> | <a href="./modules/README_ch.md"> Model Zoo </a> | <a href="https://www.paddlepaddle.org.cn/hub"> Demos </a>
</h3>
</div>
......@@ -30,7 +30,7 @@
## Introduction and Features
- PaddleHub aims to provide developers with rich, high-quality, directly usable pre-trained models
- **【Abundant Models】**: 300+ pre-trained models covering the five major categories of CV, NLP, Audio, Video, and industrial applications, all open source for download and runnable offline
- **【Abundant Models】**: **360+** pre-trained models covering the five major categories of CV, NLP, Audio, Video, and industrial applications, all open source for download and runnable offline
- **【Low Barrier to Entry】**: no deep learning background, data, or training process required; AI models can be put to use quickly
- **【One-line Prediction】**: invoke models through a single command line or a minimal Python API to quickly experience model effects
- **【One-line Serving】**: deploy a deep learning model as an API service with a single command
......@@ -38,6 +38,7 @@
- **【Cross-platform Compatibility】**: runs on Linux, Windows, MacOS, and other operating systems
## Recent Updates
- **2021.12.22:** Release v2.2.0. [1] Add 100+ high-quality models covering dialogue, speech processing, semantic segmentation, OCR, text processing, image generation, and other domains, bringing the total number of pre-trained models to [**【360+】**](https://www.paddlepaddle.org.cn/hublist); [2] Add a model [index list](./modules/README_ch.md) with model names, networks, datasets, and usage scenarios to help users quickly locate the models they need; [3] Improve model documentation layout, presenting more practical information such as datasets, metrics, and model sizes.
- **2021.05.12:** Add the lightweight Chinese dialogue model [plato-mini](https://www.paddlepaddle.org.cn/hubdetail?name=plato-mini&en_category=TextGeneration), which can be combined with Wechaty to build a WeChat chatbot. [Reference demo](https://github.com/KPatr1ck/paddlehub-wechaty-demo)
- **2021.04.27:** Release v2.1.0. [1] Add two high-precision semantic segmentation models based on the VOC dataset and three audio classification models. [2] Add Fine-tune capabilities and datasets for image semantic segmentation, text semantic matching, and audio classification tasks. Improved deployment capabilities: [3] Add export of ONNX and PaddleInference model formats. [4] Add [BentoML](https://github.com/bentoml/BentoML) cloud-native serving deployment, supporting a unified multi-framework model management and deployment workflow; [detailed tutorial](https://github.com/PaddlePaddle/PaddleHub/blob/release/v2.1/demo/serving/bentoml/cloud-native-model-serving-with-bentoml.ipynb). See also the BentoML v0.12.1 [release note](https://github.com/bentoml/BentoML/releases/tag/v0.12.1). (Thanks to @[parano](https://github.com/parano) @[cqvu](https://github.com/cqvu) @[deehrlic](https://github.com/deehrlic) for their contributions and support.) [5] The total number of pre-trained models reaches [**【300】**](https://www.paddlepaddle.org.cn/hublist).
- **2021.02.18:** Release v2.0.0. [1] Model development and debugging are easier, and the finetune API is more flexible and easy to use. Transfer learning for vision tasks is fully upgraded, supporting [image classification](./demo/image_classification/README.md), [image colorization](./demo/colorization/README.md), [style transfer](./demo/style_transfer/README.md), and other tasks; Transformer models such as BERT, ERNIE, and RoBERTa are upgraded to dynamic graphs, with Fine-tune support for [text classification](./demo/text_classification/README.md) and [sequence labeling](./demo/sequence_labeling/README.md). [2] Optimize the Serving deployment capability, supporting multi-card prediction and automatic load balancing, with greatly improved performance. [3] Add the automatic data augmentation capability [Auto Augment](./demo/autoaug/README.md), which efficiently searches for data-augmentation policy combinations suited to a dataset. [4] Add 61 [word embedding models](./modules/text/embedding), including 51 Chinese and 10 English models; add 4 [image segmentation](./modules/thirdparty/image/semantic_segmentation) models, 2 [depth estimation](./modules/thirdparty/image/depth_estimation) models, 7 [image generation](./modules/thirdparty/image/Image_gan/style_transfer) models, and 3 [text generation](./modules/thirdparty/text/text_generation) models. [5] The total number of pre-trained models reaches [**【274】**](https://www.paddlepaddle.org.cn/hublist).
......@@ -47,9 +48,9 @@
## **Demo of Selected Models [[More]](./docs/docs_ch/visualization.md)**
## **Demo of Selected Models [[More]](./docs/docs_ch/visualization.md) [[Model Zoo]](./modules/README_ch.md)**
### **Computer Vision (161 models)**
### **[Computer Vision (212 models)](./modules/README_ch.md#图像)**
- Including image classification, face detection, mask detection, vehicle detection, face/body/hand keypoint detection, portrait segmentation, text recognition in 80+ languages, image super-resolution/colorization/cartoonization, and more
<div align="center">
<img src="./docs/imgs/Readme_Related/Image_all.gif" width = "530" height = "400" />
......@@ -58,7 +59,7 @@
- Many thanks to CopyRight@[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR), [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection), [PaddleGAN](https://github.com/PaddlePaddle/PaddleGAN), [AnimeGAN](https://github.com/TachibanaYoshino/AnimeGANv2), [openpose](https://github.com/CMU-Perceptual-Computing-Lab/openpose), [PaddleSeg](https://github.com/PaddlePaddle/PaddleSeg), [Zhengxia Zou](https://github.com/jiupinjia/SkyAR), [PaddleClas](https://github.com/PaddlePaddle/PaddleClas) for the pre-trained models; training capability is open, welcome to try.
### **Text (129 models)**
### **[Text (130 models)](./modules/README_ch.md#文本)**
- Including Chinese word segmentation, part-of-speech tagging and named entity recognition, syntactic parsing, AI poetry/couplets/love letters/acrostic poems, Chinese sentiment analysis, Chinese pornographic-text moderation, and more
<div align="center">
<img src="./docs/imgs/Readme_Related/Text_all.gif" width = "640" height = "240" />
......@@ -67,9 +68,37 @@
- Many thanks to CopyRight@[ERNIE](https://github.com/PaddlePaddle/ERNIE), [LAC](https://github.com/baidu/LAC), [DDParser](https://github.com/baidu/DDParser) for the pre-trained models; training capability is open, welcome to try.
### **Speech (3 models)**
### **[Speech (15 models)](./modules/README_ch.md#语音)**
- ASR speech recognition, with multiple algorithms available
- The speech recognition effect is as follows:
<div align="center">
<table>
<thead>
<tr>
<th width=250> Input Audio </th>
<th width=550> Recognition Result </th>
</tr>
</thead>
<tbody>
<tr>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav" rel="nofollow">
<img align="center" src="./docs/imgs/Readme_Related/audio_icon.png" width=250 ></a><br>
</td>
<td >I knocked at the door on the ancient side of the building.</td>
</tr>
<tr>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav" rel="nofollow">
<img align="center" src="./docs/imgs/Readme_Related/audio_icon.png" width=250></a><br>
</td>
<td>我认为跑步最重要的就是给我带来了身体健康。</td>
</tr>
</tbody>
</table>
</div>
- TTS speech synthesis, with multiple algorithms available
- Many thanks to CopyRight@[Parakeet](https://github.com/PaddlePaddle/Parakeet) for the pre-trained models; training capability is open, welcome to try.
- Input: `Life was like a box of chocolates, you never know what you're gonna get.`
- The synthesis effect is as follows:
<div align="center">
......@@ -100,7 +129,9 @@
</table>
</div>
### **Video (8 models)**
- Many thanks to CopyRight@[PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech) for the pre-trained models; training capability is open, welcome to try.
### **[Video (8 models)](./modules/README_ch.md#视频)**
- Including short-video classification with 3000+ tag types, TOP-K tag output, and multiple selectable algorithms.
- Many thanks to CopyRight@[PaddleVideo](https://github.com/PaddlePaddle/PaddleVideo) for the pre-trained models; training capability is open, welcome to try.
- `Example: given a short video of swimming, the algorithm outputs "swimming"`
......@@ -247,3 +278,4 @@ print(results)
* Many thanks to [zl1271](https://github.com/zl1271) for fixing typos in the serving docs
* Many thanks to [AK391](https://github.com/AK391) for adding web demos of the UGATIT and DeOldify models in Hugging Face Spaces
* Many thanks to [itegel](https://github.com/itegel) for fixing typos in the quick start docs
* Many thanks to [AK391](https://github.com/AK391) for adding a web demo of the Photo2Cartoon model in Hugging Face Spaces
......@@ -50,6 +50,8 @@
**UGATIT Selfie2anime Huggingface Web Demo**: Integrated to [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See demo: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/akhaliq/U-GAT-IT-selfie2anime)
**Photo2Cartoon Huggingface Web Demo**: Integrated to [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See demo: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/akhaliq/photo2cartoon)
### Object Detection
- Pedestrian detection, vehicle detection, and more industrial-grade ultra-large-scale pretrained models are provided.
......
......@@ -3,7 +3,7 @@
|Model Name|u2_conformer_aishell|
| :--- | :---: |
|Category|Speech - Speech Recognition|
|Network|DeepSpeech2|
|Network|Conformer|
|Dataset|AISHELL-1|
|Fine-tuning Supported|No|
|Model Size|284MB|
......
......@@ -3,7 +3,7 @@
|Model Name|u2_conformer_librispeech|
| :--- | :---: |
|Category|Speech - Speech Recognition|
|Network|DeepSpeech2|
|Network|Conformer|
|Dataset|LibriSpeech|
|Fine-tuning Supported|No|
|Model Size|191MB|
......
# u2_conformer_wenetspeech
|Model Name|u2_conformer_wenetspeech|
| :--- | :---: |
|Category|Speech - Speech Recognition|
|Network|Conformer|
|Dataset|WenetSpeech|
|Fine-tuning Supported|No|
|Model Size|494MB|
|Latest Update|2021-12-10|
|Metrics|Chinese CER 0.087|
## I. Basic Model Information
### Model Introduction
The U2 Conformer model is an end-to-end speech recognition model for both English and Chinese. u2_conformer_wenetspeech uses a Conformer encoder and a Transformer decoder; decoding performs a first pass of scoring with CTC prefix beam search, then rescores the candidates with the attention decoder to obtain the final result.
u2_conformer_wenetspeech is pre-trained on [WenetSpeech](https://wenet-e2e.github.io/WenetSpeech/), an open-source Mandarin speech dataset; the model achieves a CER of 0.087 on its DEV set.
<p align="center">
<img src="https://paddlehub.bj.bcebos.com/paddlehub-img/conformer.png" hspace='10'/> <br />
</p>
<p align="center">
<img src="https://paddlehub.bj.bcebos.com/paddlehub-img/u2_conformer.png" hspace='10'/> <br />
</p>
For more details, please refer to:
- [Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition](https://arxiv.org/abs/2012.05481)
- [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
- [WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition](https://arxiv.org/abs/2110.03370)
## II. Installation
- ### 1. System Dependencies
- libsndfile
- Linux
```shell
$ sudo apt-get install libsndfile1
or
$ sudo yum install libsndfile
```
- MacOS
```
$ brew install libsndfile
```
- ### 2. Environment Dependencies
- paddlepaddle >= 2.2.0
- paddlehub >= 2.1.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)
- ### 3. Installation
- ```shell
$ hub install u2_conformer_wenetspeech
```
- If you encounter problems during installation, please refer to: [Windows quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md)
| [Linux quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md)
## III. Model API Prediction
- ### 1. Prediction Code Example
```python
import paddlehub as hub
# Chinese speech audio in wav format with a 16k sample rate
wav_file = '/PATH/TO/AUDIO'
model = hub.Module(
name='u2_conformer_wenetspeech',
version='1.0.0')
text = model.speech_recognize(wav_file)
print(text)
```
- ### 2. API
- ```python
def check_audio(audio_file)
```
- Check whether the input audio is in wav format with a 16000 sample rate; if not, resample it to 16000 and save the new audio file to the same directory.
- **Parameters**
- `audio_file`: path to a local audio file (*.wav), e.g. `/path/to/input.wav`
- ```python
def speech_recognize(
audio_file,
device='cpu',
)
```
- Recognize the input audio as text.
- **Parameters**
- `audio_file`: path to a local audio file (*.wav), e.g. `/path/to/input.wav`
- `device`: device used for prediction, `cpu` by default; set to `gpu` to use a GPU.
- **Returns**
- `text`: str, the recognized text of the input audio.
## IV. Service Deployment
- PaddleHub Serving can deploy an online speech recognition service.
- ### Step 1: Start PaddleHub Serving
- ```shell
$ hub serving start -m u2_conformer_wenetspeech
```
- This deploys a speech recognition API service, listening on port 8866 by default.
- **NOTE:** To use GPU prediction, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise it is not needed.
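The GPU note above amounts to a one-line environment setting before launch; a minimal sketch, assuming a machine where GPU 0 should serve:

```shell
# Assumed single-GPU setup: expose only GPU 0 to the serving process,
# then start the service on the default port 8866.
export CUDA_VISIBLE_DEVICES=0
hub serving start -m u2_conformer_wenetspeech
```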
- ### Step 2: Send a Prediction Request
- With the server configured, the following lines of code send a prediction request and obtain the result
- ```python
import requests
import json
# Path to the audio file to recognize; make sure the serving machine can access it
file = '/path/to/input.wav'
# Pass parameters to the prediction method by key; here the key is "audio_file"
data = {"audio_file": file}
# Send a POST request; the content type should be json, and the IP in the url should be the serving machine's IP
url = "http://127.0.0.1:8866/predict/u2_conformer_wenetspeech"
# Set the POST request headers to application/json
headers = {"Content-Type": "application/json"}
r = requests.post(url=url, headers=headers, data=json.dumps(data))
print(r.json())
```
## V. Release History
* 1.0.0
  First release
```shell
$ hub install u2_conformer_wenetspeech
```
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os

import paddle
from paddleaudio import load, save_wav
from paddlespeech.cli import ASRExecutor

from paddlehub.module.module import moduleinfo, serving
from paddlehub.utils.log import logger


@moduleinfo(
    name="u2_conformer_wenetspeech", version="1.0.0", summary="", author="Wenet", author_email="", type="audio/asr")
class U2Conformer(paddle.nn.Layer):
    def __init__(self):
        super(U2Conformer, self).__init__()
        self.asr_executor = ASRExecutor()
        self.asr_kw_args = {
            'model': 'conformer_wenetspeech',
            'lang': 'zh',
            'sample_rate': 16000,
            'config': None,  # Set `config` and `ckpt_path` to None to use pretrained model.
            'ckpt_path': None,
        }

    @staticmethod
    def check_audio(audio_file):
        assert audio_file.endswith('.wav'), 'Input file must be a wave file `*.wav`.'
        sig, sample_rate = load(audio_file)
        if sample_rate != 16000:
            sig, _ = load(audio_file, 16000)
            audio_file_16k = audio_file[:audio_file.rindex('.')] + '_16k.wav'
            logger.info('Resampling to 16000 sample rate to new audio file: {}'.format(audio_file_16k))
            save_wav(sig, 16000, audio_file_16k)
            return audio_file_16k
        else:
            return audio_file

    @serving
    def speech_recognize(self, audio_file, device='cpu'):
        assert os.path.isfile(audio_file), 'File not exists: {}'.format(audio_file)
        audio_file = self.check_audio(audio_file)
        text = self.asr_executor(audio_file=audio_file, device=device, **self.asr_kw_args)
        return text
# kwmlp_speech_commands
|Model Name|kwmlp_speech_commands|
| :--- | :---: |
|Category|Speech - Keyword Spotting|
|Network|Keyword-MLP|
|Dataset|Google Speech Commands V2|
|Fine-tuning Supported|No|
|Model Size|1.6MB|
|Latest Update|2022-01-04|
|Metrics|ACC 97.56%|
## I. Basic Model Information
### Model Introduction
kwmlp_speech_commands uses the lightweight [Keyword-MLP](https://arxiv.org/pdf/2110.07749v1.pdf) architecture and is pre-trained on the [Google Speech Commands V2](https://arxiv.org/abs/1804.03209) dataset, achieving ACC 97.56% on its test set.
<p align="center">
<img src="https://d3i71xaburhd42.cloudfront.net/fa690a97f76ba119ca08fb02fa524a546c47f031/2-Figure1-1.png" hspace='10' height="550"/> <br />
</p>
For more details, please refer to:
- [Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition](https://arxiv.org/abs/1804.03209)
- [ATTENTION-FREE KEYWORD SPOTTING](https://arxiv.org/pdf/2110.07749v1.pdf)
- [Keyword-MLP](https://github.com/AI-Research-BD/Keyword-MLP)
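Keyword-MLP treats each MFCC time frame as a patch/token. The plain-numpy sketch below follows the `input_res=[40, 98]`, `patch_res=[40, 1]`, and `dim=64` values from this module's configuration; the projection weights are random and purely illustrative:

```python
import numpy as np

# A 1-second utterance becomes a (n_mels, frames) = (40, 98) MFCC "image".
n_mels, frames = 40, 98
mfcc = np.random.randn(n_mels, frames)

# With a patch size of (40, 1), each time frame is one token of dimension 40.
patches = mfcc.T                      # (98, 40): 98 tokens, 40 features each

# Linear patch embedding into the model dimension (dim=64 in the config);
# random weights here, for shape illustration only.
dim = 64
w = np.random.randn(n_mels, dim) / np.sqrt(n_mels)
tokens = patches @ w                  # (98, 64): token sequence fed to the gMLP blocks

print(tokens.shape)  # (98, 64)
```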
## II. Installation
- ### 1. Environment Dependencies
- paddlepaddle >= 2.2.0
- paddlehub >= 2.2.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)
- ### 2. Installation
- ```shell
$ hub install kwmlp_speech_commands
```
- If you encounter problems during installation, please refer to: [Windows quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md)
| [Linux quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md)
## III. Model API Prediction
- ### 1. Prediction Code Example
```python
import paddlehub as hub
model = hub.Module(
name='kwmlp_speech_commands',
version='1.0.0')
# Sample audio can be downloaded from the link below
# https://paddlehub.bj.bcebos.com/paddlehub_dev/go.wav
# Keyword spotting
score, label = model.keyword_recognize('no.wav')
print(score, label)
# [0.89498246] no
score, label = model.keyword_recognize('go.wav')
print(score, label)
# [0.8997176] go
score, label = model.keyword_recognize('one.wav')
print(score, label)
# [0.88598305] one
```
- ### 2. API
- ```python
def keyword_recognize(
wav: os.PathLike,
)
```
- Detect the keyword contained in the audio.
- **Parameters**
- `wav`: input audio file containing the keyword, in `*.wav` format.
- **Returns**
- The score of the result and the corresponding keyword label.
## IV. Release History
* 1.0.0
  First release
```shell
$ hub install kwmlp_speech_commands
```
import math

import numpy as np
import paddle
import paddleaudio


def create_dct(n_mfcc: int, n_mels: int, norm: str = 'ortho'):
    n = paddle.arange(float(n_mels))
    k = paddle.arange(float(n_mfcc)).unsqueeze(1)
    dct = paddle.cos(math.pi / float(n_mels) * (n + 0.5) * k)  # size (n_mfcc, n_mels)
    if norm is None:
        dct *= 2.0
    else:
        assert norm == "ortho"
        dct[0] *= 1.0 / math.sqrt(2.0)
        dct *= math.sqrt(2.0 / float(n_mels))
    return dct.t()


def compute_mfcc(
        x: paddle.Tensor,
        sr: int = 16000,
        n_mels: int = 40,
        n_fft: int = 480,
        win_length: int = 480,
        hop_length: int = 160,
        f_min: float = 0.0,
        f_max: float = None,
        center: bool = False,
        top_db: float = 80.0,
        norm: str = 'ortho',
):
    fbank = paddleaudio.features.spectrum.MelSpectrogram(
        sr=sr,
        n_mels=n_mels,
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
        f_min=0.0,
        f_max=f_max,
        center=center)(x)  # waveforms batch ~ (B, T)
    log_fbank = paddleaudio.features.spectrum.power_to_db(fbank, top_db=top_db)
    dct_matrix = create_dct(n_mfcc=n_mels, n_mels=n_mels, norm=norm)
    mfcc = paddle.matmul(log_fbank.transpose((0, 2, 1)), dct_matrix).transpose((0, 2, 1))  # (B, n_mels, L)
    return mfcc
import paddle
import paddle.nn as nn
import paddle.nn.functional as F


class Residual(nn.Layer):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x


class PreNorm(nn.Layer):
    def __init__(self, dim, fn):
        super().__init__()
        self.fn = fn
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, **kwargs):
        x = self.norm(x)
        return self.fn(x, **kwargs)


class PostNorm(nn.Layer):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.norm(self.fn(x, **kwargs))


class SpatialGatingUnit(nn.Layer):
    def __init__(self, dim, dim_seq, act=nn.Identity(), init_eps=1e-3):
        super().__init__()
        dim_out = dim // 2
        self.norm = nn.LayerNorm(dim_out)
        self.proj = nn.Conv1D(dim_seq, dim_seq, 1)
        self.act = act
        init_eps /= dim_seq

    def forward(self, x):
        res, gate = x.split(2, axis=-1)
        gate = self.norm(gate)
        weight, bias = self.proj.weight, self.proj.bias
        gate = F.conv1d(gate, weight, bias)
        return self.act(gate) * res


class gMLPBlock(nn.Layer):
    def __init__(self, *, dim, dim_ff, seq_len, act=nn.Identity()):
        super().__init__()
        self.proj_in = nn.Sequential(nn.Linear(dim, dim_ff), nn.GELU())
        self.sgu = SpatialGatingUnit(dim_ff, seq_len, act)
        self.proj_out = nn.Linear(dim_ff // 2, dim)

    def forward(self, x):
        x = self.proj_in(x)
        x = self.sgu(x)
        x = self.proj_out(x)
        return x


class Rearrange(nn.Layer):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        x = x.transpose([0, 1, 3, 2]).squeeze(1)
        return x


class Reduce(nn.Layer):
    def __init__(self, axis=1):
        super().__init__()
        self.axis = axis

    def forward(self, x):
        x = x.mean(axis=self.axis, keepdim=False)
        return x


class KW_MLP(nn.Layer):
    """Keyword-MLP."""

    def __init__(self,
                 input_res=[40, 98],
                 patch_res=[40, 1],
                 num_classes=35,
                 dim=64,
                 depth=12,
                 ff_mult=4,
                 channels=1,
                 prob_survival=0.9,
                 pre_norm=False,
                 **kwargs):
        super().__init__()
        image_height, image_width = input_res
        patch_height, patch_width = patch_res
        assert (image_height % patch_height) == 0 and (
            image_width % patch_width) == 0, 'image height and width must be divisible by patch size'
        num_patches = (image_height // patch_height) * (image_width // patch_width)

        P_Norm = PreNorm if pre_norm else PostNorm
        dim_ff = dim * ff_mult

        self.to_patch_embed = nn.Sequential(Rearrange(), nn.Linear(channels * patch_height * patch_width, dim))
        self.prob_survival = prob_survival
        self.layers = nn.LayerList(
            [Residual(P_Norm(dim, gMLPBlock(dim=dim, dim_ff=dim_ff, seq_len=num_patches))) for i in range(depth)])
        self.to_logits = nn.Sequential(nn.LayerNorm(dim), Reduce(axis=1), nn.Linear(dim, num_classes))

    def forward(self, x):
        x = self.to_patch_embed(x)
        layers = self.layers
        x = nn.Sequential(*layers)(x)
        return self.to_logits(x)
import os

import numpy as np
import paddle
import paddleaudio

from .feature import compute_mfcc
from .kwmlp import KW_MLP
from paddlehub.module.module import moduleinfo
from paddlehub.utils.log import logger


@moduleinfo(
    name="kwmlp_speech_commands",
    version="1.0.0",
    summary="",
    author="paddlepaddle",
    author_email="",
    type="audio/language_identification")
class KWS(paddle.nn.Layer):
    def __init__(self):
        super(KWS, self).__init__()
        ckpt_path = os.path.join(self.directory, 'assets', 'model.pdparams')
        label_path = os.path.join(self.directory, 'assets', 'label.txt')

        self.label_list = []
        with open(label_path, 'r') as f:
            for l in f:
                self.label_list.append(l.strip())

        self.sr = 16000
        model_conf = {
            'input_res': [40, 98],
            'patch_res': [40, 1],
            'num_classes': 35,
            'channels': 1,
            'dim': 64,
            'depth': 12,
            'pre_norm': False,
            'prob_survival': 0.9,
        }
        self.model = KW_MLP(**model_conf)
        self.model.set_state_dict(paddle.load(ckpt_path))
        self.model.eval()

    def load_audio(self, wav):
        wav = os.path.abspath(os.path.expanduser(wav))
        assert os.path.isfile(wav), 'Please check wav file: {}'.format(wav)
        waveform, _ = paddleaudio.load(wav, sr=self.sr, mono=True, normal=False)
        return waveform

    def keyword_recognize(self, wav):
        waveform = self.load_audio(wav)
        # fix_length to 1s
        if len(waveform) > self.sr:
            waveform = waveform[:self.sr]
        else:
            waveform = np.pad(waveform, (0, self.sr - len(waveform)))
        logits = self(paddle.to_tensor(waveform)).reshape([-1])
        probs = paddle.nn.functional.softmax(logits)
        idx = paddle.argmax(probs)
        return probs[idx].numpy(), self.label_list[idx]

    def forward(self, x):
        if len(x.shape) == 1:  # x: waveform tensors with (B, T) shape
            x = x.unsqueeze(0)
        mfcc = compute_mfcc(x).unsqueeze(1)  # (B, C, n_mels, L)
        logits = self.model(mfcc).squeeze(1)
        return logits
# ecapa_tdnn_common_language
|Model Name|ecapa_tdnn_common_language|
| :--- | :---: |
|Category|Speech - Language Identification|
|Network|ECAPA-TDNN|
|Dataset|CommonLanguage|
|Fine-tuning Supported|No|
|Model Size|79MB|
|Latest Update|2021-12-30|
|Metrics|ACC 84.9%|
## I. Basic Model Information
### Model Introduction
ecapa_tdnn_common_language uses the [ECAPA-TDNN](https://arxiv.org/abs/2005.07143) architecture and is pre-trained on the [CommonLanguage](https://zenodo.org/record/5036977/) dataset, achieving ACC 84.9% on its test set.
<p align="center">
<img src="https://d3i71xaburhd42.cloudfront.net/9609f4817a7e769f5e3e07084db35e46696e82cd/3-Figure2-1.png" hspace='10' height="550"/> <br />
</p>
For more details, please refer to:
- [CommonLanguage](https://zenodo.org/record/5036977#.Yc19b5Mzb0o)
- [ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification](https://arxiv.org/pdf/2005.07143.pdf)
- [The SpeechBrain Toolkit](https://github.com/speechbrain/speechbrain)
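A core ingredient of ECAPA-TDNN is attentive statistics pooling, which turns variable-length frame-level features into a fixed-size utterance embedding. The numpy sketch below is a simplified illustration: the real model computes attention scores with a small network, while the random projection `w` here is a stand-in:

```python
import numpy as np

def attentive_stats_pooling(h, w):
    """h: (C, T) frame features; w: (C,) attention projection (illustrative).
    Returns a (2C,) vector of attention-weighted mean and std."""
    scores = w @ h                            # (T,): one score per frame
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # softmax over time
    mean = (h * alpha).sum(axis=1)            # weighted mean, (C,)
    var = ((h - mean[:, None]) ** 2 * alpha).sum(axis=1)
    std = np.sqrt(np.clip(var, 1e-9, None))   # weighted std, (C,)
    return np.concatenate([mean, std])        # (2C,) utterance embedding

rng = np.random.default_rng(0)
h = rng.standard_normal((192, 100))           # 192 channels, 100 frames
emb = attentive_stats_pooling(h, rng.standard_normal(192))
print(emb.shape)  # (384,)
```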
## II. Installation
- ### 1. Environment Dependencies
  - paddlepaddle >= 2.2.0
  - paddlehub >= 2.2.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)
- ### 2. Installation
  - ```shell
    $ hub install ecapa_tdnn_common_language
    ```
  - In case of any problems during installation, please refer to: [Windows Quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md)
    | [Linux Quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS Quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md)
## III. Module API Prediction
- ### 1. Prediction Code Example
    ```python
    import paddlehub as hub

    model = hub.Module(
        name='ecapa_tdnn_common_language',
        version='1.0.0')

    # The example audio files can be downloaded via the following links:
    # https://paddlehub.bj.bcebos.com/paddlehub_dev/zh.wav
    # https://paddlehub.bj.bcebos.com/paddlehub_dev/en.wav
    # https://paddlehub.bj.bcebos.com/paddlehub_dev/it.wav

    # Language Identification
    score, label = model.language_identify('zh.wav')
    print(score, label)
    # array([0.6214552], dtype=float32), 'Chinese_China'
    score, label = model.language_identify('en.wav')
    print(score, label)
    # array([0.37193954], dtype=float32), 'English'
    score, label = model.language_identify('it.wav')
    print(score, label)
    # array([0.46913534], dtype=float32), 'Italian'
    ```
- ### 2. API
- ```python
  def language_identify(
      wav: os.PathLike,
  )
  ```
  - Identify the language of the input speech audio.
  - **Parameters**
    - `wav`: path of the input audio file, in `*.wav` format.
  - **Return**
    - Prediction score and the corresponding language label.
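The score/label pair returned by the API is the top-scoring class of the classifier output. A minimal numpy sketch of that post-processing step (the `scores` array and `label_list` below are made-up stand-ins for the classifier output and the module's `assets/label.txt` contents, not real module data):

```python
import numpy as np

def top1_language(scores, label_list):
    """Pick the best-scoring class and return (score, label).

    `scores` stands in for the per-language classifier output and
    `label_list` for the label file contents; both are hypothetical.
    """
    idx = int(np.argmax(scores))          # index of the best-scoring class
    return float(scores[idx]), label_list[idx]

score, label = top1_language(np.array([0.62, 0.11, -0.30]),
                             ['Chinese_China', 'English', 'Italian'])
print(score, label)  # 0.62 Chinese_China
```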
## IV. Release Note
* 1.0.0
  Initial release
  ```shell
  $ hub install ecapa_tdnn_common_language
  ```
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import os
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
def length_to_mask(length, max_len=None, dtype=None):
assert len(length.shape) == 1
if max_len is None:
max_len = length.max().astype('int').item() # using arange to generate mask
mask = paddle.arange(max_len, dtype=length.dtype).expand((len(length), max_len)) < length.unsqueeze(1)
if dtype is None:
dtype = length.dtype
mask = paddle.to_tensor(mask, dtype=dtype)
return mask
class Conv1d(nn.Layer):
def __init__(
self,
in_channels,
out_channels,
kernel_size,
stride=1,
padding="same",
dilation=1,
groups=1,
bias=True,
padding_mode="reflect",
):
super(Conv1d, self).__init__()
self.kernel_size = kernel_size
self.stride = stride
self.dilation = dilation
self.padding = padding
self.padding_mode = padding_mode
self.conv = nn.Conv1D(
in_channels,
out_channels,
self.kernel_size,
stride=self.stride,
padding=0,
dilation=self.dilation,
groups=groups,
bias_attr=bias,
)
def forward(self, x):
if self.padding == "same":
x = self._manage_padding(x, self.kernel_size, self.dilation, self.stride)
else:
            raise ValueError(f"Padding must be 'same'. Got {self.padding}")
return self.conv(x)
def _manage_padding(self, x, kernel_size: int, dilation: int, stride: int):
L_in = x.shape[-1] # Detecting input shape
padding = self._get_padding_elem(L_in, stride, kernel_size, dilation) # Time padding
x = F.pad(x, padding, mode=self.padding_mode, data_format="NCL") # Applying padding
return x
def _get_padding_elem(self, L_in: int, stride: int, kernel_size: int, dilation: int):
if stride > 1:
n_steps = math.ceil(((L_in - kernel_size * dilation) / stride) + 1)
L_out = stride * (n_steps - 1) + kernel_size * dilation
padding = [kernel_size // 2, kernel_size // 2]
else:
L_out = (L_in - dilation * (kernel_size - 1) - 1) // stride + 1
padding = [(L_in - L_out) // 2, (L_in - L_out) // 2]
return padding
class BatchNorm1d(nn.Layer):
def __init__(
self,
input_size,
eps=1e-05,
momentum=0.9,
weight_attr=None,
bias_attr=None,
data_format='NCL',
use_global_stats=None,
):
super(BatchNorm1d, self).__init__()
self.norm = nn.BatchNorm1D(
input_size,
epsilon=eps,
momentum=momentum,
weight_attr=weight_attr,
bias_attr=bias_attr,
data_format=data_format,
use_global_stats=use_global_stats,
)
def forward(self, x):
x_n = self.norm(x)
return x_n
class TDNNBlock(nn.Layer):
def __init__(
self,
in_channels,
out_channels,
kernel_size,
dilation,
activation=nn.ReLU,
):
super(TDNNBlock, self).__init__()
self.conv = Conv1d(
in_channels=in_channels,
out_channels=out_channels,
kernel_size=kernel_size,
dilation=dilation,
)
self.activation = activation()
self.norm = BatchNorm1d(input_size=out_channels)
def forward(self, x):
return self.norm(self.activation(self.conv(x)))
class Res2NetBlock(nn.Layer):
def __init__(self, in_channels, out_channels, scale=8, dilation=1):
super(Res2NetBlock, self).__init__()
assert in_channels % scale == 0
assert out_channels % scale == 0
in_channel = in_channels // scale
hidden_channel = out_channels // scale
self.blocks = nn.LayerList(
[TDNNBlock(in_channel, hidden_channel, kernel_size=3, dilation=dilation) for i in range(scale - 1)])
self.scale = scale
def forward(self, x):
y = []
for i, x_i in enumerate(paddle.chunk(x, self.scale, axis=1)):
if i == 0:
y_i = x_i
elif i == 1:
y_i = self.blocks[i - 1](x_i)
else:
y_i = self.blocks[i - 1](x_i + y_i)
y.append(y_i)
y = paddle.concat(y, axis=1)
return y
class SEBlock(nn.Layer):
def __init__(self, in_channels, se_channels, out_channels):
super(SEBlock, self).__init__()
self.conv1 = Conv1d(in_channels=in_channels, out_channels=se_channels, kernel_size=1)
self.relu = paddle.nn.ReLU()
self.conv2 = Conv1d(in_channels=se_channels, out_channels=out_channels, kernel_size=1)
self.sigmoid = paddle.nn.Sigmoid()
def forward(self, x, lengths=None):
L = x.shape[-1]
if lengths is not None:
mask = length_to_mask(lengths * L, max_len=L)
mask = mask.unsqueeze(1)
total = mask.sum(axis=2, keepdim=True)
s = (x * mask).sum(axis=2, keepdim=True) / total
else:
s = x.mean(axis=2, keepdim=True)
s = self.relu(self.conv1(s))
s = self.sigmoid(self.conv2(s))
return s * x
class AttentiveStatisticsPooling(nn.Layer):
def __init__(self, channels, attention_channels=128, global_context=True):
super().__init__()
self.eps = 1e-12
self.global_context = global_context
if global_context:
self.tdnn = TDNNBlock(channels * 3, attention_channels, 1, 1)
else:
self.tdnn = TDNNBlock(channels, attention_channels, 1, 1)
self.tanh = nn.Tanh()
self.conv = Conv1d(in_channels=attention_channels, out_channels=channels, kernel_size=1)
def forward(self, x, lengths=None):
C, L = x.shape[1], x.shape[2] # KP: (N, C, L)
def _compute_statistics(x, m, axis=2, eps=self.eps):
mean = (m * x).sum(axis)
std = paddle.sqrt((m * (x - mean.unsqueeze(axis)).pow(2)).sum(axis).clip(eps))
return mean, std
if lengths is None:
lengths = paddle.ones([x.shape[0]])
# Make binary mask of shape [N, 1, L]
mask = length_to_mask(lengths * L, max_len=L)
mask = mask.unsqueeze(1)
# Expand the temporal context of the pooling layer by allowing the
# self-attention to look at global properties of the utterance.
if self.global_context:
total = mask.sum(axis=2, keepdim=True).astype('float32')
mean, std = _compute_statistics(x, mask / total)
mean = mean.unsqueeze(2).tile((1, 1, L))
std = std.unsqueeze(2).tile((1, 1, L))
attn = paddle.concat([x, mean, std], axis=1)
else:
attn = x
# Apply layers
attn = self.conv(self.tanh(self.tdnn(attn)))
# Filter out zero-paddings
attn = paddle.where(mask.tile((1, C, 1)) == 0, paddle.ones_like(attn) * float("-inf"), attn)
attn = F.softmax(attn, axis=2)
mean, std = _compute_statistics(x, attn)
# Append mean and std of the batch
pooled_stats = paddle.concat((mean, std), axis=1)
pooled_stats = pooled_stats.unsqueeze(2)
return pooled_stats
class SERes2NetBlock(nn.Layer):
def __init__(
self,
in_channels,
out_channels,
res2net_scale=8,
se_channels=128,
kernel_size=1,
dilation=1,
activation=nn.ReLU,
):
super(SERes2NetBlock, self).__init__()
self.out_channels = out_channels
self.tdnn1 = TDNNBlock(
in_channels,
out_channels,
kernel_size=1,
dilation=1,
activation=activation,
)
self.res2net_block = Res2NetBlock(out_channels, out_channels, res2net_scale, dilation)
self.tdnn2 = TDNNBlock(
out_channels,
out_channels,
kernel_size=1,
dilation=1,
activation=activation,
)
self.se_block = SEBlock(out_channels, se_channels, out_channels)
self.shortcut = None
if in_channels != out_channels:
self.shortcut = Conv1d(
in_channels=in_channels,
out_channels=out_channels,
kernel_size=1,
)
def forward(self, x, lengths=None):
residual = x
if self.shortcut:
residual = self.shortcut(x)
x = self.tdnn1(x)
x = self.res2net_block(x)
x = self.tdnn2(x)
x = self.se_block(x, lengths)
return x + residual
class ECAPA_TDNN(nn.Layer):
def __init__(
self,
input_size,
lin_neurons=192,
activation=nn.ReLU,
channels=[512, 512, 512, 512, 1536],
kernel_sizes=[5, 3, 3, 3, 1],
dilations=[1, 2, 3, 4, 1],
attention_channels=128,
res2net_scale=8,
se_channels=128,
global_context=True,
):
super(ECAPA_TDNN, self).__init__()
assert len(channels) == len(kernel_sizes)
assert len(channels) == len(dilations)
self.channels = channels
self.blocks = nn.LayerList()
self.emb_size = lin_neurons
# The initial TDNN layer
self.blocks.append(TDNNBlock(
input_size,
channels[0],
kernel_sizes[0],
dilations[0],
activation,
))
# SE-Res2Net layers
for i in range(1, len(channels) - 1):
self.blocks.append(
SERes2NetBlock(
channels[i - 1],
channels[i],
res2net_scale=res2net_scale,
se_channels=se_channels,
kernel_size=kernel_sizes[i],
dilation=dilations[i],
activation=activation,
))
# Multi-layer feature aggregation
self.mfa = TDNNBlock(
channels[-1],
channels[-1],
kernel_sizes[-1],
dilations[-1],
activation,
)
# Attentive Statistical Pooling
self.asp = AttentiveStatisticsPooling(
channels[-1],
attention_channels=attention_channels,
global_context=global_context,
)
self.asp_bn = BatchNorm1d(input_size=channels[-1] * 2)
# Final linear transformation
self.fc = Conv1d(
in_channels=channels[-1] * 2,
out_channels=self.emb_size,
kernel_size=1,
)
def forward(self, x, lengths=None):
xl = []
for layer in self.blocks:
try:
x = layer(x, lengths=lengths)
except TypeError:
x = layer(x)
xl.append(x)
# Multi-layer feature aggregation
x = paddle.concat(xl[1:], axis=1)
x = self.mfa(x)
# Attentive Statistical Pooling
x = self.asp(x, lengths=lengths)
x = self.asp_bn(x)
# Final linear transformation
x = self.fc(x)
return x
class Classifier(nn.Layer):
def __init__(self, backbone, num_class, dtype=paddle.float32):
super(Classifier, self).__init__()
self.backbone = backbone
self.params = nn.ParameterList(
[paddle.create_parameter(shape=[num_class, self.backbone.emb_size], dtype=dtype)])
def forward(self, x):
emb = self.backbone(x.transpose([0, 2, 1])).transpose([0, 2, 1])
logits = F.linear(F.normalize(emb.squeeze(1)), F.normalize(self.params[0]).transpose([1, 0]))
return logits
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
import paddleaudio
from paddleaudio.features.spectrum import hz_to_mel
from paddleaudio.features.spectrum import mel_to_hz
from paddleaudio.features.spectrum import power_to_db
from paddleaudio.features.spectrum import Spectrogram
from paddleaudio.features.window import get_window
def compute_fbank_matrix(sample_rate: int = 16000,
n_fft: int = 400,
n_mels: int = 80,
f_min: int = 0.0,
f_max: int = 8000.0):
mel = paddle.linspace(hz_to_mel(f_min, htk=True), hz_to_mel(f_max, htk=True), n_mels + 2, dtype=paddle.float32)
hz = mel_to_hz(mel, htk=True)
band = hz[1:] - hz[:-1]
band = band[:-1]
f_central = hz[1:-1]
n_stft = n_fft // 2 + 1
all_freqs = paddle.linspace(0, sample_rate // 2, n_stft)
all_freqs_mat = all_freqs.tile([f_central.shape[0], 1])
f_central_mat = f_central.tile([all_freqs_mat.shape[1], 1]).transpose([1, 0])
band_mat = band.tile([all_freqs_mat.shape[1], 1]).transpose([1, 0])
slope = (all_freqs_mat - f_central_mat) / band_mat
left_side = slope + 1.0
right_side = -slope + 1.0
fbank_matrix = paddle.maximum(paddle.zeros_like(left_side), paddle.minimum(left_side, right_side))
return fbank_matrix
def compute_log_fbank(
x: paddle.Tensor,
sample_rate: int = 16000,
n_fft: int = 400,
hop_length: int = 160,
win_length: int = 400,
n_mels: int = 80,
window: str = 'hamming',
center: bool = True,
pad_mode: str = 'constant',
f_min: float = 0.0,
f_max: float = None,
top_db: float = 80.0,
):
if f_max is None:
f_max = sample_rate / 2
spect = Spectrogram(
n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=center, pad_mode=pad_mode)(x)
fbank_matrix = compute_fbank_matrix(
sample_rate=sample_rate,
n_fft=n_fft,
n_mels=n_mels,
f_min=f_min,
f_max=f_max,
)
fbank = paddle.matmul(fbank_matrix, spect)
log_fbank = power_to_db(fbank, top_db=top_db).transpose([0, 2, 1])
return log_fbank
def compute_stats(x: paddle.Tensor, mean_norm: bool = True, std_norm: bool = False, eps: float = 1e-10):
if mean_norm:
current_mean = paddle.mean(x, axis=0)
else:
current_mean = paddle.to_tensor([0.0])
if std_norm:
current_std = paddle.std(x, axis=0)
else:
current_std = paddle.to_tensor([1.0])
current_std = paddle.maximum(current_std, eps * paddle.ones_like(current_std))
return current_mean, current_std
def normalize(
x: paddle.Tensor,
global_mean: paddle.Tensor = None,
global_std: paddle.Tensor = None,
):
for i in range(x.shape[0]): # (B, ...)
if global_mean is None and global_std is None:
mean, std = compute_stats(x[i])
x[i] = (x[i] - mean) / std
else:
x[i] = (x[i] - global_mean) / global_std
return x
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
from typing import List
from typing import Union
import numpy as np
import paddle
import paddleaudio
from .ecapa_tdnn import Classifier
from .ecapa_tdnn import ECAPA_TDNN
from .feature import compute_log_fbank
from .feature import normalize
from paddlehub.module.module import moduleinfo
from paddlehub.utils.log import logger
@moduleinfo(
name="ecapa_tdnn_common_language",
version="1.0.0",
summary="",
author="paddlepaddle",
author_email="",
type="audio/language_identification")
class LanguageIdentification(paddle.nn.Layer):
def __init__(self):
super(LanguageIdentification, self).__init__()
ckpt_path = os.path.join(self.directory, 'assets', 'model.pdparams')
label_path = os.path.join(self.directory, 'assets', 'label.txt')
self.label_list = []
with open(label_path, 'r') as f:
for l in f:
self.label_list.append(l.strip())
self.sr = 16000
model_conf = {
'input_size': 80,
'channels': [1024, 1024, 1024, 1024, 3072],
'kernel_sizes': [5, 3, 3, 3, 1],
'dilations': [1, 2, 3, 4, 1],
'attention_channels': 128,
'lin_neurons': 192
}
self.model = Classifier(
backbone=ECAPA_TDNN(**model_conf),
num_class=45,
)
self.model.set_state_dict(paddle.load(ckpt_path))
self.model.eval()
def load_audio(self, wav):
wav = os.path.abspath(os.path.expanduser(wav))
assert os.path.isfile(wav), 'Please check wav file: {}'.format(wav)
waveform, _ = paddleaudio.load(wav, sr=self.sr, mono=True, normal=False)
return waveform
def language_identify(self, wav):
waveform = self.load_audio(wav)
logits = self(paddle.to_tensor(waveform)).reshape([-1])
idx = paddle.argmax(logits)
return logits[idx].numpy(), self.label_list[idx]
def forward(self, x):
        if len(x.shape) == 1:  # x: waveform tensors with (B, T) shape
            x = x.unsqueeze(0)
        fbank = compute_log_fbank(x)
norm_fbank = normalize(fbank)
logits = self.model(norm_fbank).squeeze(1)
return logits
# ecapa_tdnn_voxceleb
|Module Name|ecapa_tdnn_voxceleb|
| :--- | :---: |
|Category|Speech - Speaker Recognition|
|Network|ECAPA-TDNN|
|Dataset|VoxCeleb|
|Fine-tuning supported|No|
|Module Size|79MB|
|Latest update date|2021-12-30|
|Data indicators|EER 0.69%|
## I. Basic Information
### Module Introduction
ecapa_tdnn_voxceleb adopts the [ECAPA-TDNN](https://arxiv.org/abs/2005.07143) architecture and is pre-trained on the [VoxCeleb](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/) dataset. On the VoxCeleb1 speaker verification test set ([veri_test.txt](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test.txt)), it achieves EER 0.69%, a state-of-the-art result on this dataset.
<p align="center">
<img src="https://d3i71xaburhd42.cloudfront.net/9609f4817a7e769f5e3e07084db35e46696e82cd/3-Figure2-1.png" hspace='10' height="550"/> <br />
</p>
For more details, please refer to:
- [VoxCeleb: a large-scale speaker identification dataset](https://www.robots.ox.ac.uk/~vgg/publications/2017/Nagrani17/nagrani17.pdf)
- [ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification](https://arxiv.org/pdf/2005.07143.pdf)
- [The SpeechBrain Toolkit](https://github.com/speechbrain/speechbrain)
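The EER (Equal Error Rate) metric quoted above is the operating point where the false-accept rate equals the false-reject rate over a set of scored trial pairs. A minimal, self-contained sketch of how it can be computed (the scores and labels below are toy data, not module outputs, and this is not the module's own evaluation code):

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal Error Rate via a simple threshold sweep.

    `scores` are similarity scores for trial pairs; `labels` are 1 for
    same-speaker pairs and 0 for different-speaker pairs.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = float('inf'), 1.0
    for t in np.sort(scores):
        far = np.mean(scores[labels == 0] >= t)  # false-accept rate
        frr = np.mean(scores[labels == 1] < t)   # false-reject rate
        if abs(far - frr) < best_gap:            # keep the crossing point
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]
print(compute_eer(scores, labels))  # 0.333... (FAR = FRR = 1/3 at t = 0.7)
```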
## II. Installation
- ### 1. Environment Dependencies
  - paddlepaddle >= 2.2.0
  - paddlehub >= 2.2.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)
- ### 2. Installation
  - ```shell
    $ hub install ecapa_tdnn_voxceleb
    ```
  - In case of any problems during installation, please refer to: [Windows Quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md)
    | [Linux Quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS Quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md)
## III. Module API Prediction
- ### 1. Prediction Code Example
    ```python
    import paddlehub as hub

    model = hub.Module(
        name='ecapa_tdnn_voxceleb',
        threshold=0.25,
        version='1.0.0')

    # The example audio files can be downloaded via the following links:
    # https://paddlehub.bj.bcebos.com/paddlehub_dev/sv1.wav
    # https://paddlehub.bj.bcebos.com/paddlehub_dev/sv2.wav

    # Speaker Embedding
    embedding = model.speaker_embedding('sv1.wav')
    print(embedding.shape)
    # (192,)

    # Speaker Verification
    score, pred = model.speaker_verify('sv1.wav', 'sv2.wav')
    print(score, pred)
    # [0.16354457], [False]
    ```
- ### 2. API
- ```python
  def __init__(
      threshold: float,
  )
  ```
  - Initialize the speaker model and set the decision threshold.
  - **Parameters**
    - `threshold`: score threshold above which two utterances are judged to be from the same speaker. Default is 0.25.
- ```python
  def speaker_embedding(
      wav: os.PathLike,
  )
  ```
  - Extract the speaker embedding of the input audio.
  - **Parameters**
    - `wav`: path of the speaker's audio file, in `*.wav` format.
  - **Return**
    - A speaker embedding vector of shape (192,).
- ```python
  def speaker_verify(
      wav1: os.PathLike,
      wav2: os.PathLike,
  )
  ```
  - Compare two audio clips: compute the similarity score between their speaker embeddings and decide whether they belong to the same speaker.
  - **Parameters**
    - `wav1`: audio file of speaker 1, in `*.wav` format.
    - `wav2`: audio file of speaker 2, in `*.wav` format.
  - **Return**
    - Similarity score in [-1, 1] and the same-speaker prediction.
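Conceptually, the verification step reduces to a cosine similarity between the two embeddings followed by a threshold comparison, which is why the score lies in [-1, 1]. A hedged numpy sketch of that idea (not the module's internal implementation; the 2-dimensional embeddings below are toy values):

```python
import numpy as np

def cosine_verify(emb1, emb2, threshold=0.25):
    """Cosine similarity in [-1, 1] between two embedding vectors,
    plus a same-speaker decision against `threshold`.

    A conceptual sketch only; real speaker embeddings would be the
    192-dimensional vectors returned by speaker_embedding.
    """
    emb1 = np.asarray(emb1, dtype=float)
    emb2 = np.asarray(emb2, dtype=float)
    score = float(np.dot(emb1, emb2) /
                  (np.linalg.norm(emb1) * np.linalg.norm(emb2)))
    return score, score >= threshold

score, pred = cosine_verify([1.0, 0.0], [1.0, 1.0])
print(score, pred)  # 0.7071067811865475 True
```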
## IV. Release Note
* 1.0.0
  Initial release
  ```shell
  $ hub install ecapa_tdnn_voxceleb
  ```
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import os
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
def length_to_mask(length, max_len=None, dtype=None):
assert len(length.shape) == 1
if max_len is None:
max_len = length.max().astype('int').item() # using arange to generate mask
mask = paddle.arange(max_len, dtype=length.dtype).expand((len(length), max_len)) < length.unsqueeze(1)
if dtype is None:
dtype = length.dtype
mask = paddle.to_tensor(mask, dtype=dtype)
return mask
class Conv1d(nn.Layer):
def __init__(
self,
in_channels,
out_channels,
kernel_size,
stride=1,
padding="same",
dilation=1,
groups=1,
bias=True,
padding_mode="reflect",
):
super(Conv1d, self).__init__()
self.kernel_size = kernel_size
self.stride = stride
self.dilation = dilation
self.padding = padding
self.padding_mode = padding_mode
self.conv = nn.Conv1D(
in_channels,
out_channels,
self.kernel_size,
stride=self.stride,
padding=0,
dilation=self.dilation,
groups=groups,
bias_attr=bias,
)
def forward(self, x):
if self.padding == "same":
x = self._manage_padding(x, self.kernel_size, self.dilation, self.stride)
else:
            raise ValueError(f"Padding must be 'same'. Got {self.padding}")
return self.conv(x)
def _manage_padding(self, x, kernel_size: int, dilation: int, stride: int):
L_in = x.shape[-1] # Detecting input shape
padding = self._get_padding_elem(L_in, stride, kernel_size, dilation) # Time padding
x = F.pad(x, padding, mode=self.padding_mode, data_format="NCL") # Applying padding
return x
def _get_padding_elem(self, L_in: int, stride: int, kernel_size: int, dilation: int):
if stride > 1:
n_steps = math.ceil(((L_in - kernel_size * dilation) / stride) + 1)
L_out = stride * (n_steps - 1) + kernel_size * dilation
padding = [kernel_size // 2, kernel_size // 2]
else:
L_out = (L_in - dilation * (kernel_size - 1) - 1) // stride + 1
padding = [(L_in - L_out) // 2, (L_in - L_out) // 2]
return padding
class BatchNorm1d(nn.Layer):
def __init__(
self,
input_size,
eps=1e-05,
momentum=0.9,
weight_attr=None,
bias_attr=None,
data_format='NCL',
use_global_stats=None,
):
super(BatchNorm1d, self).__init__()
self.norm = nn.BatchNorm1D(
input_size,
epsilon=eps,
momentum=momentum,
weight_attr=weight_attr,
bias_attr=bias_attr,
data_format=data_format,
use_global_stats=use_global_stats,
)
def forward(self, x):
x_n = self.norm(x)
return x_n
class TDNNBlock(nn.Layer):
def __init__(
self,
in_channels,
out_channels,
kernel_size,
dilation,
activation=nn.ReLU,
):
super(TDNNBlock, self).__init__()
self.conv = Conv1d(
in_channels=in_channels,
out_channels=out_channels,
kernel_size=kernel_size,
dilation=dilation,
)
self.activation = activation()
self.norm = BatchNorm1d(input_size=out_channels)
def forward(self, x):
return self.norm(self.activation(self.conv(x)))
class Res2NetBlock(nn.Layer):
def __init__(self, in_channels, out_channels, scale=8, dilation=1):
super(Res2NetBlock, self).__init__()
assert in_channels % scale == 0
assert out_channels % scale == 0
in_channel = in_channels // scale
hidden_channel = out_channels // scale
self.blocks = nn.LayerList(
[TDNNBlock(in_channel, hidden_channel, kernel_size=3, dilation=dilation) for i in range(scale - 1)])
self.scale = scale
def forward(self, x):
y = []
for i, x_i in enumerate(paddle.chunk(x, self.scale, axis=1)):
if i == 0:
y_i = x_i
elif i == 1:
y_i = self.blocks[i - 1](x_i)
else:
y_i = self.blocks[i - 1](x_i + y_i)
y.append(y_i)
y = paddle.concat(y, axis=1)
return y
class SEBlock(nn.Layer):
def __init__(self, in_channels, se_channels, out_channels):
super(SEBlock, self).__init__()
self.conv1 = Conv1d(in_channels=in_channels, out_channels=se_channels, kernel_size=1)
self.relu = paddle.nn.ReLU()
self.conv2 = Conv1d(in_channels=se_channels, out_channels=out_channels, kernel_size=1)
self.sigmoid = paddle.nn.Sigmoid()
def forward(self, x, lengths=None):
L = x.shape[-1]
if lengths is not None:
mask = length_to_mask(lengths * L, max_len=L)
mask = mask.unsqueeze(1)
total = mask.sum(axis=2, keepdim=True)
s = (x * mask).sum(axis=2, keepdim=True) / total
else:
s = x.mean(axis=2, keepdim=True)
s = self.relu(self.conv1(s))
s = self.sigmoid(self.conv2(s))
return s * x
class AttentiveStatisticsPooling(nn.Layer):
def __init__(self, channels, attention_channels=128, global_context=True):
super().__init__()
self.eps = 1e-12
self.global_context = global_context
if global_context:
self.tdnn = TDNNBlock(channels * 3, attention_channels, 1, 1)
else:
self.tdnn = TDNNBlock(channels, attention_channels, 1, 1)
self.tanh = nn.Tanh()
self.conv = Conv1d(in_channels=attention_channels, out_channels=channels, kernel_size=1)
def forward(self, x, lengths=None):
C, L = x.shape[1], x.shape[2] # KP: (N, C, L)
def _compute_statistics(x, m, axis=2, eps=self.eps):
mean = (m * x).sum(axis)
std = paddle.sqrt((m * (x - mean.unsqueeze(axis)).pow(2)).sum(axis).clip(eps))
return mean, std
if lengths is None:
lengths = paddle.ones([x.shape[0]])
# Make binary mask of shape [N, 1, L]
mask = length_to_mask(lengths * L, max_len=L)
mask = mask.unsqueeze(1)
# Expand the temporal context of the pooling layer by allowing the
# self-attention to look at global properties of the utterance.
if self.global_context:
total = mask.sum(axis=2, keepdim=True).astype('float32')
mean, std = _compute_statistics(x, mask / total)
mean = mean.unsqueeze(2).tile((1, 1, L))
std = std.unsqueeze(2).tile((1, 1, L))
attn = paddle.concat([x, mean, std], axis=1)
else:
attn = x
# Apply layers
attn = self.conv(self.tanh(self.tdnn(attn)))
# Filter out zero-paddings
attn = paddle.where(mask.tile((1, C, 1)) == 0, paddle.ones_like(attn) * float("-inf"), attn)
attn = F.softmax(attn, axis=2)
mean, std = _compute_statistics(x, attn)
# Append mean and std of the batch
pooled_stats = paddle.concat((mean, std), axis=1)
pooled_stats = pooled_stats.unsqueeze(2)
return pooled_stats
class SERes2NetBlock(nn.Layer):
def __init__(
self,
in_channels,
out_channels,
res2net_scale=8,
se_channels=128,
kernel_size=1,
dilation=1,
activation=nn.ReLU,
):
super(SERes2NetBlock, self).__init__()
self.out_channels = out_channels
self.tdnn1 = TDNNBlock(
in_channels,
out_channels,
kernel_size=1,
dilation=1,
activation=activation,
)
self.res2net_block = Res2NetBlock(out_channels, out_channels, res2net_scale, dilation)
self.tdnn2 = TDNNBlock(
out_channels,
out_channels,
kernel_size=1,
dilation=1,
activation=activation,
)
self.se_block = SEBlock(out_channels, se_channels, out_channels)
self.shortcut = None
if in_channels != out_channels:
self.shortcut = Conv1d(
in_channels=in_channels,
out_channels=out_channels,
kernel_size=1,
)
def forward(self, x, lengths=None):
residual = x
if self.shortcut:
residual = self.shortcut(x)
x = self.tdnn1(x)
x = self.res2net_block(x)
x = self.tdnn2(x)
x = self.se_block(x, lengths)
return x + residual
class ECAPA_TDNN(nn.Layer):
def __init__(
self,
input_size,
lin_neurons=192,
activation=nn.ReLU,
channels=[512, 512, 512, 512, 1536],
kernel_sizes=[5, 3, 3, 3, 1],
dilations=[1, 2, 3, 4, 1],
attention_channels=128,
res2net_scale=8,
se_channels=128,
global_context=True,
):
super(ECAPA_TDNN, self).__init__()
assert len(channels) == len(kernel_sizes)
assert len(channels) == len(dilations)
self.channels = channels
self.blocks = nn.LayerList()
self.emb_size = lin_neurons
# The initial TDNN layer
self.blocks.append(TDNNBlock(
input_size,
channels[0],
kernel_sizes[0],
dilations[0],
activation,
))
# SE-Res2Net layers
for i in range(1, len(channels) - 1):
self.blocks.append(
SERes2NetBlock(
channels[i - 1],
channels[i],
res2net_scale=res2net_scale,
se_channels=se_channels,
kernel_size=kernel_sizes[i],
dilation=dilations[i],
activation=activation,
))
# Multi-layer feature aggregation
self.mfa = TDNNBlock(
channels[-1],
channels[-1],
kernel_sizes[-1],
dilations[-1],
activation,
)
# Attentive Statistical Pooling
self.asp = AttentiveStatisticsPooling(
channels[-1],
attention_channels=attention_channels,
global_context=global_context,
)
self.asp_bn = BatchNorm1d(input_size=channels[-1] * 2)
# Final linear transformation
self.fc = Conv1d(
in_channels=channels[-1] * 2,
out_channels=self.emb_size,
kernel_size=1,
)
def forward(self, x, lengths=None):
xl = []
for layer in self.blocks:
try:
x = layer(x, lengths=lengths)
except TypeError:
x = layer(x)
xl.append(x)
# Multi-layer feature aggregation
x = paddle.concat(xl[1:], axis=1)
x = self.mfa(x)
# Attentive Statistical Pooling
x = self.asp(x, lengths=lengths)
x = self.asp_bn(x)
# Final linear transformation
x = self.fc(x)
return x
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
import paddleaudio
from paddleaudio.features.spectrum import hz_to_mel
from paddleaudio.features.spectrum import mel_to_hz
from paddleaudio.features.spectrum import power_to_db
from paddleaudio.features.spectrum import Spectrogram
from paddleaudio.features.window import get_window
def compute_fbank_matrix(sample_rate: int = 16000,
n_fft: int = 400,
n_mels: int = 80,
f_min: int = 0.0,
f_max: int = 8000.0):
mel = paddle.linspace(hz_to_mel(f_min, htk=True), hz_to_mel(f_max, htk=True), n_mels + 2, dtype=paddle.float32)
hz = mel_to_hz(mel, htk=True)
band = hz[1:] - hz[:-1]
band = band[:-1]
f_central = hz[1:-1]
n_stft = n_fft // 2 + 1
all_freqs = paddle.linspace(0, sample_rate // 2, n_stft)
all_freqs_mat = all_freqs.tile([f_central.shape[0], 1])
f_central_mat = f_central.tile([all_freqs_mat.shape[1], 1]).transpose([1, 0])
band_mat = band.tile([all_freqs_mat.shape[1], 1]).transpose([1, 0])
slope = (all_freqs_mat - f_central_mat) / band_mat
left_side = slope + 1.0
right_side = -slope + 1.0
fbank_matrix = paddle.maximum(paddle.zeros_like(left_side), paddle.minimum(left_side, right_side))
return fbank_matrix
def compute_log_fbank(
x: paddle.Tensor,
sample_rate: int = 16000,
n_fft: int = 400,
hop_length: int = 160,
win_length: int = 400,
n_mels: int = 80,
window: str = 'hamming',
center: bool = True,
pad_mode: str = 'constant',
f_min: float = 0.0,
f_max: float = None,
top_db: float = 80.0,
):
if f_max is None:
f_max = sample_rate / 2
spect = Spectrogram(
n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=center, pad_mode=pad_mode)(x)
fbank_matrix = compute_fbank_matrix(
sample_rate=sample_rate,
n_fft=n_fft,
n_mels=n_mels,
f_min=f_min,
f_max=f_max,
)
fbank = paddle.matmul(fbank_matrix, spect)
log_fbank = power_to_db(fbank, top_db=top_db).transpose([0, 2, 1])
return log_fbank
def compute_stats(x: paddle.Tensor, mean_norm: bool = True, std_norm: bool = False, eps: float = 1e-10):
if mean_norm:
current_mean = paddle.mean(x, axis=0)
else:
current_mean = paddle.to_tensor([0.0])
if std_norm:
current_std = paddle.std(x, axis=0)
else:
current_std = paddle.to_tensor([1.0])
current_std = paddle.maximum(current_std, eps * paddle.ones_like(current_std))
return current_mean, current_std
def normalize(
x: paddle.Tensor,
global_mean: paddle.Tensor = None,
global_std: paddle.Tensor = None,
):
for i in range(x.shape[0]): # (B, ...)
if global_mean is None and global_std is None:
mean, std = compute_stats(x[i])
x[i] = (x[i] - mean) / std
else:
x[i] = (x[i] - global_mean) / global_std
return x
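With the defaults above (`mean_norm=True`, `std_norm=False`), the per-utterance normalization in `normalize` reduces to subtracting the per-feature mean over the time axis. A minimal NumPy sketch of one utterance (`normalize_utterance` is an illustrative helper, not part of the module):

```python
import numpy as np

def normalize_utterance(x, mean_norm=True, std_norm=False, eps=1e-10):
    # Statistics over the time axis (axis 0), mirroring compute_stats above
    mean = x.mean(axis=0) if mean_norm else 0.0
    std = np.maximum(x.std(axis=0), eps) if std_norm else 1.0
    return (x - mean) / std

feats = np.array([[1.0, 2.0],
                  [3.0, 6.0]])  # (time, features)
out = normalize_utterance(feats)
print(out)  # each feature column now has zero mean
```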
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
from typing import List
from typing import Union
import numpy as np
import paddle
import paddleaudio
from .ecapa_tdnn import ECAPA_TDNN
from .feature import compute_log_fbank
from .feature import normalize
from paddlehub.module.module import moduleinfo
from paddlehub.utils.log import logger
@moduleinfo(
name="ecapa_tdnn_voxceleb",
version="1.0.0",
summary="",
author="paddlepaddle",
author_email="",
type="audio/speaker_recognition")
class SpeakerRecognition(paddle.nn.Layer):
def __init__(self, threshold=0.25):
super(SpeakerRecognition, self).__init__()
global_stats_path = os.path.join(self.directory, 'assets', 'global_embedding_stats.npy')
ckpt_path = os.path.join(self.directory, 'assets', 'model.pdparams')
self.sr = 16000
self.threshold = threshold
model_conf = {
'input_size': 80,
'channels': [1024, 1024, 1024, 1024, 3072],
'kernel_sizes': [5, 3, 3, 3, 1],
'dilations': [1, 2, 3, 4, 1],
'attention_channels': 128,
'lin_neurons': 192
}
self.model = ECAPA_TDNN(**model_conf)
self.model.set_state_dict(paddle.load(ckpt_path))
self.model.eval()
global_embedding_stats = np.load(global_stats_path, allow_pickle=True)
self.global_emb_mean = paddle.to_tensor(global_embedding_stats.item().get('global_emb_mean'))
self.global_emb_std = paddle.to_tensor(global_embedding_stats.item().get('global_emb_std'))
self.similarity = paddle.nn.CosineSimilarity(axis=-1, eps=1e-6)
def load_audio(self, wav):
wav = os.path.abspath(os.path.expanduser(wav))
assert os.path.isfile(wav), 'Please check wav file: {}'.format(wav)
waveform, _ = paddleaudio.load(wav, sr=self.sr, mono=True, normal=False)
return waveform
def speaker_embedding(self, wav):
waveform = self.load_audio(wav)
embedding = self(paddle.to_tensor(waveform)).reshape([-1])
return embedding.numpy()
def speaker_verify(self, wav1, wav2):
waveform1 = self.load_audio(wav1)
embedding1 = self(paddle.to_tensor(waveform1)).reshape([-1])
waveform2 = self.load_audio(wav2)
embedding2 = self(paddle.to_tensor(waveform2)).reshape([-1])
score = self.similarity(embedding1, embedding2).numpy()
return score, score > self.threshold
def forward(self, x):
if len(x.shape) == 1:
x = x.unsqueeze(0)
fbank = compute_log_fbank(x) # x: waveform tensors with (B, T) shape
norm_fbank = normalize(fbank)
embedding = self.model(norm_fbank.transpose([0, 2, 1])).transpose([0, 2, 1])
norm_embedding = normalize(x=embedding, global_mean=self.global_emb_mean, global_std=self.global_emb_std)
return norm_embedding
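The `speaker_verify` decision above reduces to a cosine similarity between two embeddings compared against `self.threshold`. A minimal NumPy sketch of that decision rule (`cosine_verify` is an illustrative helper, not part of the module):

```python
import numpy as np

def cosine_verify(emb1, emb2, threshold=0.25, eps=1e-6):
    # Cosine similarity with an eps guard, then a threshold decision
    denom = max(np.linalg.norm(emb1) * np.linalg.norm(emb2), eps)
    score = float(np.dot(emb1, emb2) / denom)
    return score, score > threshold

# Nearly identical embeddings pass; orthogonal embeddings fail.
same = cosine_verify(np.array([1.0, 0.0]), np.array([1.0, 0.1]))
diff = cosine_verify(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(same, diff)
```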
# deepvoice3_ljspeech

|Module Name|deepvoice3_ljspeech|
| :--- | :---: |
|Category|Speech - Text-to-Speech|
|Network|DeepVoice3|
|Dataset|LJSpeech-1.1|
|Fine-tuning supported|No|
|Module Size|58MB|
|Latest update date|2020-10-27|
|Data indicators|-|

## I. Basic Information

### Module Introduction

Deep Voice 3 is an end-to-end TTS model released by Baidu Research in 2017 (the paper was accepted at ICLR 2018). It is a seq2seq model based on convolutional neural networks and attention. Since it contains no recurrent networks, it can be trained in parallel and is much faster than RNN-based models. Deep Voice 3 can learn the characteristics of multiple speakers and works with several vocoders. deepvoice3_ljspeech is an English TTS model pre-trained on the LJSpeech English speech dataset and only supports prediction.

<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Parakeet/release/v0.1/examples/deepvoice3/images/model_architecture.png" hspace='10'/> <br/>
</p>

For more details, see the paper [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654)

## II. Installation

- ### 1. System dependencies

  For Ubuntu users, run:
  ```
  sudo apt-get install libsndfile1
  ```
  For CentOS users, run:
  ```
  sudo yum install libsndfile
  ```

- ### 2. Environment dependencies

  - 2.0.0 > paddlepaddle >= 1.8.2

  - 2.0.0 > paddlehub >= 1.7.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)

- ### 3. Installation

  - ```shell
    $ hub install deepvoice3_ljspeech
    ```
  - If you run into installation problems, see: [Windows quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md)
    | [Linux quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md)

## III. Module API Prediction

- ### 1. Command-line prediction

  - ```shell
    $ hub run deepvoice3_ljspeech --input_text='Simple as this proposition is, it is necessary to be stated' --use_gpu True --vocoder griffin-lim
    ```
  - For more command-line usage, see [PaddleHub command-line instructions](https://github.com/shinichiye/PaddleHub/blob/release/v2.1/docs/docs_ch/tutorial/cmd_usage.rst)

- ### 2. Prediction code example

  - ```python
    import paddlehub as hub
    import soundfile as sf

    # Load deepvoice3_ljspeech module.
    module = hub.Module(name="deepvoice3_ljspeech")

    # Synthesize speech for the input texts.
    test_texts = ['Simple as this proposition is, it is necessary to be stated',
                  'Parakeet stands for Paddle PARAllel text-to-speech toolkit']
    wavs, sample_rate = module.synthesize(texts=test_texts)
    for index, wav in enumerate(wavs):
        sf.write(f"{index}.wav", wav, sample_rate)
    ```

- ### 3. API

  - ```python
    def synthesize(texts, use_gpu=False, vocoder="griffin-lim"):
    ```
    - Prediction API that synthesizes the audio waveforms for the input texts.

    - **Parameters**
      - texts (list\[str\]): texts to synthesize;
      - use\_gpu (bool): whether to use the GPU; **if you use the GPU, set the CUDA\_VISIBLE\_DEVICES environment variable first**;
      - vocoder: the vocoder to use, either "griffin-lim" or "waveflow"

    - **Returns**
      - wavs (list): synthesis results; each element is the audio waveform of the corresponding input text and can be further processed or saved with `soundfile.write`.
      - sample\_rate (int): sample rate of the synthesized audio.

## IV. Server Deployment

- PaddleHub Serving can deploy an online text-to-speech service, and this interface can be used by online web applications.

- ### Step 1: Start PaddleHub Serving

  - Run the start command:
    - ```shell
      $ hub serving start -m deepvoice3_ljspeech
      ```
  - This deploys the API service; the default port is 8866.
  - **NOTE:** to predict on the GPU, set the CUDA\_VISIBLE\_DEVICES environment variable before starting the service; otherwise it is not needed.

- ### Step 2: Send a prediction request

  - With the server configured, the following lines send a prediction request and fetch the result

  - ```python
    import requests
    import json

    import soundfile as sf

    # Send an HTTP request
    data = {'texts':['Simple as this proposition is, it is necessary to be stated',
                     'Parakeet stands for Paddle PARAllel text-to-speech toolkit'],
            'use_gpu':False}
    headers = {"Content-type": "application/json"}
    url = "http://127.0.0.1:8866/predict/deepvoice3_ljspeech"
    r = requests.post(url=url, headers=headers, data=json.dumps(data))

    # Save the results
    result = r.json()["results"]
    wavs = result["wavs"]
    sample_rate = result["sample_rate"]
    for index, wav in enumerate(wavs):
        sf.write(f"{index}.wav", wav, sample_rate)
    ```

## V. Release Note

* 1.0.0

  First release

  ```shell
  $ hub install deepvoice3_ljspeech
  ```
# fastspeech_ljspeech

|Module Name|fastspeech_ljspeech|
| :--- | :---: |
|Category|Speech - Text-to-Speech|
|Network|FastSpeech|
|Dataset|LJSpeech-1.1|
|Fine-tuning supported|No|
|Module Size|320MB|
|Latest update date|2020-10-27|
|Data indicators|-|

## I. Basic Information

### Module Introduction

FastSpeech is a feed-forward network based on Transformer. The authors extract attention alignments from an encoder-decoder teacher model to predict phoneme durations, and use a length regulator to expand the source text sequence to match the length of the target mel-spectrogram, so that mel-spectrograms can be generated in parallel. The model essentially eliminates word skipping and repetition in hard cases, can smoothly adjust speech speed, and, more importantly, greatly speeds up mel-spectrogram generation. fastspeech_ljspeech is an English TTS model pre-trained on the LJSpeech English speech dataset and only supports prediction.

<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Parakeet/release/v0.1/examples/fastspeech/images/model_architecture.png" hspace='10'/> <br/>
</p>

For more details, see the paper [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263)

## II. Installation

- ### 1. System dependencies

  For Ubuntu users, run:
  ```
  sudo apt-get install libsndfile1
  ```
  For CentOS users, run:
  ```
  sudo yum install libsndfile
  ```

- ### 2. Environment dependencies

  - 2.0.0 > paddlepaddle >= 1.8.2

  - 2.0.0 > paddlehub >= 1.7.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)

- ### 3. Installation

  - ```shell
    $ hub install fastspeech_ljspeech
    ```
  - If you run into installation problems, see: [Windows quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md)
    | [Linux quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md)

## III. Module API Prediction

- ### 1. Command-line prediction

  - ```shell
    $ hub run fastspeech_ljspeech --input_text='Simple as this proposition is, it is necessary to be stated' --use_gpu True --vocoder griffin-lim
    ```
  - For more command-line usage, see [PaddleHub command-line instructions](https://github.com/shinichiye/PaddleHub/blob/release/v2.1/docs/docs_ch/tutorial/cmd_usage.rst)

- ### 2. Prediction code example

  - ```python
    import paddlehub as hub
    import soundfile as sf

    # Load fastspeech_ljspeech module.
    module = hub.Module(name="fastspeech_ljspeech")

    # Synthesize speech for the input texts.
    test_texts = ['Simple as this proposition is, it is necessary to be stated',
                  'Parakeet stands for Paddle PARAllel text-to-speech toolkit']
    wavs, sample_rate = module.synthesize(texts=test_texts)
    for index, wav in enumerate(wavs):
        sf.write(f"{index}.wav", wav, sample_rate)
    ```

- ### 3. API

  - ```python
    def synthesize(texts, use_gpu=False, speed=1.0, vocoder="griffin-lim"):
    ```
    - Prediction API that synthesizes the audio waveforms for the input texts.

    - **Parameters**
      - texts (list\[str\]): texts to synthesize;
      - use\_gpu (bool): whether to use the GPU; **if you use the GPU, set the CUDA\_VISIBLE\_DEVICES environment variable first**;
      - speed (float): speech speed; 1.0 means the original speed.
      - vocoder: the vocoder to use, either "griffin-lim" or "waveflow"

    - **Returns**
      - wavs (list): synthesis results; each element is the audio waveform of the corresponding input text and can be further processed or saved with `soundfile.write`.
      - sample\_rate (int): sample rate of the synthesized audio.

## IV. Server Deployment

- PaddleHub Serving can deploy an online text-to-speech service, and this interface can be used by online web applications.

- ### Step 1: Start PaddleHub Serving

  - Run the start command:
    - ```shell
      $ hub serving start -m fastspeech_ljspeech
      ```
  - This deploys the API service; the default port is 8866.
  - **NOTE:** to predict on the GPU, set the CUDA\_VISIBLE\_DEVICES environment variable before starting the service; otherwise it is not needed.

- ### Step 2: Send a prediction request

  - With the server configured, the following lines send a prediction request and fetch the result

  - ```python
    import requests
    import json

    import soundfile as sf

    # Send an HTTP request
    data = {'texts':['Simple as this proposition is, it is necessary to be stated',
                     'Parakeet stands for Paddle PARAllel text-to-speech toolkit'],
            'use_gpu':False}
    headers = {"Content-type": "application/json"}
    url = "http://127.0.0.1:8866/predict/fastspeech_ljspeech"
    r = requests.post(url=url, headers=headers, data=json.dumps(data))

    # Save the results
    result = r.json()["results"]
    wavs = result["wavs"]
    sample_rate = result["sample_rate"]
    for index, wav in enumerate(wavs):
        sf.write(f"{index}.wav", wav, sample_rate)
    ```

## V. Release Note

* 1.0.0

  First release

  ```shell
  $ hub install fastspeech_ljspeech
  ```
# transformer_tts_ljspeech

|Module Name|transformer_tts_ljspeech|
| :--- | :---: |
|Category|Speech - Text-to-Speech|
|Network|Transformer|
|Dataset|LJSpeech-1.1|
|Fine-tuning supported|No|
|Module Size|54MB|
|Latest update date|2020-10-27|
|Data indicators|-|

## I. Basic Information

### Module Introduction

TransformerTTS is an end-to-end TTS model built on the Transformer architecture; it combines Transformer and Tacotron2 and achieves satisfying results. Removing the recurrent connections of RNNs lets the decoder inputs be provided in parallel for parallel training, which greatly speeds up model training. transformer_tts_ljspeech is an English TTS model pre-trained on the LJSpeech English speech dataset and only supports prediction.

<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Parakeet/release/v0.1/examples/transformer_tts/images/model_architecture.jpg" hspace='10'/> <br/>
</p>

For more details, see the paper [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895)

## II. Installation

- ### 1. System dependencies

  For Ubuntu users, run:
  ```
  sudo apt-get install libsndfile1
  ```
  For CentOS users, run:
  ```
  sudo yum install libsndfile
  ```

- ### 2. Environment dependencies

  - 2.0.0 > paddlepaddle >= 1.8.2

  - 2.0.0 > paddlehub >= 1.7.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)

- ### 3. Installation

  - ```shell
    $ hub install transformer_tts_ljspeech
    ```
  - If you run into installation problems, see: [Windows quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md)
    | [Linux quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md)

## III. Module API Prediction

- ### 1. Command-line prediction

  - ```shell
    $ hub run transformer_tts_ljspeech --input_text="Life was like a box of chocolates, you never know what you're gonna get." --use_gpu True --vocoder griffin-lim
    ```
  - For more command-line usage, see [PaddleHub command-line instructions](https://github.com/shinichiye/PaddleHub/blob/release/v2.1/docs/docs_ch/tutorial/cmd_usage.rst)

- ### 2. Prediction code example

  - ```python
    import paddlehub as hub
    import soundfile as sf

    # Load transformer_tts_ljspeech module.
    module = hub.Module(name="transformer_tts_ljspeech")

    # Synthesize speech for the input texts.
    test_texts = ["Life was like a box of chocolates, you never know what you're gonna get."]
    wavs, sample_rate = module.synthesize(texts=test_texts, use_gpu=True, vocoder="waveflow")
    for index, wav in enumerate(wavs):
        sf.write(f"{index}.wav", wav, sample_rate)
    ```

- ### 3. API

  - ```python
    def synthesize(texts, use_gpu=False, vocoder="griffin-lim"):
    ```
    - Prediction API that synthesizes the audio waveforms for the input texts.

    - **Parameters**
      - texts (list\[str\]): texts to synthesize;
      - use\_gpu (bool): whether to use the GPU; **if you use the GPU, set the CUDA\_VISIBLE\_DEVICES environment variable first**;
      - vocoder: the vocoder to use, either "griffin-lim" or "waveflow"

    - **Returns**
      - wavs (list): synthesis results; each element is the audio waveform of the corresponding input text and can be further processed or saved with `soundfile.write`.
      - sample\_rate (int): sample rate of the synthesized audio.

## IV. Server Deployment

- PaddleHub Serving can deploy an online text-to-speech service, and this interface can be used by online web applications.

- ### Step 1: Start PaddleHub Serving

  - Run the start command:
    - ```shell
      $ hub serving start -m transformer_tts_ljspeech
      ```
  - This deploys the API service; the default port is 8866.
  - **NOTE:** to predict on the GPU, set the CUDA\_VISIBLE\_DEVICES environment variable before starting the service; otherwise it is not needed.

- ### Step 2: Send a prediction request

  - With the server configured, the following lines send a prediction request and fetch the result

  - ```python
    import requests
    import json

    import soundfile as sf

    # Send an HTTP request
    data = {'texts':['Simple as this proposition is, it is necessary to be stated',
                     'Parakeet stands for Paddle PARAllel text-to-speech toolkit'],
            'use_gpu':False}
    headers = {"Content-type": "application/json"}
    url = "http://127.0.0.1:8866/predict/transformer_tts_ljspeech"
    r = requests.post(url=url, headers=headers, data=json.dumps(data))

    # Save the results
    result = r.json()["results"]
    wavs = result["wavs"]
    sample_rate = result["sample_rate"]
    for index, wav in enumerate(wavs):
        sf.write(f"{index}.wav", wav, sample_rate)
    ```

## V. Release Note

* 1.0.0

  First release

  ```shell
  $ hub install transformer_tts_ljspeech
  ```
# ge2e_fastspeech2_pwgan

|Module Name|ge2e_fastspeech2_pwgan|
| :--- | :---: |
|Category|Speech - Voice Cloning|
|Network|FastSpeech2|
|Dataset|AISHELL-3|
|Fine-tuning supported|No|
|Module Size|462MB|
|Latest update date|2021-12-17|
|Data indicators|-|

## I. Basic Information

### Module Introduction

Voice cloning synthesizes audio that reads given text with a specific timbre, so that the synthesized audio carries the characteristics of a target speaker.

During training, the target voice is fed to a Speaker Encoder, which extracts the speaker characteristics (timbre) of the recording as a Speaker Embedding. When the model is then trained to re-synthesize speech with that timbre, the speaker characteristics are added as an extra condition alongside the input text.

At prediction time, a new target recording is fed to the Speaker Encoder to extract its speaker characteristics; given a piece of text and a target voice, the model generates a speech clip of the target voice reading that text.

![](https://ai-studio-static-online.cdn.bcebos.com/982ab955b87244d3bae3b003aff8e28d9ec159ff0d6246a79757339076dfe7d4)

`ge2e_fastspeech2_pwgan` is a voice-cloning model that supports Chinese. It uses LSTMSpeakerEncoder, FastSpeech2, and PWGan for speech feature extraction, target audio feature synthesis, and waveform generation, respectively.

For model details, see [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech)

## II. Installation

- ### 1. Environment dependencies

  - paddlepaddle >= 2.2.0

  - paddlehub >= 2.1.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)

- ### 2. Installation

  - ```shell
    $ hub install ge2e_fastspeech2_pwgan
    ```
  - If you run into installation problems, see: [Windows quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md)
    | [Linux quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md)

## III. Module API Prediction

- ### 1. Prediction code example

  - ```python
    import paddlehub as hub

    model = hub.Module(name='ge2e_fastspeech2_pwgan', output_dir='./', speaker_audio='/data/man.wav')  # the target voice audio file
    texts = [
        '语音的表现形式在未来将变得越来越重要$',
        '今天的天气怎么样$', ]
    wavs = model.generate(texts, use_gpu=True)

    for text, wav in zip(texts, wavs):
        print('='*30)
        print(f'Text: {text}')
        print(f'Wav: {wav}')
    ```

- ### 2. API

  - ```python
    def __init__(speaker_audio: str = None,
                 output_dir: str = './')
    ```
    - Initialize the module; the target voice audio file and the output directory are configurable.
    - **Parameters**
      - `speaker_audio` (str): path to the target speaker's audio file (\*.wav); defaults to None (a default female voice is used as the target timbre).
      - `output_dir` (str): output directory for the synthesized audio; defaults to the current directory.

  - ```python
    def get_speaker_embedding()
    ```
    - Get the target speaker characteristics of the model.
    - **Returns**
      - `results` (numpy.ndarray): numpy array of length 256 representing the target speaker's characteristics.

  - ```python
    def set_speaker_embedding(speaker_audio: str)
    ```
    - Set the target speaker characteristics of the model.
    - **Parameters**
      - `speaker_audio` (str): required, path to the target speaker's audio file (\*.wav).

  - ```python
    def generate(data: Union[str, List[str]], use_gpu: bool = False):
    ```
    - Synthesize the target speaker's speech for the input texts.
    - **Parameters**
      - `data` (Union[str, List[str]]): required, text of the target audio; currently only Chinese is supported, without punctuation.
      - `use_gpu` (bool): whether to use the GPU; defaults to False.

## IV. Release Note

* 1.0.0

  First release

  ```shell
  $ hub install ge2e_fastspeech2_pwgan
  ```
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from typing import List, Union
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from paddlehub.env import MODULE_HOME
from paddlehub.module.module import moduleinfo, serving
from paddlehub.utils.log import logger
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore
from paddlespeech.vector.exps.ge2e.audio_processor import SpeakerVerificationPreprocessor
from paddlespeech.vector.models.lstm_speaker_encoder import LSTMSpeakerEncoder
@moduleinfo(
name="ge2e_fastspeech2_pwgan",
version="1.0.0",
summary="",
author="paddlepaddle",
author_email="",
type="audio/voice_cloning",
)
class VoiceCloner(paddle.nn.Layer):
def __init__(self, speaker_audio: str = None, output_dir: str = './'):
super(VoiceCloner, self).__init__()
speaker_encoder_ckpt = os.path.join(MODULE_HOME, 'ge2e_fastspeech2_pwgan', 'assets',
'ge2e_ckpt_0.3/step-3000000.pdparams')
synthesizer_res_dir = os.path.join(MODULE_HOME, 'ge2e_fastspeech2_pwgan', 'assets',
'fastspeech2_nosil_aishell3_vc1_ckpt_0.5')
vocoder_res_dir = os.path.join(MODULE_HOME, 'ge2e_fastspeech2_pwgan', 'assets', 'pwg_aishell3_ckpt_0.5')
# Speaker encoder
self.speaker_processor = SpeakerVerificationPreprocessor(
sampling_rate=16000,
audio_norm_target_dBFS=-30,
vad_window_length=30,
vad_moving_average_width=8,
vad_max_silence_length=6,
mel_window_length=25,
mel_window_step=10,
n_mels=40,
partial_n_frames=160,
min_pad_coverage=0.75,
partial_overlap_ratio=0.5)
self.speaker_encoder = LSTMSpeakerEncoder(n_mels=40, num_layers=3, hidden_size=256, output_size=256)
self.speaker_encoder.set_state_dict(paddle.load(speaker_encoder_ckpt))
self.speaker_encoder.eval()
# Voice synthesizer
with open(os.path.join(synthesizer_res_dir, 'default.yaml'), 'r') as f:
fastspeech2_config = CfgNode(yaml.safe_load(f))
with open(os.path.join(synthesizer_res_dir, 'phone_id_map.txt'), 'r') as f:
phn_id = [line.strip().split() for line in f.readlines()]
model = FastSpeech2(idim=len(phn_id), odim=fastspeech2_config.n_mels, **fastspeech2_config["model"])
model.set_state_dict(paddle.load(os.path.join(synthesizer_res_dir, 'snapshot_iter_96400.pdz'))["main_params"])
model.eval()
stat = np.load(os.path.join(synthesizer_res_dir, 'speech_stats.npy'))
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
fastspeech2_normalizer = ZScore(mu, std)
self.sample_rate = fastspeech2_config.fs
self.fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
self.fastspeech2_inference.eval()
# Vocoder
with open(os.path.join(vocoder_res_dir, 'default.yaml')) as f:
pwg_config = CfgNode(yaml.safe_load(f))
vocoder = PWGGenerator(**pwg_config["generator_params"])
vocoder.set_state_dict(
paddle.load(os.path.join(vocoder_res_dir, 'snapshot_iter_1000000.pdz'))["generator_params"])
vocoder.remove_weight_norm()
vocoder.eval()
stat = np.load(os.path.join(vocoder_res_dir, 'feats_stats.npy'))
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
pwg_normalizer = ZScore(mu, std)
self.pwg_inference = PWGInference(pwg_normalizer, vocoder)
self.pwg_inference.eval()
# Text frontend
self.frontend = Frontend(phone_vocab_path=os.path.join(synthesizer_res_dir, 'phone_id_map.txt'))
# Speaking embedding
self._speaker_embedding = None
if speaker_audio is None or not os.path.isfile(speaker_audio):
speaker_audio = os.path.join(MODULE_HOME, 'ge2e_fastspeech2_pwgan', 'assets', 'voice_cloning.wav')
            logger.warning(f'No speaker audio is specified, so the speaker encoder will use the default '
                           f'waveform ({speaker_audio}) to extract the speaker embedding. You can use the '
                           '"set_speaker_embedding()" method to reset the speaker audio for voice cloning.')
self.set_speaker_embedding(speaker_audio)
self.output_dir = os.path.abspath(output_dir)
if not os.path.exists(self.output_dir):
os.makedirs(self.output_dir)
def get_speaker_embedding(self):
return self._speaker_embedding.numpy()
@paddle.no_grad()
def set_speaker_embedding(self, speaker_audio: str):
        assert os.path.exists(speaker_audio), f'Speaker audio file: {speaker_audio} does not exist.'
mel_sequences = self.speaker_processor.extract_mel_partials(
self.speaker_processor.preprocess_wav(speaker_audio))
self._speaker_embedding = self.speaker_encoder.embed_utterance(paddle.to_tensor(mel_sequences))
logger.info(f'Speaker embedding has been set from file: {speaker_audio}')
@paddle.no_grad()
def generate(self, data: Union[str, List[str]], use_gpu: bool = False):
        assert self._speaker_embedding is not None, 'Set speaker embedding before voice cloning.'
if isinstance(data, str):
data = [data]
elif isinstance(data, list):
            assert len(data) > 0 and isinstance(data[0],
                                                str) and len(data[0]) > 0, 'Input data should be str or List[str].'
        else:
            raise Exception('Input data should be str or List[str].')
paddle.set_device('gpu') if use_gpu else paddle.set_device('cpu')
files = []
for idx, text in enumerate(data):
phone_ids = self.frontend.get_input_ids(text, merge_sentences=True)["phone_ids"][0]
wav = self.pwg_inference(self.fastspeech2_inference(phone_ids, spk_emb=self._speaker_embedding))
output_wav = os.path.join(self.output_dir, f'{idx+1}.wav')
sf.write(output_wav, wav.numpy(), samplerate=self.sample_rate)
files.append(output_wav)
return files
```shell
$ hub install lstm_tacotron2==1.0.0
```
# lstm_tacotron2
|模型名称|lstm_tacotron2|
| :--- | :---: |
|类别|语音-语音合成|
|网络|LSTM、Tacotron2、WaveFlow|
|数据集|AISHELL-3|
|是否支持Fine-tuning|否|
|模型大小|327MB|
|最新更新日期|2021-06-15|
|数据指标|-|
## 一、模型基本信息
## 概述
### 模型介绍
声音克隆是指使用特定的音色,结合文字的读音合成音频,使得合成后的音频具有目标说话人的特征,从而达到克隆的目的。
......@@ -10,93 +20,107 @@ $ hub install lstm_tacotron2==1.0.0
在预测时,选取一段新的目标音色作为Speaker Encoder的输入,并提取其说话人特征,最终实现输入为一段文本和一段目标音色,模型生成目标音色说出此段文本的语音片段。
![](https://ai-studio-static-online.cdn.bcebos.com/982ab955b87244d3bae3b003aff8e28d9ec159ff0d6246a79757339076dfe7d4)
<p align="center">
<img src="https://ai-studio-static-online.cdn.bcebos.com/982ab955b87244d3bae3b003aff8e28d9ec159ff0d6246a79757339076dfe7d4" hspace='10'/> <br/>
</p>
`lstm_tacotron2`是一个支持中文的语音克隆模型,分别使用了LSTMSpeakerEncoder、Tacotron2和WaveFlow模型分别用于语音特征提取、目标音频特征合成和语音波形转换。
关于模型的详请可参考[Parakeet](https://github.com/PaddlePaddle/Parakeet/tree/release/v0.3/parakeet/models)
更多详情请参考:
- [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf)
- [Parakeet](https://github.com/PaddlePaddle/Parakeet/tree/release/v0.3/parakeet/models)

## II. Installation

- ### 1、Environmental Dependence

  - paddlepaddle >= 2.0.0

  - paddlehub >= 2.1.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)

- ### 2、Installation

  - ```shell
    $ hub install lstm_tacotron2
    ```
  - In case of any problems during installation, please refer to: [Windows_Quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md)
    | [Linux_Quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [Mac_Quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md)

## III. Module API Prediction

- ### 1、Prediction Code Example

  - ```python
    import paddlehub as hub

    # Specify the audio file that provides the target voice
    model = hub.Module(name='lstm_tacotron2', output_dir='/data', speaker_audio='/data/man.wav')

    texts = [
        '语音的表现形式在未来将变得越来越重要$',
        '今天的天气怎么样$', ]
    wavs = model.generate(texts, use_gpu=True)

    for text, wav in zip(texts, wavs):
        print('='*30)
        print(f'Text: {text}')
        print(f'Wav: {wav}')
    ```

  - Output
    ```
    ==============================
    Text: 语音的表现形式在未来将变得越来越重要$
    Wav: /data/1.wav
    ==============================
    Text: 今天的天气怎么样$
    Wav: /data/2.wav
    ```

- ### 2、API

  - ```python
    def __init__(speaker_audio: str = None,
                 output_dir: str = './')
    ```
    - Initialize the module. The audio file providing the target voice and the output directory can be configured.

    - **Parameters**
      - `speaker_audio` (str): Path to the target speaker's audio file (\*.wav). Defaults to None, in which case a default female voice is used as the target voice.
      - `output_dir` (str): Output directory for the synthesized audio. Defaults to the current directory.

  - ```python
    def get_speaker_embedding()
    ```
    - Get the target speaker embedding used by the model.

    - **Return**
      - `results` (numpy.ndarray): A numpy array of length 256 representing the target speaker embedding.

  - ```python
    def set_speaker_embedding(speaker_audio: str)
    ```
    - Set the target speaker embedding of the model.

    - **Parameter**
      - `speaker_audio` (str): Required. Path to the target speaker's audio file (\*.wav).

  - ```python
    def generate(data: List[str], batch_size: int = 1, use_gpu: bool = False):
    ```
    - Synthesize audio files in the target speaker's voice from the input texts.

    - **Parameters**
      - `data` (List[str]): Required. List of texts to synthesize. Currently only Chinese is supported, and punctuation is not supported.
      - `batch_size` (int): Optional. Batch size used when synthesizing speech. Defaults to 1.
      - `use_gpu` (bool): Whether to use the GPU for computation. Defaults to False.
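The `batch_size` argument of `generate` controls how many texts are synthesized per forward pass. The chunking it implies can be sketched as follows (an assumption about the internals, shown only for intuition):

```python
def batches(data, batch_size=1):
    # Split the input list into consecutive chunks of at most batch_size items.
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

print(batches(['t1', 't2', 't3'], batch_size=2))  # [['t1', 't2'], ['t3']]
```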
## IV. Release Note
* 1.0.0

  First release
```shell
$ hub install lstm_tacotron2==1.0.0
```
......@@ -53,14 +53,14 @@

## III. Module API Prediction

- ### 1、Prediction Code Example

- ```python
  import paddlehub as hub

  model = hub.Module(name='deoldify')
  model.predict('/PATH/TO/IMAGE/OR/VIDEO')
  ```

- ### 2、API
......
# deoldify
| Module Name |deoldify|
| :--- | :---: |
|Category|Image editing|
|Network |NoGAN|
|Dataset|ILSVRC 2012|
|Fine-tuning supported or not |No|
|Module Size |834MB|
|Data indicators|-|
|Latest update date |2021-04-13|
## I. Basic Information
- ### Application Effect Display
- Sample results:
<p align="center">
<img src="https://user-images.githubusercontent.com/35907364/130886749-668dfa38-42ed-4a09-8d4a-b18af0475375.jpg" width = "450" height = "300" hspace='10'/> <img src="https://user-images.githubusercontent.com/35907364/130886685-76221736-839a-46a2-8415-e5e0dd3b345e.png" width = "450" height = "300" hspace='10'/>
</p>
- ### Module Introduction
- Deoldify is a color rendering model for images and videos, which can restore color for black and white photos and videos.
- For more information, please refer to: [deoldify](https://github.com/jantic/DeOldify)
## II. Installation
- ### 1、Environmental Dependence
- paddlepaddle >= 2.0.0
- paddlehub >= 2.0.0
- NOTE: This module relies on ffmpeg. Please install ffmpeg before using this module.
```shell
$ conda install x264=='1!152.20180717' ffmpeg=4.0.2 -c conda-forge
```
- ### 2、Installation
- ```shell
$ hub install deoldify
```
- In case of any problems during installation, please refer to: [Windows_Quickstart](../../../../docs/docs_en/get_start/windows_quickstart.md)
| [Linux_Quickstart](../../../../docs/docs_en/get_start/linux_quickstart.md) | [Mac_Quickstart](../../../../docs/docs_en/get_start/mac_quickstart.md)
## III. Module API Prediction
- ### 1、Prediction Code Example
- ```python
import paddlehub as hub
model = hub.Module(name='deoldify')
model.predict('/PATH/TO/IMAGE/OR/VIDEO')
```
- ### 2、API
- ```python
def predict(self, input):
```
- Prediction API.
- **Parameter**
- input (str): Image path.
- **Return**
- If input is image path, the output is:
- pred_img(np.ndarray): image data, ndarray.shape is in the format [H, W, C], BGR.
- out_path(str): save path of images.
- If input is video path, the output is :
- frame_pattern_combined(str): save path of frames from output video.
- vid_out_path(str): save path of output video.
- ```python
def run_image(self, img):
```
- Prediction API for image.
- **Parameter**
- img (str|np.ndarray): Image data, str or ndarray. ndarray.shape is in the format [H, W, C], BGR.
- **Return**
- pred_img(np.ndarray): Ndarray.shape is in the format [H, W, C], BGR.
- ```python
def run_video(self, video):
```
- Prediction API for video.
- **Parameter**
- video(str): Video path.
- **Return**
- frame_pattern_combined(str): Save path of frames from output video.
- vid_out_path(str): Save path of output video.
## IV. Server Deployment
- PaddleHub Serving can deploy an online service of coloring old photos or videos.
- ### Step 1: Start PaddleHub Serving
- Run the startup command:
- ```shell
$ hub serving start -m deoldify
```
- The serving API is now deployed, with the default port number 8866.
- **NOTE:** If GPU is used for prediction, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise it does not need to be set.
- ### Step 2: Send a predictive request
- With a configured server, use the following lines of code to send the prediction request and obtain the result.
- ```python
import requests
import json
import base64
import cv2
import numpy as np
def cv2_to_base64(image):
    data = cv2.imencode('.jpg', image)[1]
    return base64.b64encode(data.tobytes()).decode('utf8')

def base64_to_cv2(b64str):
    data = base64.b64decode(b64str.encode('utf8'))
    data = np.frombuffer(data, np.uint8)
    data = cv2.imdecode(data, cv2.IMREAD_COLOR)
    return data
# Send an HTTP request
org_im = cv2.imread('/PATH/TO/ORIGIN/IMAGE')
data = {'images':cv2_to_base64(org_im)}
headers = {"Content-type": "application/json"}
url = "http://127.0.0.1:8866/predict/deoldify"
r = requests.post(url=url, headers=headers, data=json.dumps(data))
img = base64_to_cv2(r.json()["results"])
cv2.imwrite('/PATH/TO/SAVE/IMAGE', img)
```
## V. Release Note
- 1.0.0
First release
- 1.0.1
Adapt to paddlehub2.0
......@@ -51,17 +51,17 @@

## III. Module API Prediction

- ### 1、Prediction Code Example

- ```python
  import cv2
  import paddlehub as hub

  model = hub.Module(name='photo_restoration', visualization=True)
  im = cv2.imread('/PATH/TO/IMAGE')
  res = model.run_image(im)
  ```

- ### 2、API
......
# photo_restoration
|Module Name|photo_restoration|
| :--- | :---: |
|Category|Image editing|
|Network|deoldify and realsr|
|Fine-tuning supported or not|No|
|Module Size |64MB+834MB|
|Data indicators|-|
|Latest update date|2021-08-19|
## I. Basic Information
- ### Application Effect Display
- Sample results:
<p align="center">
<img src="https://user-images.githubusercontent.com/35907364/130897828-d0c86b81-63d1-4e9a-8095-bc000b8c7ca8.jpg" width = "260" height = "400" hspace='10'/> <img src="https://user-images.githubusercontent.com/35907364/130897762-5c9fa711-62bc-4067-8d44-f8feff8c574c.png" width = "260" height = "400" hspace='10'/>
</p>
- ### Module Introduction
- Photo_restoration can restore old photos. It mainly consists of two parts: colorization and super-resolution. The colorization model is deoldify, and the super-resolution model is realsr. Therefore, when using this module, please install deoldify and realsr in advance.
## II. Installation
- ### 1、Environmental Dependence
- paddlepaddle >= 2.0.0
- paddlehub >= 2.0.0
- NOTE: This module relies on ffmpeg. Please install ffmpeg before using this module.
```shell
$ conda install x264=='1!152.20180717' ffmpeg=4.0.2 -c conda-forge
```
- ### 2、Installation
- ```shell
$ hub install photo_restoration
```
- In case of any problems during installation, please refer to: [Windows_Quickstart](../../../../docs/docs_en/get_start/windows_quickstart.md)
| [Linux_Quickstart](../../../../docs/docs_en/get_start/linux_quickstart.md) | [Mac_Quickstart](../../../../docs/docs_en/get_start/mac_quickstart.md)
## III. Module API Prediction
- ### 1、Prediction Code Example
- ```python
import cv2
import paddlehub as hub
model = hub.Module(name='photo_restoration', visualization=True)
im = cv2.imread('/PATH/TO/IMAGE')
res = model.run_image(im)
```
- ### 2、API
- ```python
def run_image(self,
input,
model_select= ['Colorization', 'SuperResolution'],
save_path = 'photo_restoration'):
```
- Prediction API that produces restored photos.
- **Parameter**
- input (numpy.ndarray|str): Image data, numpy.ndarray or str. ndarray.shape is in the format [H, W, C], BGR.
- model_select (list\[str\]): Mode selection. \['Colorization'\] only colorizes the input image; \['SuperResolution'\] only increases the image resolution. Default is \['Colorization', 'SuperResolution'\].
- save_path (str): Save path, default is 'photo_restoration'.
- **Return**
- output (numpy.ndarray): Restoration result, ndarray.shape is in the format [H, W, C], BGR.
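The `model_select` list gates which of the two stages run. An illustrative sketch of that gating — the stage functions below are stand-ins, not the real deoldify/realsr models:

```python
# Stand-in stages; in the real module these are deoldify and realsr.
STAGES = {
    'Colorization': lambda im: im + '+color',
    'SuperResolution': lambda im: im + '+sr',
}

def restore(image, model_select=('Colorization', 'SuperResolution')):
    # Apply only the selected stages, in a fixed order: colorize, then upscale.
    for name in ('Colorization', 'SuperResolution'):
        if name in model_select:
            image = STAGES[name](image)
    return image

print(restore('img'))                       # img+color+sr
print(restore('img', ['SuperResolution']))  # img+sr
```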
## IV. Server Deployment
- PaddleHub Serving can deploy an online service of photo restoration.
- ### Step 1: Start PaddleHub Serving
- Run the startup command:
- ```shell
$ hub serving start -m photo_restoration
```
- The serving API is now deployed, with the default port number 8866.
- **NOTE:** If GPU is used for prediction, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise it does not need to be set.
- ### Step 2: Send a predictive request
- With a configured server, use the following lines of code to send the prediction request and obtain the result.
- ```python
import requests
import json
import base64
import cv2
import numpy as np
def cv2_to_base64(image):
    data = cv2.imencode('.jpg', image)[1]
    return base64.b64encode(data.tobytes()).decode('utf8')

def base64_to_cv2(b64str):
    data = base64.b64decode(b64str.encode('utf8'))
    data = np.frombuffer(data, np.uint8)
    data = cv2.imdecode(data, cv2.IMREAD_COLOR)
    return data
# Send an HTTP request
org_im = cv2.imread('PATH/TO/IMAGE')
data = {'images':cv2_to_base64(org_im), 'model_select': ['Colorization', 'SuperResolution']}
headers = {"Content-type": "application/json"}
url = "http://127.0.0.1:8866/predict/photo_restoration"
r = requests.post(url=url, headers=headers, data=json.dumps(data))
img = base64_to_cv2(r.json()["results"])
cv2.imwrite('PATH/TO/SAVE/IMAGE', img)
```
## V. Release Note
- 1.0.0
First release
- 1.0.1
Adapt to paddlehub2.0
......@@ -22,7 +22,7 @@
<img src="https://user-images.githubusercontent.com/35907364/136653401-6644bd46-d280-4c15-8d48-680b7eb152cb.png" width = "300" height = "450" hspace='10'/> <img src="https://user-images.githubusercontent.com/35907364/136648959-40493c9c-08ec-46cd-a2a2-5e2038dcbfa7.png" width = "300" height = "450" hspace='10'/>
</p>

- user_guided_colorization is a colorization model based on "Real-Time User-Guided Image Colorization with Learned Deep Priors", which colorizes an image using color hints provided in advance.

## II. Installation
......
......@@ -51,7 +51,7 @@
- ```
$ hub run falsr_c --input_path "/PATH/TO/IMAGE"
```

- ### 2、Prediction Code Example

```python
import cv2
......@@ -65,7 +65,7 @@
sr_model.save_inference_model()
```

- ### 3、API

- ```python
def reconstruct(self,
......
......@@ -57,7 +57,7 @@

## III. Module API Prediction

- ### 1、Prediction Code Example

```python
import paddlehub as hub
......
(3 collapsed file diffs not shown)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import cv2
import numpy as np
import paddle
from PIL import Image
from PIL import ImageOps
from ppgan.models.generators import SPADEGenerator
from ppgan.utils.filesystem import load
from ppgan.utils.photopen import data_onehot_pro
class PhotoPenPredictor:
    def __init__(self, weight_path, gen_cfg):
        # Build the SPADE generator and load the pretrained weights.
        gen = SPADEGenerator(
            gen_cfg.ngf,
            gen_cfg.num_upsampling_layers,
            gen_cfg.crop_size,
            gen_cfg.aspect_ratio,
            gen_cfg.norm_G,
            gen_cfg.semantic_nc,
            gen_cfg.use_vae,
            gen_cfg.nef,
        )
        gen.eval()
        para = load(weight_path)
        if 'net_gen' in para:
            gen.set_state_dict(para['net_gen'])
        else:
            gen.set_state_dict(para)

        self.gen = gen
        self.gen_cfg = gen_cfg

    def run(self, image):
        # Convert the semantic label map to a single channel and resize it
        # to the generator's crop size.
        sem = Image.fromarray(image).convert('L')
        sem = sem.resize((self.gen_cfg.crop_size, self.gen_cfg.crop_size), Image.NEAREST)
        sem = np.array(sem).astype('float32')
        sem = paddle.to_tensor(sem)
        sem = sem.reshape([1, 1, self.gen_cfg.crop_size, self.gen_cfg.crop_size])

        # One-hot encode the label map and run the generator.
        one_hot = data_onehot_pro(sem, self.gen_cfg)
        predicted = self.gen(one_hot)

        # Map the generator output from [-1, 1] to uint8 [0, 255] in HWC order.
        pic = predicted.numpy()[0].reshape((3, 256, 256)).transpose((1, 2, 0))
        pic = ((pic + 1.) / 2. * 255).astype('uint8')
        return pic
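The last two lines of `run` rescale the generator output from the tanh range [-1, 1] to uint8 pixel values. A quick check of that mapping in isolation (note the float-to-uint8 cast truncates, so 0.0 maps to 127):

```python
import numpy as np

# The generator emits values in [-1, 1]; the predictor maps them to [0, 255].
pic = np.array([-1.0, 0.0, 1.0], dtype='float32')
out = ((pic + 1.) / 2. * 255).astype('uint8')
print(out.tolist())  # [0, 127, 255]
```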
(2 collapsed file diffs not shown)
import base64
import cv2
import numpy as np
def base64_to_cv2(b64str):
    data = base64.b64decode(b64str.encode('utf8'))
    data = np.frombuffer(data, np.uint8)
    data = cv2.imdecode(data, cv2.IMREAD_COLOR)
    return data
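A dependency-free round trip of the base64 leg of this helper (the cv2 decode step is omitted so the sketch needs only the standard library; the payload bytes are arbitrary):

```python
import base64

payload = b'\x89PNG not a real image, just bytes'
b64 = base64.b64encode(payload).decode('utf8')
assert base64.b64decode(b64.encode('utf8')) == payload
```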
(3 collapsed file diffs not shown)
import base64
import cv2
import numpy as np
def base64_to_cv2(b64str):
    data = base64.b64decode(b64str.encode('utf8'))
    data = np.frombuffer(data, np.uint8)
    data = cv2.imdecode(data, cv2.IMREAD_COLOR)
    return data
(50 collapsed file diffs not shown)