diff --git a/demos/speech_server/README.md b/demos/speech_server/README.md
index 39007f6caacf8fa7924a2f0d74bfc734277f6a61..ac5cc4b00ac81d7c1b05ad94a4e3ceff428ff8ce 100644
--- a/demos/speech_server/README.md
+++ b/demos/speech_server/README.md
@@ -15,6 +15,17 @@ You can choose one way from easy, medium and hard to install paddlespeech.
 ### 2. Prepare config File
 The configuration file contains the service-related configuration files and the model configuration related to the voice tasks contained in the service. They are all under the `conf` folder.
 
+**Note: The configuration of `engine_backend` in `application.yaml` represents all speech tasks included in the started service.**
+If the service you want to start should contain only one speech task, you need to comment out the speech tasks that are not needed. For example, if you only want to use the speech recognition (ASR) service, you can comment out the speech synthesis (TTS) service, as in the following example:
+```bash
+engine_backend:
+    asr: 'conf/asr/asr.yaml'
+    #tts: 'conf/tts/tts.yaml'
+```
+
+**Note: The configuration file of `engine_backend` in `application.yaml` needs to match the configuration type of `engine_type`.**
+When the configuration file of `engine_backend` is `XXX.yaml`, the configuration type of `engine_type` needs to be set to `python`; when the configuration file of `engine_backend` is `XXX_pd.yaml`, the configuration type of `engine_type` needs to be set to `inference`.
+
 The input of the ASR client demo should be a WAV file (`.wav`), and the sample rate must be the same as the model's.
 
 Here are sample files for this ASR client demo that can be downloaded:
@@ -76,6 +87,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 ### 4. ASR Client Usage
+**Note:** The response time will be slightly longer when using the client for the first time.
 - Command Line (Recommended)
   ```
   paddlespeech_client asr --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
   ```
 
@@ -122,6 +134,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
   ```
 
 ### 5. TTS Client Usage
+**Note:** The response time will be slightly longer when using the client for the first time.
 - Command Line (Recommended)
   ```bash
   paddlespeech_client tts --server_ip 127.0.0.1 --port 8090 --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
@@ -147,8 +160,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
     [2022-02-23 15:20:37,875] [    INFO] - Save synthesized audio successfully on output.wav.
     [2022-02-23 15:20:37,875] [    INFO] - Audio duration: 3.612500 s.
     [2022-02-23 15:20:37,875] [    INFO] - Response time: 0.348050 s.
-    [2022-02-23 15:20:37,875] [    INFO] - RTF: 0.096346
-
     ```
 
@@ -174,51 +185,13 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
     Save synthesized audio successfully on ./output.wav.
     Audio duration: 3.612500 s.
     Response time: 0.388317 s.
-    RTF: 0.107493
     ```
 
-## Pretrained Models
+## Models supported by the service
 ### ASR model
-Here is a list of [ASR pretrained models](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/speech_recognition/README.md#4pretrained-models) released by PaddleSpeech, both command line and python interfaces are available:
-
-| Model | Language | Sample Rate
-| :--- | :---: | :---: |
-| conformer_wenetspeech| zh| 16000
-| transformer_librispeech| en| 16000
+Get all models supported by the ASR service via `paddlespeech_server stats --task asr`, where the static models can be used for Paddle Inference.
 
 ### TTS model
-Here is a list of [TTS pretrained models](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/text_to_speech/README.md#4-pretrained-models) released by PaddleSpeech, both command line and python interfaces are available:
-
-- Acoustic model
-  | Model | Language
-  | :--- | :---: |
-  | speedyspeech_csmsc| zh
-  | fastspeech2_csmsc| zh
-  | fastspeech2_aishell3| zh
-  | fastspeech2_ljspeech| en
-  | fastspeech2_vctk| en
-
-- Vocoder
-  | Model | Language
-  | :--- | :---: |
-  | pwgan_csmsc| zh
-  | pwgan_aishell3| zh
-  | pwgan_ljspeech| en
-  | pwgan_vctk| en
-  | mb_melgan_csmsc| zh
-
-Here is a list of **TTS pretrained static models** released by PaddleSpeech, both command line and python interfaces are available:
-- Acoustic model
-  | Model | Language
-  | :--- | :---: |
-  | speedyspeech_csmsc| zh
-  | fastspeech2_csmsc| zh
-
-- Vocoder
-  | Model | Language
-  | :--- | :---: |
-  | pwgan_csmsc| zh
-  | mb_melgan_csmsc| zh
-  | hifigan_csmsc| zh
+Get all models supported by the TTS service via `paddlespeech_server stats --task tts`, where the static models can be used for Paddle Inference.
diff --git a/demos/speech_server/README_cn.md b/demos/speech_server/README_cn.md
index f56660705800f9d2061b222ec6cd412c7319b759..f202a30cd3ee3891231f81cd789bc89712baf2ec 100644
--- a/demos/speech_server/README_cn.md
+++ b/demos/speech_server/README_cn.md
@@ -14,6 +14,15 @@
 ### 2. 准备配置文件
 配置文件包含服务相关的配置文件和服务中包含的语音任务相关的模型配置。 它们都在 `conf` 文件夹下。
 
+**注意:`application.yaml` 中 `engine_backend` 的配置表示启动的服务中包含的所有语音任务。**
+如果你想启动的服务中只包含某项语音任务,那么你需要注释掉不需要包含的语音任务。例如你只想使用语音识别(ASR)服务,那么你可以将语音合成(TTS)服务注释掉,如下示例:
+```bash
+engine_backend:
+    asr: 'conf/asr/asr.yaml'
+    #tts: 'conf/tts/tts.yaml'
+```
+**注意:`application.yaml` 中 `engine_backend` 的配置文件需要和 `engine_type` 的配置类型匹配。**
+当 `engine_backend` 的配置文件为 `XXX.yaml` 时,需要设置 `engine_type` 的配置类型为 `python`;当 `engine_backend` 的配置文件为 `XXX_pd.yaml` 时,需要设置 `engine_type` 的配置类型为 `inference`。
 
 这个 ASR client 的输入应该是一个 WAV 文件(`.wav`),并且采样率必须与模型的采样率相同。
@@ -75,6 +84,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 ```
 ### 4. ASR客户端使用方法
+**注意:** 初次使用客户端时响应时间会略长。
 - 命令行 (推荐使用)
   ```
   paddlespeech_client asr --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
   ```
@@ -123,6 +133,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 ```
 ### 5. TTS客户端使用方法
+**注意:** 初次使用客户端时响应时间会略长。
 ```bash
 paddlespeech_client tts --server_ip 127.0.0.1 --port 8090 --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
 ```
@@ -148,7 +159,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
     [2022-02-23 15:20:37,875] [    INFO] - Save synthesized audio successfully on output.wav.
     [2022-02-23 15:20:37,875] [    INFO] - Audio duration: 3.612500 s.
     [2022-02-23 15:20:37,875] [    INFO] - Response time: 0.348050 s.
-    [2022-02-23 15:20:37,875] [    INFO] - RTF: 0.096346
     ```
 
 - Python API
@@ -173,50 +183,12 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
     Save synthesized audio successfully on ./output.wav.
     Audio duration: 3.612500 s.
     Response time: 0.388317 s.
-    RTF: 0.107493
     ```
 
-## Pretrained Models
-### ASR model
-下面是PaddleSpeech发布的[ASR预训练模型](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/speech_recognition/README.md#4pretrained-models)列表,命令行和python接口均可用:
-
-| Model | Language | Sample Rate
-| :--- | :---: | :---: |
-| conformer_wenetspeech| zh| 16000
-| transformer_librispeech| en| 16000
-
-### TTS model
-下面是PaddleSpeech发布的 [TTS预训练模型](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/text_to_speech/README.md#4-pretrained-models) 列表,命令行和python接口均可用:
-
-- Acoustic model
-  | Model | Language
-  | :--- | :---: |
-  | speedyspeech_csmsc| zh
-  | fastspeech2_csmsc| zh
-  | fastspeech2_aishell3| zh
-  | fastspeech2_ljspeech| en
-  | fastspeech2_vctk| en
-
-- Vocoder
-  | Model | Language
-  | :--- | :---: |
-  | pwgan_csmsc| zh
-  | pwgan_aishell3| zh
-  | pwgan_ljspeech| en
-  | pwgan_vctk| en
-  | mb_melgan_csmsc| zh
-
-下面是PaddleSpeech发布的 **TTS预训练静态模型** 列表,命令行和python接口均可用:
-- Acoustic model
-  | Model | Language
-  | :--- | :---: |
-  | speedyspeech_csmsc| zh
-  | fastspeech2_csmsc| zh
-
-- Vocoder
-  | Model | Language
-  | :--- | :---: |
-  | pwgan_csmsc| zh
-  | mb_melgan_csmsc| zh
-  | hifigan_csmsc| zh
+## 服务支持的模型
+### ASR支持的模型
+通过 `paddlespeech_server stats --task asr` 获取 ASR 服务支持的所有模型,其中静态模型可用于 Paddle Inference 推理。
+
+### TTS支持的模型
+通过 `paddlespeech_server stats --task tts` 获取 TTS 服务支持的所有模型,其中静态模型可用于 Paddle Inference 推理。
diff --git a/demos/speech_server/conf/application.yaml b/demos/speech_server/conf/application.yaml
index fd4f5f37567486b05b026568dc09a2973491b12e..6dcae74a944fdf129a35b991b23be5c724d5df16 100644
--- a/demos/speech_server/conf/application.yaml
+++ b/demos/speech_server/conf/application.yaml
@@ -3,23 +3,25 @@
 ##################################################################
 #                      SERVER SETTING                            #
 ##################################################################
-host: '0.0.0.0'
+host: '127.0.0.1'
 port: 8090
 
 ##################################################################
 #                      CONFIG FILE                               #
 ##################################################################
+# Add an engine backend (Options: asr, tts) and its config file here.
+# Adding a speech task under engine_backend means it will be started as part of the service.
+engine_backend:
+    asr: 'conf/asr/asr.yaml'
+    tts: 'conf/tts/tts.yaml'
+
 # The engine_type of speech task needs to keep the same type as the config file of speech task.
 # E.g: The engine_type of asr is 'python', the engine_backend of asr is 'XX/asr.yaml'
 # E.g: The engine_type of asr is 'inference', the engine_backend of asr is 'XX/asr_pd.yaml'
 #
 # add engine type (Options: python, inference)
 engine_type:
-    asr: 'inference'
-    tts: 'inference'
+    asr: 'python'
+    tts: 'python'
+
-# add engine backend type (Options: asr, tts) and config file here.
-# Adding a speech task to engine_backend means starting the service.
-engine_backend:
-    asr: 'conf/asr/asr_pd.yaml'
-    tts: 'conf/tts/tts_pd.yaml'
diff --git a/demos/speech_server/conf/asr/asr.yaml b/demos/speech_server/conf/asr/asr.yaml
index 1a805142a9a1a85b2dfd67a22e216c236bcc9664..a6743b77513e504f2bcd374ea8235d8e39a7c98c 100644
--- a/demos/speech_server/conf/asr/asr.yaml
+++ b/demos/speech_server/conf/asr/asr.yaml
@@ -5,4 +5,4 @@ cfg_path: # [optional]
 ckpt_path: # [optional]
 decode_method: 'attention_rescoring'
 force_yes: True
-device: 'cpu' # set 'gpu:id' or 'cpu'
+device: # set 'gpu:id' or 'cpu'
diff --git a/demos/speech_server/conf/asr/asr_pd.yaml b/demos/speech_server/conf/asr/asr_pd.yaml
index 6cddb4503fc253ba98585d5e0a9d8a079a26aeaf..4c415ac791edeab2d9832e8db2e9a66411aaed06 100644
--- a/demos/speech_server/conf/asr/asr_pd.yaml
+++ b/demos/speech_server/conf/asr/asr_pd.yaml
@@ -15,9 +15,10 @@ decode_method:
 force_yes: True
 
 am_predictor_conf:
-    device: 'cpu' # set 'gpu:id' or 'cpu'
-    enable_mkldnn: True
+    device: # set 'gpu:id' or 'cpu'
     switch_ir_optim: True
+    glog_info: False # True -> print glog
+    summary: True # False -> do not show predictor config
 
 
 ##################################################################
diff --git a/demos/speech_server/conf/tts/tts.yaml b/demos/speech_server/conf/tts/tts.yaml
index 19e8874e31c04d99cef2cfb66ab1f86f6605d12e..19207f0b03579a906c80ba6eff356792974eeefd 100644
--- a/demos/speech_server/conf/tts/tts.yaml
+++ b/demos/speech_server/conf/tts/tts.yaml
@@ -29,4 +29,4 @@ voc_stat:
 #                             OTHERS                             #
 ##################################################################
 lang: 'zh'
-device: 'cpu' # set 'gpu:id' or 'cpu'
+device: # set 'gpu:id' or 'cpu'
diff --git a/demos/speech_server/conf/tts/tts_pd.yaml b/demos/speech_server/conf/tts/tts_pd.yaml
index 97df526132a8f12210db91c49fb51258ab976c35..e27b9665bbe1ee8b5d5c39fd3e5f87d841dd64de 100644
--- a/demos/speech_server/conf/tts/tts_pd.yaml
+++ b/demos/speech_server/conf/tts/tts_pd.yaml
@@ -15,9 +15,10 @@ speaker_dict:
 spk_id: 0
 
 am_predictor_conf:
-    device: 'cpu' # set 'gpu:id' or 'cpu'
-    enable_mkldnn: False
-    switch_ir_optim: False
+    device: # set 'gpu:id' or 'cpu'
+    switch_ir_optim: True
+    glog_info: False # True -> print glog
+    summary: True # False -> do not show predictor config
 
 
 ##################################################################
@@ -30,9 +31,10 @@ voc_params: # the pdiparams file of your vocoder static model (XX.pdipparams)
 voc_sample_rate: 24000
 
 voc_predictor_conf:
-    device: 'cpu' # set 'gpu:id' or 'cpu'
-    enable_mkldnn: False
-    switch_ir_optim: False
+    device: # set 'gpu:id' or 'cpu'
+    switch_ir_optim: True
+    glog_info: False # True -> print glog
+    summary: True # False -> do not show predictor config
 
 ##################################################################
 #                             OTHERS                             #
diff --git a/paddlespeech/cli/__init__.py b/paddlespeech/cli/__init__.py
index 12ff9919a29f453a11853571eb3dad836f824556..b526a3849b0ed5deddd519e7a0573a592c743d2f 100644
--- a/paddlespeech/cli/__init__.py
+++ b/paddlespeech/cli/__init__.py
@@ -18,8 +18,8 @@ from .base_commands import BaseCommand
 from .base_commands import HelpCommand
 from .cls import CLSExecutor
 from .st import STExecutor
+from .stats import StatsExecutor
 from .text import TextExecutor
 from .tts import TTSExecutor
-from .stats import StatsExecutor
 
 _locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8'])
diff --git a/paddlespeech/cli/tts/infer.py b/paddlespeech/cli/tts/infer.py
index ba15d652415d33abfe3ae3b2252675cd22b54aba..8423dfa8d1cbf7fc651ff5e538d0ec0993ca2e9f 100644
--- a/paddlespeech/cli/tts/infer.py
+++ b/paddlespeech/cli/tts/infer.py
@@ -13,6 +13,7 @@
 # limitations under the License.
 import argparse
 import os
+import time
 from collections import OrderedDict
 from typing import Any
 from typing import List
@@ -621,6 +622,7 @@ class TTSExecutor(BaseExecutor):
         am_dataset = am[am.rindex('_') + 1:]
         get_tone_ids = False
         merge_sentences = False
+        frontend_st = time.time()
         if am_name == 'speedyspeech':
             get_tone_ids = True
         if lang == 'zh':
@@ -637,9 +639,13 @@ class TTSExecutor(BaseExecutor):
             phone_ids = input_ids["phone_ids"]
         else:
             print("lang should in {'zh', 'en'}!")
+        self.frontend_time = time.time() - frontend_st
+        self.am_time = 0
+        self.voc_time = 0
 
         flags = 0
         for i in range(len(phone_ids)):
+            am_st = time.time()
             part_phone_ids = phone_ids[i]
             # am
             if am_name == 'speedyspeech':
@@ -653,13 +659,16 @@ class TTSExecutor(BaseExecutor):
                         part_phone_ids, spk_id=paddle.to_tensor(spk_id))
                 else:
                     mel = self.am_inference(part_phone_ids)
+            self.am_time += (time.time() - am_st)
 
             # voc
+            voc_st = time.time()
             wav = self.voc_inference(mel)
             if flags == 0:
                 wav_all = wav
                 flags = 1
             else:
                 wav_all = paddle.concat([wav_all, wav])
+            self.voc_time += (time.time() - voc_st)
 
         self._outputs['wav'] = wav_all
 
     def postprocess(self, output: str='output.wav') -> Union[str, os.PathLike]:
diff --git a/paddlespeech/server/bin/paddlespeech_client.py b/paddlespeech/server/bin/paddlespeech_client.py
index 853d272fb4a40ebe10890b6717e433aacb768ea0..ee6ab7ad764b873a899d0503550a2ad51cd7eadf 100644
--- a/paddlespeech/server/bin/paddlespeech_client.py
+++ b/paddlespeech/server/bin/paddlespeech_client.py
@@ -121,7 +121,6 @@ class TTSClientExecutor(BaseExecutor):
                         (args.output))
             logger.info("Audio duration: %f s." % (duration))
             logger.info("Response time: %f s." % (time_consume))
-            logger.info("RTF: %f " % (time_consume / duration))
             return True
 
         except BaseException:
diff --git a/paddlespeech/server/conf/application.yaml b/paddlespeech/server/conf/application.yaml
index cc08665eabde72596373cbfdc13bef3f9d4ad314..6dcae74a944fdf129a35b991b23be5c724d5df16 100644
--- a/paddlespeech/server/conf/application.yaml
+++ b/paddlespeech/server/conf/application.yaml
@@ -3,12 +3,18 @@
 ##################################################################
 #                      SERVER SETTING                            #
 ##################################################################
-host: '0.0.0.0'
+host: '127.0.0.1'
 port: 8090
 
 ##################################################################
 #                      CONFIG FILE                               #
 ##################################################################
+# Add an engine backend (Options: asr, tts) and its config file here.
+# Adding a speech task under engine_backend means it will be started as part of the service.
+engine_backend:
+    asr: 'conf/asr/asr.yaml'
+    tts: 'conf/tts/tts.yaml'
+
 # The engine_type of speech task needs to keep the same type as the config file of speech task.
 # E.g: The engine_type of asr is 'python', the engine_backend of asr is 'XX/asr.yaml'
 # E.g: The engine_type of asr is 'inference', the engine_backend of asr is 'XX/asr_pd.yaml'
@@ -18,8 +24,4 @@ engine_type:
     asr: 'python'
     tts: 'python'
 
-# add engine backend type (Options: asr, tts) and config file here.
-# Adding a speech task to engine_backend means starting the service.
-engine_backend:
-    asr: 'conf/asr/asr.yaml'
-    tts: 'conf/tts/tts.yaml'
+
diff --git a/paddlespeech/server/conf/asr/asr.yaml b/paddlespeech/server/conf/asr/asr.yaml
index 1a805142a9a1a85b2dfd67a22e216c236bcc9664..a6743b77513e504f2bcd374ea8235d8e39a7c98c 100644
--- a/paddlespeech/server/conf/asr/asr.yaml
+++ b/paddlespeech/server/conf/asr/asr.yaml
@@ -5,4 +5,4 @@ cfg_path: # [optional]
 ckpt_path: # [optional]
 decode_method: 'attention_rescoring'
 force_yes: True
-device: 'cpu' # set 'gpu:id' or 'cpu'
+device: # set 'gpu:id' or 'cpu'
diff --git a/paddlespeech/server/conf/asr/asr_pd.yaml b/paddlespeech/server/conf/asr/asr_pd.yaml
index 6cddb4503fc253ba98585d5e0a9d8a079a26aeaf..4c415ac791edeab2d9832e8db2e9a66411aaed06 100644
--- a/paddlespeech/server/conf/asr/asr_pd.yaml
+++ b/paddlespeech/server/conf/asr/asr_pd.yaml
@@ -15,9 +15,10 @@ decode_method:
 force_yes: True
 
 am_predictor_conf:
-    device: 'cpu' # set 'gpu:id' or 'cpu'
-    enable_mkldnn: True
+    device: # set 'gpu:id' or 'cpu'
     switch_ir_optim: True
+    glog_info: False # True -> print glog
+    summary: True # False -> do not show predictor config
 
 
 ##################################################################
diff --git a/paddlespeech/server/conf/tts/tts.yaml b/paddlespeech/server/conf/tts/tts.yaml
index 19e8874e31c04d99cef2cfb66ab1f86f6605d12e..19207f0b03579a906c80ba6eff356792974eeefd 100644
--- a/paddlespeech/server/conf/tts/tts.yaml
+++ b/paddlespeech/server/conf/tts/tts.yaml
@@ -29,4 +29,4 @@ voc_stat:
 #                             OTHERS                             #
 ##################################################################
 lang: 'zh'
-device: 'cpu' # set 'gpu:id' or 'cpu'
+device: # set 'gpu:id' or 'cpu'
diff --git a/paddlespeech/server/conf/tts/tts_pd.yaml b/paddlespeech/server/conf/tts/tts_pd.yaml
index 019c7ed6a96c97a32fc7b474ab82d8b72d4b4006..e27b9665bbe1ee8b5d5c39fd3e5f87d841dd64de 100644
--- a/paddlespeech/server/conf/tts/tts_pd.yaml
+++ b/paddlespeech/server/conf/tts/tts_pd.yaml
@@ -8,16 +8,17 @@ am: 'fastspeech2_csmsc'
 am_model: # the pdmodel file of your am static model (XX.pdmodel)
 am_params: # the pdiparams file of your am static model (XX.pdipparams)
-am_sample_rate: 24000 # must match the model
+am_sample_rate: 24000
 phones_dict:
 tones_dict:
 speaker_dict:
 spk_id: 0
 
 am_predictor_conf:
-    device: 'cpu' # set 'gpu:id' or 'cpu'
-    enable_mkldnn: False
-    switch_ir_optim: False
+    device: # set 'gpu:id' or 'cpu'
+    switch_ir_optim: True
+    glog_info: False # True -> print glog
+    summary: True # False -> do not show predictor config
 
 
 ##################################################################
@@ -27,12 +28,13 @@ am_predictor_conf:
 voc: 'pwgan_csmsc'
 voc_model: # the pdmodel file of your vocoder static model (XX.pdmodel)
 voc_params: # the pdiparams file of your vocoder static model (XX.pdipparams)
-voc_sample_rate: 24000 #must match the model
+voc_sample_rate: 24000
 
 voc_predictor_conf:
-    device: 'cpu' # set 'gpu:id' or 'cpu'
-    enable_mkldnn: False
-    switch_ir_optim: False
+    device: # set 'gpu:id' or 'cpu'
+    switch_ir_optim: True
+    glog_info: False # True -> print glog
+    summary: True # False -> do not show predictor config
 
 ##################################################################
 #                             OTHERS                             #
diff --git a/paddlespeech/server/engine/asr/paddleinference/asr_engine.py b/paddlespeech/server/engine/asr/paddleinference/asr_engine.py
index 5d4c4fa6aba15d3c8501687435559daf26de1445..cb973e924efb5bcd7de440f97a27c0d29fda29c0 100644
--- a/paddlespeech/server/engine/asr/paddleinference/asr_engine.py
+++ b/paddlespeech/server/engine/asr/paddleinference/asr_engine.py
@@ -13,6 +13,7 @@
 # limitations under the License.
 import io
 import os
+import time
 from typing import Optional
 
 import paddle
@@ -197,7 +198,6 @@ class ASREngine(BaseEngine):
         self.executor = ASRServerExecutor()
 
         self.config = get_config(config_file)
-        paddle.set_device(paddle.get_device())
 
         self.executor._init_from_path(
             model_type=self.config.model_type,
             am_model=self.config.am_model,
@@ -223,13 +223,17 @@ class ASREngine(BaseEngine):
             logger.info("start running asr engine")
             self.executor.preprocess(self.config.model_type,
                                      io.BytesIO(audio_data))
+            st = time.time()
             self.executor.infer(self.config.model_type)
+            infer_time = time.time() - st
             self.output = self.executor.postprocess()  # Retrieve result of asr.
             logger.info("end inferring asr engine")
+            logger.info("inference time: {}".format(infer_time))
+            logger.info("asr engine type: paddle inference")
         else:
             logger.info("file check failed!")
             self.output = None
 
     def postprocess(self):
         """postprocess
         """
diff --git a/paddlespeech/server/engine/asr/python/asr_engine.py b/paddlespeech/server/engine/asr/python/asr_engine.py
index 9fac487d777a684abf609e87da2c93e00dd83cb8..1e2c5cc270dab1f82caa9c0810411211c8cdbe2e 100644
--- a/paddlespeech/server/engine/asr/python/asr_engine.py
+++ b/paddlespeech/server/engine/asr/python/asr_engine.py
@@ -12,6 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import io
+import time
 
 import paddle
 
@@ -53,16 +54,24 @@ class ASREngine(BaseEngine):
         self.executor = ASRServerExecutor()
 
         self.config = get_config(config_file)
-        if self.config.device is None:
-            paddle.set_device(paddle.get_device())
-        else:
-            paddle.set_device(self.config.device)
+        try:
+            if self.config.device:
+                self.device = self.config.device
+            else:
+                self.device = paddle.get_device()
+            paddle.set_device(self.device)
+        except BaseException:
+            logger.error(
+                "Set device failed, please check whether the device is already in use and whether the 'device' parameter in the yaml file is valid"
+            )
+
        self.executor._init_from_path(
             self.config.model, self.config.lang, self.config.sample_rate,
             self.config.cfg_path, self.config.decode_method,
             self.config.ckpt_path)
 
-        logger.info("Initialize ASR server engine successfully.")
+        logger.info("Initialize ASR server engine successfully on device: %s." %
+                    (self.device))
         return True
 
     def run(self, audio_data):
@@ -76,12 +85,16 @@ class ASREngine(BaseEngine):
                                  self.config.force_yes):
             logger.info("start run asr engine")
             self.executor.preprocess(self.config.model, io.BytesIO(audio_data))
+            st = time.time()
             self.executor.infer(self.config.model)
+            infer_time = time.time() - st
             self.output = self.executor.postprocess()  # Retrieve result of asr.
+            logger.info("inference time: {}".format(infer_time))
+            logger.info("asr engine type: python")
         else:
             logger.info("file check failed!")
             self.output = None
 
     def postprocess(self):
         """postprocess
         """
diff --git a/paddlespeech/server/engine/tts/paddleinference/tts_engine.py b/paddlespeech/server/engine/tts/paddleinference/tts_engine.py
index a9dc5f4ea742b903e229c4f3520909667a67881c..5955c1a216a304629c4896a0f9462d39d9121715 100644
--- a/paddlespeech/server/engine/tts/paddleinference/tts_engine.py
+++ b/paddlespeech/server/engine/tts/paddleinference/tts_engine.py
@@ -14,6 +14,7 @@
 import base64
 import io
 import os
+import time
 from typing import Optional
 
 import librosa
@@ -179,7 +180,7 @@ class TTSServerExecutor(TTSExecutor):
         self.phones_dict = os.path.abspath(phones_dict)
         self.am_sample_rate = am_sample_rate
         self.am_res_path = os.path.dirname(os.path.abspath(self.am_model))
-        print("self.phones_dict:", self.phones_dict)
+        logger.info("self.phones_dict: {}".format(self.phones_dict))
 
         # for speedyspeech
         self.tones_dict = None
@@ -224,21 +225,21 @@ class TTSServerExecutor(TTSExecutor):
         with open(self.phones_dict, "r") as f:
             phn_id = [line.strip().split() for line in f.readlines()]
         vocab_size = len(phn_id)
-        print("vocab_size:", vocab_size)
+        logger.info("vocab_size: {}".format(vocab_size))
 
         tone_size = None
         if self.tones_dict:
             with open(self.tones_dict, "r") as f:
                 tone_id = [line.strip().split() for line in f.readlines()]
             tone_size = len(tone_id)
-            print("tone_size:", tone_size)
+            logger.info("tone_size: {}".format(tone_size))
 
         spk_num = None
         if self.speaker_dict:
             with open(self.speaker_dict, 'rt') as f:
                 spk_id = [line.strip().split() for line in f.readlines()]
             spk_num = len(spk_id)
-            print("spk_num:", spk_num)
+            logger.info("spk_num: {}".format(spk_num))
 
         # frontend
         if lang == 'zh':
@@ -248,21 +249,29 @@ class TTSServerExecutor(TTSExecutor):
         elif lang == 'en':
             self.frontend = English(phone_vocab_path=self.phones_dict)
-        print("frontend done!")
-
-        # am predictor
-        self.am_predictor_conf = am_predictor_conf
-        self.am_predictor = init_predictor(
-            model_file=self.am_model,
-            params_file=self.am_params,
-            predictor_conf=self.am_predictor_conf)
-
-        # voc predictor
-        self.voc_predictor_conf = voc_predictor_conf
-        self.voc_predictor = init_predictor(
-            model_file=self.voc_model,
-            params_file=self.voc_params,
-            predictor_conf=self.voc_predictor_conf)
+        logger.info("frontend done!")
+
+        try:
+            # am predictor
+            self.am_predictor_conf = am_predictor_conf
+            self.am_predictor = init_predictor(
+                model_file=self.am_model,
+                params_file=self.am_params,
+                predictor_conf=self.am_predictor_conf)
+            logger.info("Create AM predictor successfully.")
+        except BaseException:
+            logger.error("Failed to create AM predictor.")
+
+        try:
+            # voc predictor
+            self.voc_predictor_conf = voc_predictor_conf
+            self.voc_predictor = init_predictor(
+                model_file=self.voc_model,
+                params_file=self.voc_params,
+                predictor_conf=self.voc_predictor_conf)
+            logger.info("Create Vocoder predictor successfully.")
+        except BaseException:
+            logger.error("Failed to create Vocoder predictor.")
 
     @paddle.no_grad()
     def infer(self,
@@ -277,6 +286,7 @@ class TTSServerExecutor(TTSExecutor):
         am_dataset = am[am.rindex('_') + 1:]
         get_tone_ids = False
         merge_sentences = False
+        frontend_st = time.time()
         if am_name == 'speedyspeech':
             get_tone_ids = True
         if lang == 'zh':
@@ -292,10 +302,14 @@ class TTSServerExecutor(TTSExecutor):
                 text, merge_sentences=merge_sentences)
             phone_ids = input_ids["phone_ids"]
         else:
-            print("lang should in {'zh', 'en'}!")
+            logger.error("lang should be in {'zh', 'en'}!")
logger.error("lang should in {'zh', 'en'}!") + self.frontend_time = time.time() - frontend_st + self.am_time = 0 + self.voc_time = 0 flags = 0 for i in range(len(phone_ids)): + am_st = time.time() part_phone_ids = phone_ids[i] # am if am_name == 'speedyspeech': @@ -314,7 +328,10 @@ class TTSServerExecutor(TTSExecutor): am_result = run_model(self.am_predictor, [part_phone_ids.numpy()]) mel = am_result[0] + self.am_time += (time.time() - am_st) + # voc + voc_st = time.time() voc_result = run_model(self.voc_predictor, [mel]) wav = voc_result[0] wav = paddle.to_tensor(wav) @@ -324,6 +341,7 @@ class TTSServerExecutor(TTSExecutor): flags = 1 else: wav_all = paddle.concat([wav_all, wav]) + self.voc_time += (time.time() - voc_st) self._outputs['wav'] = wav_all @@ -370,7 +388,7 @@ class TTSEngine(BaseEngine): def postprocess(self, wav, original_fs: int, - target_fs: int=16000, + target_fs: int=0, volume: float=1.0, speed: float=1.0, audio_path: str=None): @@ -395,38 +413,50 @@ class TTSEngine(BaseEngine): if target_fs == 0 or target_fs > original_fs: target_fs = original_fs wav_tar_fs = wav + logger.info( + "The sample rate of synthesized audio is the same as model, which is {}Hz". + format(original_fs)) else: wav_tar_fs = librosa.resample( np.squeeze(wav), original_fs, target_fs) - + logger.info( + "The sample rate of model is {}Hz and the target sample rate is {}Hz. Converting the sample rate of the synthesized audio successfully.". + format(original_fs, target_fs)) # transform volume wav_vol = wav_tar_fs * volume + logger.info("Transform the volume of the audio successfully.") # transform speed try: # windows not support soxbindings wav_speed = change_speed(wav_vol, speed, target_fs) + logger.info("Transform the speed of the audio successfully.") except ServerBaseException: raise ServerBaseException( ErrorCode.SERVER_INTERNAL_ERR, - "Transform speed failed. Can not install soxbindings on your system. \ + "Failed to transform speed. Can not install soxbindings on your system. 
                 You need to set speed value 1.0.")
         except BaseException:
-            logger.error("Transform speed failed.")
+            logger.error("Failed to transform speed.")
 
         # wav to base64
         buf = io.BytesIO()
         wavfile.write(buf, target_fs, wav_speed)
         base64_bytes = base64.b64encode(buf.read())
         wav_base64 = base64_bytes.decode('utf-8')
+        logger.info("Audio to string successfully.")
 
         # save audio
-        if audio_path is not None and audio_path.endswith(".wav"):
-            sf.write(audio_path, wav_speed, target_fs)
-        elif audio_path is not None and audio_path.endswith(".pcm"):
-            wav_norm = wav_speed * (32767 / max(0.001,
-                                                np.max(np.abs(wav_speed))))
-            with open(audio_path, "wb") as f:
-                f.write(wav_norm.astype(np.int16))
+        if audio_path is not None:
+            if audio_path.endswith(".wav"):
+                sf.write(audio_path, wav_speed, target_fs)
+            elif audio_path.endswith(".pcm"):
+                wav_norm = wav_speed * (32767 / max(0.001,
+                                                    np.max(np.abs(wav_speed))))
+                with open(audio_path, "wb") as f:
+                    f.write(wav_norm.astype(np.int16))
+            logger.info("Save audio to {} successfully.".format(audio_path))
+        else:
+            logger.info("There is no need to save audio.")
 
         return target_fs, wav_base64
 
@@ -462,8 +492,12 @@ class TTSEngine(BaseEngine):
         lang = self.config.lang
 
         try:
+            infer_st = time.time()
             self.executor.infer(
                 text=sentence, lang=lang, am=self.config.am, spk_id=spk_id)
+            infer_et = time.time()
+            infer_time = infer_et - infer_st
+
         except ServerBaseException:
             raise ServerBaseException(ErrorCode.SERVER_INTERNAL_ERR,
                                       "tts infer failed.")
@@ -471,6 +505,7 @@ class TTSEngine(BaseEngine):
             logger.error("tts infer failed.")
 
         try:
+            postprocess_st = time.time()
             target_sample_rate, wav_base64 = self.postprocess(
                 wav=self.executor._outputs['wav'].numpy(),
                 original_fs=self.executor.am_sample_rate,
@@ -478,10 +513,34 @@ class TTSEngine(BaseEngine):
                 volume=volume,
                 speed=speed,
                 audio_path=save_path)
+            postprocess_et = time.time()
+            postprocess_time = postprocess_et - postprocess_st
+            duration = len(self.executor._outputs['wav']
+                           .numpy()) / self.executor.am_sample_rate
+            rtf = infer_time / duration
+
         except ServerBaseException:
             raise ServerBaseException(ErrorCode.SERVER_INTERNAL_ERR,
                                       "tts postprocess failed.")
         except BaseException:
             logger.error("tts postprocess failed.")
 
+        logger.info("AM model: {}".format(self.config.am))
+        logger.info("Vocoder model: {}".format(self.config.voc))
+        logger.info("Language: {}".format(lang))
+        logger.info("tts engine type: paddle inference")
+
+        logger.info("audio duration: {}".format(duration))
+        logger.info(
+            "frontend inference time: {}".format(self.executor.frontend_time))
+        logger.info("AM inference time: {}".format(self.executor.am_time))
+        logger.info("Vocoder inference time: {}".format(self.executor.voc_time))
+        logger.info("total inference time: {}".format(infer_time))
+        logger.info(
+            "postprocess (change speed, volume, target sample rate) time: {}".
+            format(postprocess_time))
+        logger.info("total generate audio time: {}".format(infer_time +
+                                                           postprocess_time))
+        logger.info("RTF: {}".format(rtf))
+
         return lang, target_sample_rate, wav_base64
diff --git a/paddlespeech/server/engine/tts/python/tts_engine.py b/paddlespeech/server/engine/tts/python/tts_engine.py
index 20b4e0fe94589bf831929cdd19f1b77fa6297f39..7dd576699d02c2ecef8b0993a0273f9826c08a6b 100644
--- a/paddlespeech/server/engine/tts/python/tts_engine.py
+++ b/paddlespeech/server/engine/tts/python/tts_engine.py
@@ -13,6 +13,7 @@
 # limitations under the License.
 import base64
 import io
+import time
 
 import librosa
 import numpy as np
@@ -54,11 +55,20 @@ class TTSEngine(BaseEngine):
 
         try:
             self.config = get_config(config_file)
-            if self.config.device is None:
-                paddle.set_device(paddle.get_device())
+            if self.config.device:
+                self.device = self.config.device
             else:
-                paddle.set_device(self.config.device)
+                self.device = paddle.get_device()
+            paddle.set_device(self.device)
+        except BaseException:
+            logger.error(
+                "Set device failed, please check whether the device is already in use and whether the 'device' parameter in the yaml file is valid"
+            )
+            logger.error("Initialize TTS server engine failed on device: %s." %
+                         (self.device))
+            return False
 
+        try:
             self.executor._init_from_path(
                 am=self.config.am,
                 am_config=self.config.am_config,
@@ -73,16 +83,19 @@ class TTSEngine(BaseEngine):
                 voc_stat=self.config.voc_stat,
                 lang=self.config.lang)
         except BaseException:
-            logger.error("Initialize TTS server engine Failed.")
+            logger.error("Failed to get model related files.")
+            logger.error("Initialize TTS server engine failed on device: %s." %
+                         (self.device))
             return False
 
-        logger.info("Initialize TTS server engine successfully.")
+        logger.info("Initialize TTS server engine successfully on device: %s." %
+                    (self.device))
         return True
 
     def postprocess(self,
                     wav,
                     original_fs: int,
-                    target_fs: int=16000,
+                    target_fs: int=0,
                     volume: float=1.0,
                     speed: float=1.0,
                     audio_path: str=None):
@@ -107,38 +120,50 @@ class TTSEngine(BaseEngine):
         if target_fs == 0 or target_fs > original_fs:
             target_fs = original_fs
             wav_tar_fs = wav
+            logger.info(
+                "The sample rate of the synthesized audio is the same as the model's, which is {}Hz".
+                format(original_fs))
         else:
             wav_tar_fs = librosa.resample(
                 np.squeeze(wav), original_fs, target_fs)
-
+            logger.info(
+                "The sample rate of the model is {}Hz and the target sample rate is {}Hz. Converted the sample rate of the synthesized audio successfully.".
+                format(original_fs, target_fs))
         # transform volume
         wav_vol = wav_tar_fs * volume
+        logger.info("Transform the volume of the audio successfully.")
 
         # transform speed
         try:  # windows not support soxbindings
             wav_speed = change_speed(wav_vol, speed, target_fs)
+            logger.info("Transform the speed of the audio successfully.")
         except ServerBaseException:
             raise ServerBaseException(
                 ErrorCode.SERVER_INTERNAL_ERR,
-                "Transform speed failed. Can not install soxbindings on your system. \
+                "Failed to transform speed. Can not install soxbindings on your system. \
                 You need to set speed value 1.0.")
         except BaseException:
-            logger.error("Transform speed failed.")
+            logger.error("Failed to transform speed.")
 
         # wav to base64
         buf = io.BytesIO()
         wavfile.write(buf, target_fs, wav_speed)
         base64_bytes = base64.b64encode(buf.read())
         wav_base64 = base64_bytes.decode('utf-8')
+        logger.info("Audio to string successfully.")
 
         # save audio
-        if audio_path is not None and audio_path.endswith(".wav"):
-            sf.write(audio_path, wav_speed, target_fs)
-        elif audio_path is not None and audio_path.endswith(".pcm"):
-            wav_norm = wav_speed * (32767 / max(0.001,
-                                                np.max(np.abs(wav_speed))))
-            with open(audio_path, "wb") as f:
-                f.write(wav_norm.astype(np.int16))
+        if audio_path is not None:
+            if audio_path.endswith(".wav"):
+                sf.write(audio_path, wav_speed, target_fs)
+            elif audio_path.endswith(".pcm"):
+                wav_norm = wav_speed * (32767 / max(0.001,
+                                                    np.max(np.abs(wav_speed))))
+                with open(audio_path, "wb") as f:
+                    f.write(wav_norm.astype(np.int16))
+            logger.info("Save audio to {} successfully.".format(audio_path))
+        else:
+            logger.info("There is no need to save audio.")
 
         return target_fs, wav_base64
 
@@ -174,8 +199,15 @@ class TTSEngine(BaseEngine):
         lang = self.config.lang
 
         try:
+            infer_st = time.time()
             self.executor.infer(
                 text=sentence, lang=lang, am=self.config.am, spk_id=spk_id)
+            infer_et = time.time()
+            infer_time = infer_et - infer_st
+            duration = len(self.executor._outputs['wav']
+                           .numpy()) / self.executor.am_config.fs
+            rtf = infer_time / duration
+
         except ServerBaseException:
             raise ServerBaseException(ErrorCode.SERVER_INTERNAL_ERR,
                                       "tts infer failed.")
@@ -183,6 +215,7 @@ class TTSEngine(BaseEngine):
             logger.error("tts infer failed.")
 
         try:
+            postprocess_st = time.time()
             target_sample_rate, wav_base64 = self.postprocess(
                 wav=self.executor._outputs['wav'].numpy(),
                 original_fs=self.executor.am_config.fs,
@@ -190,10 +223,32 @@ class TTSEngine(BaseEngine):
                 volume=volume,
                 speed=speed,
                 audio_path=save_path)
+            postprocess_et = time.time()
+            postprocess_time = postprocess_et - postprocess_st
+
         except ServerBaseException:
             raise ServerBaseException(ErrorCode.SERVER_INTERNAL_ERR,
                                       "tts postprocess failed.")
         except BaseException:
             logger.error("tts postprocess failed.")
 
+        logger.info("AM model: {}".format(self.config.am))
+        logger.info("Vocoder model: {}".format(self.config.voc))
+        logger.info("Language: {}".format(lang))
+        logger.info("tts engine type: python")
+
+        logger.info("audio duration: {}".format(duration))
+        logger.info(
+            "frontend inference time: {}".format(self.executor.frontend_time))
+        logger.info("AM inference time: {}".format(self.executor.am_time))
+        logger.info("Vocoder inference time: {}".format(self.executor.voc_time))
+        logger.info("total inference time: {}".format(infer_time))
+        logger.info(
+            "postprocess (change speed, volume, target sample rate) time: {}".
+            format(postprocess_time))
+        logger.info("total generate audio time: {}".format(infer_time +
+                                                           postprocess_time))
+        logger.info("RTF: {}".format(rtf))
+        logger.info("device: {}".format(self.device))
+
         return lang, target_sample_rate, wav_base64
diff --git a/paddlespeech/server/restful/tts_api.py b/paddlespeech/server/restful/tts_api.py
index c7e91300da3eabf80755967cfd7eab99c299d7cd..0af0f6d07901d91887b401d5a2dfb411aa9d80b9 100644
--- a/paddlespeech/server/restful/tts_api.py
+++ b/paddlespeech/server/restful/tts_api.py
@@ -16,6 +16,7 @@ from typing import Union
 
 from fastapi import APIRouter
 
+from paddlespeech.cli.log import logger
 from paddlespeech.server.engine.engine_pool import get_engine_pool
 from paddlespeech.server.restful.request import TTSRequest
 from paddlespeech.server.restful.response import ErrorResponse
@@ -60,6 +61,9 @@ def tts(request_body: TTSRequest):
     Returns:
         json: [description]
     """
+
+    logger.info("request: {}".format(request_body))
+
     # get params
     text = request_body.text
     spk_id = request_body.spk_id
@@ -92,6 +96,7 @@ def tts(request_body: TTSRequest):
     # get single engine from engine pool
     engine_pool = get_engine_pool()
     tts_engine = engine_pool['tts']
+    logger.info("Get tts engine successfully.")
 
     lang, target_sample_rate, wav_base64 = tts_engine.run(
         text, spk_id, speed, volume, sample_rate, save_path)
diff --git a/paddlespeech/server/utils/paddle_predictor.py b/paddlespeech/server/utils/paddle_predictor.py
index f4216d74ca9c1cd1a444e61fe5a775db2eca3d85..4035d48d8c9928aa9c537ec3be25eb606a68960b 100644
--- a/paddlespeech/server/utils/paddle_predictor.py
+++ b/paddlespeech/server/utils/paddle_predictor.py
@@ -15,6 +15,7 @@ import os
 from typing import List
 from typing import Optional
 
+import paddle
 from paddle.inference import Config
 from paddle.inference import create_predictor
 
@@ -40,15 +41,30 @@ def init_predictor(model_dir: Optional[os.PathLike]=None,
     else:
         config = Config(model_file, params_file)
 
-    config.enable_memory_optim()
-    if "gpu" in predictor_conf["device"]:
-        gpu_id = predictor_conf["device"].split(":")[-1]
+    # set device
+    if predictor_conf["device"]:
+        device = predictor_conf["device"]
+    else:
+        device = paddle.get_device()
+    if "gpu" in device:
+        gpu_id = device.split(":")[-1]
         config.enable_use_gpu(1000, int(gpu_id))
-    if predictor_conf["enable_mkldnn"]:
-        config.enable_mkldnn()
+
+    # IR optim
     if predictor_conf["switch_ir_optim"]:
         config.switch_ir_optim()
 
+    # glog
+    if not predictor_conf["glog_info"]:
+        config.disable_glog_info()
+
+    # config summary
+    if predictor_conf["summary"]:
+        print(config.summary())
+
+    # memory optim
+    config.enable_memory_optim()
+
     predictor = create_predictor(config)
     return predictor
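
A closing note on the timing changes above: the engines all follow the same bookkeeping — wall-clock the inference call with `time.time()`, derive the audio duration from the sample count, and log RTF as inference time divided by audio duration (the `rtf = infer_time / duration` line in both TTS engines). Below is a minimal, self-contained sketch of that pattern; `synthesize` here is a hypothetical stand-in for the frontend + AM + vocoder pipeline and is not part of this change:

```python
import time

import numpy as np


def synthesize(text: str, sample_rate: int) -> np.ndarray:
    # Hypothetical stand-in for frontend -> acoustic model -> vocoder.
    time.sleep(0.05)  # pretend inference takes 50 ms
    return np.zeros(2 * sample_rate, dtype=np.float32)  # 2 s of silence


sample_rate = 24000

infer_st = time.time()
wav = synthesize("您好,欢迎使用百度飞桨语音合成服务。", sample_rate)
infer_time = time.time() - infer_st

# Audio duration in seconds, derived from the sample count.
duration = len(wav) / sample_rate

# Real-time factor: seconds of compute per second of audio produced.
# RTF < 1 means synthesis runs faster than real time.
rtf = infer_time / duration

print("audio duration: {} s".format(duration))
print("total inference time: {} s".format(infer_time))
print("RTF: {}".format(rtf))
```

Measured this way, RTF reflects engine speed alone, which is presumably why this change drops the client-side RTF log and has the server-side engines report it instead: the client's response time includes network and serialization overhead that would distort the figure.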