未验证 提交 240520c0 编写于 作者: Q qingen 提交者: GitHub

Merge branch 'PaddlePaddle:develop' into cluster

......@@ -37,7 +37,7 @@ Model Type | Dataset| Example Link | Pretrained Models|Static Models|Size (stati
Tacotron2|LJSpeech|[tacotron2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts0)|[tacotron2_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.2.0.zip)|||
Tacotron2|CSMSC|[tacotron2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts0)|[tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)|[tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip)|103MB|
TransformerTTS| LJSpeech| [transformer-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts1)|[transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)|||
SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2) |[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)|[speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)|12MB|
SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2)|[speedyspeech_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_ckpt_0.2.0.zip)|[speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)|12MB|
FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip)|157MB|
FastSpeech2-Conformer| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip)|||
FastSpeech2| AISHELL-3 |[fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3)|[fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip)|||
......
......@@ -5,7 +5,7 @@ source path.sh
gpus=0,1,2,3
stage=0
stop_stage=100
conf_path=conf/deepspeech2.yaml #conf/deepspeech2.yaml or conf/deepspeeech2_online.yaml
conf_path=conf/deepspeech2.yaml #conf/deepspeech2.yaml or conf/deepspeech2_online.yaml
decode_conf_path=conf/tuning/decode.yaml
avg_num=1
model_type=offline # offline or online
......
......@@ -223,22 +223,28 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path}
## Pretrained Model
Pretrained SpeedySpeech model with no silence in the edge of audios:
- [speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)
- [speedyspeech_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_ckpt_0.2.0.zip)
The static model can be downloaded here:
- [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip)
- [speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)
The ONNX model can be downloaded here:
- [speedyspeech_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_onnx_0.2.0.zip)
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/ssim_loss
:-------------:| :------------:| :-----: | :-----: | :--------:|:--------:
default| 1(gpu) x 11400|0.83655|0.42324|0.03211| 0.38119
default| 1(gpu) x 11400|0.79532|0.400246|0.030259| 0.36482
SpeedySpeech checkpoint contains files listed below.
```text
speedyspeech_nosil_baker_ckpt_0.5
speedyspeech_csmsc_ckpt_0.2.0
├── default.yaml # default config used to train speedyspeech
├── feats_stats.npy # statistics used to normalize spectrogram when training speedyspeech
├── phone_id_map.txt # phone vocabulary file when training speedyspeech
├── snapshot_iter_11400.pdz # model parameters and optimizer states
├── snapshot_iter_30600.pdz # model parameters and optimizer states
└── tone_id_map.txt # tone vocabulary file when training speedyspeech
```
You can use the following scripts to synthesize for `${BIN_DIR}/../sentences.txt` using pretrained speedyspeech and parallel wavegan models.
......@@ -249,9 +255,9 @@ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=speedyspeech_csmsc \
--am_config=speedyspeech_nosil_baker_ckpt_0.5/default.yaml \
--am_ckpt=speedyspeech_nosil_baker_ckpt_0.5/snapshot_iter_11400.pdz \
--am_stat=speedyspeech_nosil_baker_ckpt_0.5/feats_stats.npy \
--am_config=speedyspeech_csmsc_ckpt_0.2.0/default.yaml \
--am_ckpt=speedyspeech_csmsc_ckpt_0.2.0/snapshot_iter_30600.pdz \
--am_stat=speedyspeech_csmsc_ckpt_0.2.0/feats_stats.npy \
--voc=pwgan_csmsc \
--voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
......@@ -260,6 +266,6 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=exp/default/test_e2e \
--inference_dir=exp/default/inference \
--phones_dict=speedyspeech_nosil_baker_ckpt_0.5/phone_id_map.txt \
--tones_dict=speedyspeech_nosil_baker_ckpt_0.5/tone_id_map.txt
--phones_dict=speedyspeech_csmsc_ckpt_0.2.0/phone_id_map.txt \
--tones_dict=speedyspeech_csmsc_ckpt_0.2.0/tone_id_map.txt
```
train_output_path=$1
stage=0
stop_stage=0
# only support default_fastspeech2/speedyspeech + hifigan/mb_melgan now!
# synthesize from metadata
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/../ort_predict.py \
--inference_dir=${train_output_path}/inference_onnx \
--am=speedyspeech_csmsc \
--voc=hifigan_csmsc \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/onnx_infer_out \
--device=cpu \
--cpu_threads=2
fi
# e2e, synthesize from text
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
python3 ${BIN_DIR}/../ort_predict_e2e.py \
--inference_dir=${train_output_path}/inference_onnx \
--am=speedyspeech_csmsc \
--voc=hifigan_csmsc \
--output_dir=${train_output_path}/onnx_infer_out_e2e \
--text=${BIN_DIR}/../csmsc_test.txt \
--phones_dict=dump/phone_id_map.txt \
--tones_dict=dump/tone_id_map.txt \
--device=cpu \
--cpu_threads=2
fi
../../tts3/local/paddle2onnx.sh
\ No newline at end of file
......@@ -40,3 +40,25 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# inference with static model
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1
fi
# paddle2onnx, please make sure the static models are in ${train_output_path}/inference first
# we have only tested the following models so far
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '0.9.4' ]]; then
pip install paddle2onnx==0.9.4
fi
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx speedyspeech_csmsc
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc
fi
# inference with onnxruntime, use fastspeech2 + hifigan by default
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
# install onnxruntime
version=$(echo `pip list |grep "onnxruntime"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.10.0' ]]; then
pip install onnxruntime==1.10.0
fi
./local/ort_predict.sh ${train_output_path}
fi
......@@ -3,7 +3,7 @@ train_output_path=$1
stage=0
stop_stage=0
# only support default_fastspeech2 + hifigan/mb_melgan now!
# only support default_fastspeech2/speedyspeech + hifigan/mb_melgan now!
# synthesize from metadata
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
......
......@@ -19,4 +19,5 @@ paddle2onnx \
--model_filename ${model}.pdmodel \
--params_filename ${model}.pdiparams \
--save_file ${train_output_path}/${output_dir}/${model}.onnx \
--opset_version 11 \
--enable_dev_version ${enable_dev_version}
\ No newline at end of file
......@@ -133,6 +133,9 @@ The pretrained model can be downloaded here:
The static model can be downloaded here:
- [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip)
The ONNX model can be downloaded here:
- [pwgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_csmsc_onnx_0.2.0.zip)
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss| eval/spectral_convergence_loss
:-------------:| :------------:| :-----: | :-----: | :--------:
default| 1(gpu) x 400000|1.948763|0.670098|0.248882
......
......@@ -26,10 +26,10 @@ from paddlespeech.s2t.utils.log import Log
#TODO(Hui Zhang): remove fluid import
logger = Log(__name__).getlog()
########### hcak logging #############
########### hack logging #############
logger.warn = logger.warning
########### hcak paddle #############
########### hack paddle #############
paddle.half = 'float16'
paddle.float = 'float32'
paddle.double = 'float64'
......@@ -110,7 +110,7 @@ if not hasattr(paddle, 'cat'):
paddle.cat = cat
########### hcak paddle.Tensor #############
########### hack paddle.Tensor #############
def item(x: paddle.Tensor):
return x.numpy().item()
......@@ -353,7 +353,7 @@ if not hasattr(paddle.Tensor, 'tolist'):
setattr(paddle.Tensor, 'tolist', tolist)
########### hcak paddle.nn #############
########### hack paddle.nn #############
class GLU(nn.Layer):
"""Gated Linear Units (GLU) Layer"""
......
......@@ -4,7 +4,7 @@
augment: True
batch_size: 32
num_workers: 2
num_speakers: 1211 # 1211 vox1, 5994 vox2, 7205 vox1+2, test speakers: 41
num_speakers: 7205 # 1211 vox1, 5994 vox2, 7205 vox1+2, test speakers: 41
shuffle: True
skip_prep: False
split_ratio: 0.9
......@@ -42,8 +42,16 @@ epochs: 10
save_interval: 10
log_interval: 10
learning_rate: 1e-8
max_lr: 1e-3
step_size: 140000
###########################################
# loss #
###########################################
margin: 0.2
scale: 30
###########################################
# Testing #
###########################################
......
......@@ -2,7 +2,7 @@
# Data #
###########################################
augment: True
batch_size: 16
batch_size: 32
num_workers: 2
num_speakers: 1211 # 1211 vox1, 5994 vox2, 7205 vox1+2, test speakers: 41
shuffle: True
......@@ -42,7 +42,14 @@ epochs: 100
save_interval: 10
log_interval: 10
learning_rate: 1e-8
max_lr: 1e-3
step_size: 140000
###########################################
# loss #
###########################################
margin: 0.2
scale: 30
###########################################
# Testing #
......
......@@ -341,7 +341,7 @@ def stft(x: np.ndarray,
hop_length (Optional[int], optional): Number of steps to advance between adjacent windows. Defaults to None.
win_length (Optional[int], optional): The size of window. Defaults to None.
window (str, optional): A string of window specification. Defaults to "hann".
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
dtype (type, optional): Data type of STFT results. Defaults to np.complex64.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to "reflect".
......@@ -509,7 +509,7 @@ def melspectrogram(x: np.ndarray,
fmin (float, optional): Minimum frequency in Hz. Defaults to 50.0.
fmax (Optional[float], optional): Maximum frequency in Hz. Defaults to None.
window (str, optional): A string of window specification. Defaults to "hann".
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to "reflect".
power (float, optional): Exponent for the magnitude melspectrogram. Defaults to 2.0.
to_db (bool, optional): Enable db scale. Defaults to True.
......@@ -564,7 +564,7 @@ def spectrogram(x: np.ndarray,
window_size (int, optional): Size of FFT and window length. Defaults to 512.
hop_length (int, optional): Number of steps to advance between adjacent windows. Defaults to 320.
window (str, optional): A string of window specification. Defaults to "hann".
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to "reflect".
power (float, optional): Exponent for the magnitude melspectrogram. Defaults to 2.0.
......
......@@ -42,7 +42,7 @@ class Spectrogram(nn.Layer):
win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
dtype (str, optional): Data type of input and window. Defaults to 'float32'.
"""
......@@ -99,7 +99,7 @@ class MelSpectrogram(nn.Layer):
win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
n_mels (int, optional): Number of mel bins. Defaults to 64.
f_min (float, optional): Minimum frequency in Hz. Defaults to 50.0.
......@@ -176,7 +176,7 @@ class LogMelSpectrogram(nn.Layer):
win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
n_mels (int, optional): Number of mel bins. Defaults to 64.
f_min (float, optional): Minimum frequency in Hz. Defaults to 50.0.
......@@ -257,7 +257,7 @@ class MFCC(nn.Layer):
win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
n_mels (int, optional): Number of mel bins. Defaults to 64.
f_min (float, optional): Minimum frequency in Hz. Defaults to 50.0.
......
......@@ -43,13 +43,13 @@ pretrained_models = {
# speedyspeech
"speedyspeech_csmsc-zh": {
'url':
'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip',
'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_ckpt_0.2.0.zip',
'md5':
'9edce23b1a87f31b814d9477bf52afbc',
'6f6fa967b408454b6662c8c00c0027cb',
'config':
'default.yaml',
'ckpt':
'snapshot_iter_11400.pdz',
'snapshot_iter_30600.pdz',
'speech_stats':
'feats_stats.npy',
'phones_dict':
......
......@@ -26,10 +26,10 @@ from paddlespeech.s2t.utils.log import Log
#TODO(Hui Zhang): remove fluid import
logger = Log(__name__).getlog()
########### hcak logging #############
########### hack logging #############
logger.warn = logger.warning
########### hcak paddle #############
########### hack paddle #############
paddle.half = 'float16'
paddle.float = 'float32'
paddle.double = 'float64'
......@@ -110,7 +110,7 @@ if not hasattr(paddle, 'cat'):
paddle.cat = cat
########### hcak paddle.Tensor #############
########### hack paddle.Tensor #############
def item(x: paddle.Tensor):
return x.numpy().item()
......
......@@ -37,7 +37,7 @@ pretrained_models = {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz',
'md5':
'23e16c69730a1cb5d735c98c83c21e16',
'd5e076217cf60486519f72c217d21b9b',
'cfg_path':
'model.yaml',
'ckpt_path':
......
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright 2021 Mobvoi Inc. All Rights Reserved.
# Author: zhendong.peng@mobvoi.com (Zhendong Peng)
import argparse
from flask import Flask, render_template
parser = argparse.ArgumentParser(description='training your network')
parser.add_argument('--port', default=19999, type=int, help='port id')
args = parser.parse_args()
app = Flask(__name__)
@app.route('/')
def index():
return render_template('index.html')
if __name__ == '__main__':
app.run(host='0.0.0.0', port=args.port, debug=True)
/*
* @Author: baipengxia
* @Date: 2021-03-12 11:44:28
* @Last Modified by: baipengxia
* @Last Modified time: 2021-03-12 15:14:24
*/
/** COMMON RESET **/
* {
-webkit-tap-highlight-color: rgba(0, 0, 0, 0);
}
body,
h1,
h2,
h3,
h4,
h5,
h6,
hr,
p,
dl,
dt,
dd,
ul,
ol,
li,
fieldset,
lengend,
button,
input,
textarea,
th,
td {
margin: 0;
padding: 0;
color: #000;
}
body {
font-size: 14px;
}
html, body {
min-width: 1200px;
}
button,
input,
select,
textarea {
font-size: 14px;
}
h1 {
font-size: 18px;
}
h2 {
font-size: 14px;
}
h3 {
font-size: 14px;
}
ul,
ol,
li {
list-style: none;
}
a {
text-decoration: none;
}
a:hover {
text-decoration: none;
}
fieldset,
img {
border: none;
}
table {
border-collapse: collapse;
border-spacing: 0;
}
i {
font-style: normal;
}
label {
position: inherit;
}
.clearfix:after {
content: ".";
display: block;
height: 0;
clear: both;
visibility: hidden;
}
.clearfix {
zoom: 1;
display: block;
}
html,
body {
font-family: Tahoma, Arial, 'microsoft yahei', 'Roboto', 'Droid Sans', 'Helvetica Neue', 'Droid Sans Fallback', 'Heiti SC', 'Hiragino Sans GB', 'Simsun', 'sans-self';
}
.audio-banner {
width: 100%;
overflow: auto;
padding: 0;
background: url('../image/voice-dictation.svg');
background-size: cover;
}
.weaper {
width: 1200px;
height: 155px;
margin: 72px auto;
}
.text-content {
width: 670px;
height: 100%;
float: left;
}
.text-content .title {
font-size: 34px;
font-family: 'PingFangSC-Medium';
font-weight: 500;
color: rgba(255, 255, 255, 1);
line-height: 48px;
}
.text-content .con {
font-size: 16px;
font-family: PingFangSC-Light;
font-weight: 300;
color: rgba(255, 255, 255, 1);
line-height: 30px;
}
.img-con {
width: 416px;
height: 100%;
float: right;
}
.img-con img {
width: 100%;
height: 100%;
}
.con-container {
margin-top: 34px;
}
.audio-advantage {
background: #f8f9fa;
}
.asr-advantage {
width: 1200px;
margin: 0 auto;
}
.asr-advantage h2 {
text-align: center;
font-size: 22px;
padding: 30px 0 0 0;
}
.asr-advantage > ul > li {
box-sizing: border-box;
padding: 0 16px;
width: 33%;
text-align: center;
margin-bottom: 35px;
}
.asr-advantage > ul > li .icons{
margin-top: 10px;
margin-bottom: 20px;
width: 42px;
height: 42px;
}
.service-item-content {
margin-top: 35px;
display: flex;
justify-content: center;
flex-wrap: wrap;
}
.service-item-content img {
width: 160px;
vertical-align: bottom;
}
.service-item-content > li {
box-sizing: border-box;
padding: 0 16px;
width: 33%;
text-align: center;
margin-bottom: 35px;
}
.service-item-content > li .service-item-content-title {
line-height: 1.5;
font-weight: 700;
margin-top: 10px;
}
.service-item-content > li .service-item-content-desc {
margin-top: 5px;
line-height: 1.8;
color: #657384;
}
.audio-scene-con {
width: 100%;
padding-bottom: 84px;
background: #fff;
}
.audio-scene {
overflow: auto;
width: 1200px;
background: #fff;
text-align: center;
padding: 0;
margin: 0 auto;
}
.audio-scene h2 {
padding: 30px 0 0 0;
font-size: 22px;
text-align: center;
}
.audio-experience {
width: 100%;
height: 538px;
background: #fff;
padding: 0;
margin: 0;
overflow: auto;
}
.asr-box {
width: 1200px;
height: 394px;
margin: 64px auto;
}
.asr-box h2 {
font-size: 22px;
text-align: center;
margin-bottom: 64px;
}
.voice-container {
position: relative;
width: 1200px;
height: 308px;
background: rgba(255, 255, 255, 1);
border-radius: 8px;
border: 1px solid rgba(225, 225, 225, 1);
}
.voice-container .voice {
height: 236px;
width: 100%;
border-radius: 8px;
}
.voice-container .voice textarea {
height: 100%;
width: 100%;
border: none;
outline: none;
border-radius: 8px;
padding: 25px;
font-size: 14px;
box-sizing: border-box;
resize: none;
}
.voice-input {
width: 100%;
height: 72px;
box-sizing: border-box;
padding-left: 35px;
background: rgba(242, 244, 245, 1);
border-radius: 8px;
line-height: 72px;
}
.voice-input .el-select {
width: 492px;
}
.start-voice {
display: inline-block;
margin-left: 10px;
}
.start-voice .time {
margin-right: 25px;
}
.asr-advantage > ul > li {
margin-bottom: 77px;
}
#msg {
width: 100%;
line-height: 40px;
font-size: 14px;
margin-left: 330px;
}
#captcha {
margin-left: 350px !important;
display: inline-block;
position: relative;
}
.black {
position: fixed;
width: 100%;
height: 100%;
z-index: 5;
background: rgba(0, 0, 0, 0.5);
top: 0;
left: 0;
}
.container {
position: fixed;
z-index: 6;
top: 25%;
left: 10%;
}
.audio-scene-con {
width: 100%;
padding-bottom: 84px;
background: #fff;
}
#sound {
color: #fff;
cursor: pointer;
background: #147ede;
padding: 10px;
margin-top: 30px;
margin-left: 135px;
width: 176px;
height: 30px !important;
text-align: center;
line-height: 30px !important;
border-radius: 10px;
}
.con-ten {
position: absolute;
width: 100%;
height: 100%;
z-index: 5;
background: #fff;
opacity: 0.5;
top: 0;
left: 0;
}
.websocket-url {
width: 320px;
height: 20px;
border: 1px solid #dcdfe6;
line-height: 20px;
padding: 10px;
border-radius: 4px;
}
.voice-btn {
color: #fff;
background-color: #409eff;
font-weight: 500;
padding: 12px 20px;
font-size: 14px;
border-radius: 4px;
border: 0;
cursor: pointer;
}
.voice-btn.end {
display: none;
}
.result-text {
background: #fff;
padding: 20px;
}
.voice-footer {
border-top: 1px solid #dddede;
background: #f7f9fa;
text-align: center;
margin-bottom: 8px;
color: #333;
font-size: 12px;
padding: 20px 0;
}
/** line animate **/
.time-box {
display: none;
margin-left: 10px;
width: 300px;
}
.total-time {
font-size: 14px;
color: #545454;
}
.voice-btn.end.show,
.time-box.show {
display: inline;
}
.start-taste-line {
margin-right: 20px;
display: inline-block;
}
.start-taste-line hr {
background-color: #187cff;
width: 3px;
height: 8px;
margin: 0 3px;
display: inline-block;
border: none;
}
.hr {
animation: note 0.2s ease-in-out;
animation-iteration-count: infinite;
animation-direction: alternate;
}
.hr-one {
animation-delay: -0.9s;
}
.hr-two {
animation-delay: -0.8s;
}
.hr-three {
animation-delay: -0.7s;
}
.hr-four {
animation-delay: -0.6s;
}
.hr-five {
animation-delay: -0.5s;
}
.hr-six {
animation-delay: -0.4s;
}
.hr-seven {
animation-delay: -0.3s;
}
.hr-eight {
animation-delay: -0.2s;
}
.hr-nine {
animation-delay: -0.1s;
}
@keyframes note {
from {
transform: scaleY(1);
}
to {
transform: scaleY(4);
}
}
\ No newline at end of file
因为 它太大了无法显示 source diff 。你可以改为 查看blob
因为 它太大了无法显示 source diff 。你可以改为 查看blob
SoundRecognizer = {
rec: null,
wave: null,
SampleRate: 16000,
testBitRate: 16,
isCloseRecorder: false,
SendInterval: 300,
realTimeSendTryType: 'pcm',
realTimeSendTryEncBusy: 0,
realTimeSendTryTime: 0,
realTimeSendTryNumber: 0,
transferUploadNumberMax: 0,
realTimeSendTryChunk: null,
soundType: "pcm",
init: function (config) {
this.soundType = config.soundType || 'pcm';
this.SampleRate = config.sampleRate || 16000;
this.recwaveElm = config.recwaveElm || '';
this.TransferUpload = config.translerCallBack || this.TransferProcess;
this.initRecorder();
},
RealTimeSendTryReset: function (type) {
this.realTimeSendTryType = type;
this.realTimeSendTryTime = 0;
},
RealTimeSendTry: function (rec, isClose) {
var that = this;
var t1 = Date.now(), endT = 0, recImpl = Recorder.prototype;
if (this.realTimeSendTryTime == 0) {
this.realTimeSendTryTime = t1;
this.realTimeSendTryEncBusy = 0;
this.realTimeSendTryNumber = 0;
this.transferUploadNumberMax = 0;
this.realTimeSendTryChunk = null;
}
if (!isClose && t1 - this.realTimeSendTryTime < this.SendInterval) {
return;//控制缓冲达到指定间隔才进行传输
}
this.realTimeSendTryTime = t1;
var number = ++this.realTimeSendTryNumber;
//借用SampleData函数进行数据的连续处理,采样率转换是顺带的
var chunk = Recorder.SampleData(rec.buffers, rec.srcSampleRate, this.SampleRate, this.realTimeSendTryChunk, { frameType: isClose ? "" : this.realTimeSendTryType });
//清理已处理完的缓冲数据,释放内存以支持长时间录音,最后完成录音时不能调用stop,因为数据已经被清掉了
for (var i = this.realTimeSendTryChunk ? this.realTimeSendTryChunk.index : 0; i < chunk.index; i++) {
rec.buffers[i] = null;
}
this.realTimeSendTryChunk = chunk;
//没有新数据,或结束时的数据量太小,不能进行mock转码
if (chunk.data.length == 0 || isClose && chunk.data.length < 2000) {
this.TransferUpload(number, null, 0, null, isClose);
return;
}
//实时编码队列阻塞处理
if (!isClose) {
if (this.realTimeSendTryEncBusy >= 2) {
console.log("编码队列阻塞,已丢弃一帧", 1);
return;
}
}
this.realTimeSendTryEncBusy++;
//通过mock方法实时转码成mp3、wav
var encStartTime = Date.now();
var recMock = Recorder({
type: this.realTimeSendTryType
, sampleRate: this.SampleRate //采样率
, bitRate: this.testBitRate //比特率
});
recMock.mock(chunk.data, chunk.sampleRate);
recMock.stop(function (blob, duration) {
that.realTimeSendTryEncBusy && (that.realTimeSendTryEncBusy--);
blob.encTime = Date.now() - encStartTime;
//转码好就推入传输
that.TransferUpload(number, blob, duration, recMock, isClose);
}, function (msg) {
that.realTimeSendTryEncBusy && (that.realTimeSendTryEncBusy--);
//转码错误?没想到什么时候会产生错误!
console.log("不应该出现的错误:" + msg, 1);
});
},
recordClose: function () {
try {
this.rec.close(function () {
this.isCloseRecorder = true;
});
this.RealTimeSendTry(this.rec, true);//最后一次发送
} catch (ex) {
// recordClose();
}
},
recordEnd: function () {
try {
this.rec.stop(function (blob, time) {
this.recordClose();
}, function (s) {
this.recordClose();
});
} catch (ex) {
}
},
initRecorder: function () {
var that = this;
var rec = Recorder({
type: that.soundType
, bitRate: that.testBitRate
, sampleRate: that.SampleRate
, onProcess: function (buffers, level, time, sampleRate) {
that.wave.input(buffers[buffers.length - 1], level, sampleRate);
that.RealTimeSendTry(rec, false);//推入实时处理,因为是unknown格式,这里简化函数调用,没有用到buffers和bufferSampleRate,因为这些数据和rec.buffers是完全相同的。
}
});
rec.open(function () {
that.wave = Recorder.FrequencyHistogramView({
elem: that.recwaveElm, lineCount: 90
, position: 0
, minHeight: 1
, stripeEnable: false
});
rec.start();
that.isCloseRecorder = false;
that.RealTimeSendTryReset(that.soundType);//重置
});
this.rec = rec;
},
TransferProcess: function (number, blobOrNull, duration, blobRec, isClose) {
}
}
\ No newline at end of file
/*
录音
https://github.com/xiangyuecn/Recorder
src: engine/pcm.js
*/
!function(){"use strict";Recorder.prototype.enc_pcm={stable:!0,testmsg:"pcm为未封装的原始音频数据,pcm数据文件无法直接播放;支持位数8位、16位(填在比特率里面),采样率取值无限制"},Recorder.prototype.pcm=function(e,t,r){var a=this.set,n=e.length,o=8==a.bitRate?8:16,c=new ArrayBuffer(n*(o/8)),s=new DataView(c),l=0;if(8==o)for(var p=0;p<n;p++,l++){var i=128+(e[p]>>8);s.setInt8(l,i,!0)}else for(p=0;p<n;p++,l+=2)s.setInt16(l,e[p],!0);t(new Blob([s.buffer],{type:"audio/pcm"}))},Recorder.pcm2wav=function(e,a,n){e.slice&&null!=e.type&&(e={blob:e});var o=e.sampleRate||16e3,c=e.bitRate||16;if(e.sampleRate&&e.bitRate||console.warn("pcm2wav必须提供sampleRate和bitRate"),Recorder.prototype.wav){var s=new FileReader;s.onloadend=function(){var e;if(8==c){var t=new Uint8Array(s.result);e=new Int16Array(t.length);for(var r=0;r<t.length;r++)e[r]=t[r]-128<<8}else e=new Int16Array(s.result);Recorder({type:"wav",sampleRate:o,bitRate:c}).mock(e,o).stop(function(e,t){a(e,t)},n)},s.readAsArrayBuffer(e.blob)}else n("pcm2wav必须先加载wav编码器wav.js")}}();
\ No newline at end of file
/*
录音
https://github.com/xiangyuecn/Recorder
src: engine/wav.js
*/
!function(){"use strict";Recorder.prototype.enc_wav={stable:!0,testmsg:"支持位数8位、16位(填在比特率里面),采样率取值无限制"},Recorder.prototype.wav=function(t,e,n){var r=this.set,a=t.length,o=r.sampleRate,f=8==r.bitRate?8:16,i=a*(f/8),s=new ArrayBuffer(44+i),c=new DataView(s),u=0,v=function(t){for(var e=0;e<t.length;e++,u++)c.setUint8(u,t.charCodeAt(e))},w=function(t){c.setUint16(u,t,!0),u+=2},l=function(t){c.setUint32(u,t,!0),u+=4};if(v("RIFF"),l(36+i),v("WAVE"),v("fmt "),l(16),w(1),w(1),l(o),l(o*(f/8)),w(f/8),w(f),v("data"),l(i),8==f)for(var p=0;p<a;p++,u++){var d=128+(t[p]>>8);c.setInt8(u,d,!0)}else for(p=0;p<a;p++,u+=2)c.setInt16(u,t[p],!0);e(new Blob([c.buffer],{type:"audio/wav"}))}}();
\ No newline at end of file
/*
录音
https://github.com/xiangyuecn/Recorder
src: extensions/frequency.histogram.view.js
*/
!function(){"use strict";var t=function(t){return new e(t)},e=function(t){var e=this,r={scale:2,fps:20,lineCount:30,widthRatio:.6,spaceWidth:0,minHeight:0,position:-1,mirrorEnable:!1,stripeEnable:!0,stripeHeight:3,stripeMargin:6,fallDuration:1e3,stripeFallDuration:3500,linear:[0,"rgba(0,187,17,1)",.5,"rgba(255,215,0,1)",1,"rgba(255,102,0,1)"],stripeLinear:null,shadowBlur:0,shadowColor:"#bbb",stripeShadowBlur:-1,stripeShadowColor:"",onDraw:function(t,e){}};for(var a in t)r[a]=t[a];e.set=t=r;var i=t.elem;i&&("string"==typeof i?i=document.querySelector(i):i.length&&(i=i[0])),i&&(t.width=i.offsetWidth,t.height=i.offsetHeight);var o=t.scale,l=t.width*o,n=t.height*o,h=e.elem=document.createElement("div"),s=["","transform-origin:0 0;","transform:scale("+1/o+");"];h.innerHTML='<div style="width:'+t.width+"px;height:"+t.height+'px;overflow:hidden"><div style="width:'+l+"px;height:"+n+"px;"+s.join("-webkit-")+s.join("-ms-")+s.join("-moz-")+s.join("")+'"><canvas/></div></div>';var f=e.canvas=h.querySelector("canvas");e.ctx=f.getContext("2d");if(f.width=l,f.height=n,i&&(i.innerHTML="",i.appendChild(h)),!Recorder.LibFFT)throw new Error("需要lib.fft.js支持");e.fft=Recorder.LibFFT(1024),e.lastH=[],e.stripesH=[]};e.prototype=t.prototype={genLinear:function(t,e,r,a){for(var i=t.createLinearGradient(0,r,0,a),o=0;o<e.length;)i.addColorStop(e[o++],e[o++]);return i},input:function(t,e,r){var a=this;a.sampleRate=r,a.pcmData=t,a.pcmPos=0,a.inputTime=Date.now(),a.schedule()},schedule:function(){var t=this,e=t.set,r=Math.floor(1e3/e.fps);t.timer||(t.timer=setInterval(function(){t.schedule()},r));var a=Date.now(),i=t.drawTime||0;if(a-t.inputTime>1.3*e.stripeFallDuration)return clearInterval(t.timer),void(t.timer=0);if(!(a-i<r)){t.drawTime=a;for(var o=t.fft.bufferSize,l=t.pcmData,n=t.pcmPos,h=new Int16Array(o),s=0;s<o&&n<l.length;s++,n++)h[s]=l[n];t.pcmPos=n;var f=t.fft.transform(h);t.draw(f,t.sampleRate)}},draw:function(t,e){var r=this,a=r.set,i=r.ctx,o=a.scale,l=a.width*o,n=a.height*o,h=a.lineCount,s=r.fft.bufferSize,f=a.position,d=Math.abs(a.position),c=1==f?0:n,p=n;d<1&&(c=p/=2,p=Math.floor(p*(1+d)),c=Math.floor(0<f?c*(1-d):c*(1+d)));for(var u=r.lastH,v=r.stripesH,w=Math.ceil(p/(a.fallDuration/(1e3/a.fps))),g=Math.ceil(p/(a.stripeFallDuration/(1e3/a.fps))),m=a.stripeMargin*o,M=1<<(Math.round(Math.log(s)/Math.log(2)+3)<<1),b=Math.log(M)/Math.log(10),L=20*Math.log(32767)/Math.log(10),y=s/2,S=Math.min(y,Math.floor(5e3*y/(e/2))),C=S==y,H=C?h:Math.round(.8*h),R=S/H,D=C?0:(y-S)/(h-H),x=0,F=0;F<h;F++){var T=Math.ceil(x);x+=F<H?R:D;for(var B=Math.min(Math.ceil(x),y),E=0,j=T;j<B;j++)E=Math.max(E,Math.abs(t[j]));var I=M<E?Math.floor(17*(Math.log(E)/Math.log(10)-b)):0,q=p*Math.min(I/L,1);u[F]=(u[F]||0)-w,q<u[F]&&(q=u[F]),q<0&&(q=0),u[F]=q;var z=v[F]||0;if(q&&z<q+m)v[F]=q+m;else{var P=z-g;P<0&&(P=0),v[F]=P}}i.clearRect(0,0,l,n);var W=r.genLinear(i,a.linear,c,c-p),k=a.stripeLinear&&r.genLinear(i,a.stripeLinear,c,c-p)||W,A=r.genLinear(i,a.linear,c,c+p),G=a.stripeLinear&&r.genLinear(i,a.stripeLinear,c,c+p)||A;i.shadowBlur=a.shadowBlur*o,i.shadowColor=a.shadowColor;var V=a.mirrorEnable,J=V?2*h-1:h,K=a.widthRatio,N=a.spaceWidth*o;0!=N&&(K=(l-N*(J+1))/l);for(var O=Math.max(1*o,Math.floor(l*K/J)),Q=(l-J*O)/(J+1),U=a.minHeight*o,X=V?l/2-(Q+O/2):0,Y=(F=0,X);F<h;F++)Y+=Q,$=Math.floor(Y),q=Math.max(u[F],U),0!=c&&(_=c-q,i.fillStyle=W,i.fillRect($,_,O,q)),c!=n&&(i.fillStyle=A,i.fillRect($,c,O,q)),Y+=O;if(a.stripeEnable){var Z=a.stripeShadowBlur;i.shadowBlur=(-1==Z?a.shadowBlur:Z)*o,i.shadowColor=a.stripeShadowColor||a.shadowColor;var $,_,tt=a.stripeHeight*o;for(F=0,Y=X;F<h;F++)Y+=Q,$=Math.floor(Y),q=v[F],0!=c&&((_=c-q-tt)<0&&(_=0),i.fillStyle=k,i.fillRect($,_,O,tt)),c!=n&&(n<(_=c+q)+tt&&(_=n-tt),i.fillStyle=G,i.fillRect($,_,O,tt)),Y+=O}if(V){var et=Math.floor(l/2);i.save(),i.scale(-1,1),i.drawImage(r.canvas,Math.ceil(l/2),0,et,n,-et,0,et,n),i.restore()}a.onDraw(t,e)}},Recorder.FrequencyHistogramView=t}();
\ No newline at end of file
/*
录音
https://github.com/xiangyuecn/Recorder
src: extensions/lib.fft.js
*/
Recorder.LibFFT=function(r){"use strict";var s,v,d,l,F,b,g,m;return function(r){var o,t,a,f;for(s=Math.round(Math.log(r)/Math.log(2)),d=((v=1<<s)<<2)*Math.sqrt(2),l=[],F=[],b=[0],g=[0],m=[],o=0;o<v;o++){for(a=o,f=t=0;t!=s;t++)f<<=1,f|=1&a,a>>>=1;m[o]=f}var n,u=2*Math.PI/v;for(o=(v>>1)-1;0<o;o--)n=o*u,g[o]=Math.cos(n),b[o]=Math.sin(n)}(r),{transform:function(r){var o,t,a,f,n,u,e,h,M=1,i=s-1;for(o=0;o!=v;o++)l[o]=r[m[o]],F[o]=0;for(o=s;0!=o;o--){for(t=0;t!=M;t++)for(n=g[t<<i],u=b[t<<i],a=t;a<v;a+=M<<1)e=n*l[f=a+M]-u*F[f],h=n*F[f]+u*l[f],l[f]=l[a]-e,F[f]=F[a]-h,l[a]+=e,F[a]+=h;M<<=1,i--}t=v>>1;var c=new Float64Array(t);for(n=-(u=d),o=t;0!=o;o--)e=l[o],h=F[o],c[o-1]=n<e&&e<u&&n<h&&h<u?0:Math.round(e*e+h*h);return c},bufferSize:v}};
\ No newline at end of file
/*
录音
https://github.com/xiangyuecn/Recorder
src: recorder-core.js
*/
!function(y){"use strict";var h=function(){},A=function(e){return new t(e)};A.IsOpen=function(){var e=A.Stream;if(e){var t=e.getTracks&&e.getTracks()||e.audioTracks||[],n=t[0];if(n){var r=n.readyState;return"live"==r||r==n.LIVE}}return!1},A.BufferSize=4096,A.Destroy=function(){for(var e in M("Recorder Destroy"),g(),n)n[e]()};var n={};A.BindDestroy=function(e,t){n[e]=t},A.Support=function(){var e=y.AudioContext;if(e||(e=y.webkitAudioContext),!e)return!1;var t=navigator.mediaDevices||{};return t.getUserMedia||(t=navigator).getUserMedia||(t.getUserMedia=t.webkitGetUserMedia||t.mozGetUserMedia||t.msGetUserMedia),!!t.getUserMedia&&(A.Scope=t,A.Ctx&&"closed"!=A.Ctx.state||(A.Ctx=new e,A.BindDestroy("Ctx",function(){var e=A.Ctx;e&&e.close&&(e.close(),A.Ctx=0)})),!0)};var k="ConnectEnableWorklet";A[k]=!1;var d=function(e){var t=(e=e||A).BufferSize||A.BufferSize,r=A.Ctx,n=e.Stream,a=n._m=r.createMediaStreamSource(n),u=n._call,o=function(e,t){if(!t||h)for(var n in u){for(var r=t||e.inputBuffer.getChannelData(0),a=r.length,o=new Int16Array(a),s=0,i=0;i<a;i++){var c=Math.max(-1,Math.min(1,r[i]));c=c<0?32768*c:32767*c,o[i]=c,s+=Math.abs(c)}for(var f in u)u[f](o,s);return}else M(l+"多余回调",3)},s="ScriptProcessor",l="audioWorklet",i="Recorder",c=i+" "+l,f="RecProc",p=r.createScriptProcessor||r.createJavaScriptNode,v="。由于"+l+"内部1秒375次回调,在移动端可能会有性能问题导致回调丢失录音变短,PC端无影响,暂不建议开启"+l+"",m=function(){h=n.isWorklet=!1,I(n),M("Connect采用老的"+s+""+(A[k]?"但已":"")+"设置"+i+"."+k+"=true尝试启用"+l+v,3);var e=n._p=p.call(r,t,1,1);a.connect(e),e.connect(r.destination),e.onaudioprocess=function(e){o(e)}},h=n.isWorklet=!p||A[k],d=y.AudioWorkletNode;if(h&&r[l]&&d){var g,S=function(){return h&&n._na},_=n._na=function(){""!==g&&(clearTimeout(g),g=setTimeout(function(){g=0,M(l+"未返回任何音频,恢复使用"+s,3),S()&&p&&m()},500))},C=function(){if(S()){var e=n._n=new d(r,f,{processorOptions:{bufferSize:t}});a.connect(e),e.connect(r.destination),e.port.onmessage=function(e){g&&(clearTimeout(g),g=""),o(0,e.data.val)},M("Connect采用"+l+"方式,设置"+i+"."+k+"=false可恢复老式"+s+v,3)}};r.resume()[u&&"finally"](function(){if(S())if(r[f])C();else{var e,t,n=(t="class "+f+" extends AudioWorkletProcessor{",t+="constructor "+(e=function(e){return e.toString().replace(/^function|DEL_/g,"").replace(/\$RA/g,c)})(function(e){DEL_super(e);var t=this,n=e.processorOptions.bufferSize;t.bufferSize=n,t.buffer=new Float32Array(2*n),t.pos=0,t.port.onmessage=function(e){e.data.kill&&(t.kill=!0,console.log("$RA kill call"))},console.log("$RA .ctor call",e)}),t+="process "+e(function(e,t,n){var r=this,a=r.bufferSize,o=r.buffer,s=r.pos;if((e=(e[0]||[])[0]||[]).length){o.set(e,s);var i=~~((s+=e.length)/a)*a;if(i){this.port.postMessage({val:o.slice(0,i)});var c=o.subarray(i,s);(o=new Float32Array(2*a)).set(c),s=c.length,r.buffer=o}r.pos=s}return!r.kill}),t+='}try{registerProcessor("'+f+'", '+f+')}catch(e){console.error("'+c+'注册失败",e)}',"data:text/javascript;base64,"+btoa(unescape(encodeURIComponent(t))));r[l].addModule(n).then(function(e){S()&&(r[f]=1,C(),g&&_())})[u&&"catch"](function(e){M(l+".addModule失败",1,e),S()&&m()})}})}else m()},I=function(e){e._na=null,e._n&&(e._n.port.postMessage({kill:!0}),e._n.disconnect(),e._n=null)},g=function(e){var t=(e=e||A)==A,n=e.Stream;if(n&&(n._m&&(n._m.disconnect(),n._m=null),n._p&&(n._p.disconnect(),n._p.onaudioprocess=n._p=null),I(n),t)){for(var r=n.getTracks&&n.getTracks()||n.audioTracks||[],a=0;a<r.length;a++){var o=r[a];o.stop&&o.stop()}n.stop&&n.stop()}e.Stream=0};A.SampleData=function(e,t,n,r,a){r||(r={});var o=r.index||0,s=r.offset||0,i=r.frameNext||[];a||(a={});var c=a.frameSize||1;a.frameType&&(c="mp3"==a.frameType?1152:1);for(var f=0,u=o;u<e.length;u++)f+=e[u].length;f=Math.max(0,f-Math.floor(s));var l=t/n;1<l?f=Math.floor(f/l):(l=1,n=t),f+=i.length;for(var p=new Int16Array(f),v=0,u=0;u<i.length;u++)p[v]=i[u],v++;for(var m=e.length;o<m;o++){for(var h=e[o],u=s,d=h.length;u<d;){var g=Math.floor(u),S=Math.ceil(u),_=u-g,C=h[g],y=S<d?h[S]:(e[o+1]||[C])[0]||0;p[v]=C+(y-C)*_,v++,u+=l}s=u-d}i=null;var k=p.length%c;if(0<k){var I=2*(p.length-k);i=new Int16Array(p.buffer.slice(I)),p=new Int16Array(p.buffer.slice(0,I))}return{index:o,offset:s,frameNext:i,sampleRate:n,data:p}},A.PowerLevel=function(e,t){var n=e/t||0;return n<1251?Math.round(n/1250*10):Math.round(Math.min(100,Math.max(0,100*(1+Math.log(n/1e4)/Math.log(10)))))};var M=function(e,t){var n=new Date,r=("0"+n.getMinutes()).substr(-2)+":"+("0"+n.getSeconds()).substr(-2)+"."+("00"+n.getMilliseconds()).substr(-3),a=this&&this.envIn&&this.envCheck&&this.id,o=["["+r+" Recorder"+(a?":"+a:"")+"]"+e],s=arguments,i=y.console||{},c=2,f=i.log;for("number"==typeof t?f=1==t?i.error:3==t?i.warn:f:c=1;c<s.length;c++)o.push(s[c]);u?f&&f("[IsLoser]"+o[0],1<o.length?o:""):f.apply(i,o)},u=!0;try{u=!console.log.apply}catch(e){}A.CLog=M;var r=0;function t(e){this.id=++r,A.Traffic&&A.Traffic();var t={type:"mp3",bitRate:16,sampleRate:16e3,onProcess:h};for(var n in e)t[n]=e[n];this.set=t,this._S=9,this.Sync={O:9,C:9}}A.Sync={O:9,C:9},A.prototype=t.prototype={CLog:M,_streamStore:function(){return this.set.sourceStream?this:A},open:function(e,n){var r=this,t=r._streamStore();e=e||h;var a=function(e,t){t=!!t,r.CLog("录音open失败:"+e+",isUserNotAllow:"+t,1),n&&n(e,t)},o=function(){r.CLog("open ok id:"+r.id),e(),r._SO=0},s=t.Sync,i=++s.O,c=s.C;r._O=r._O_=i,r._SO=r._S;var f=function(){if(c!=s.C||!r._O){var e="open被取消";return i==s.O?r.close():e="open被中断",a(e),!0}},u=r.envCheck({envName:"H5",canProcess:!0});if(u)a("不能录音:"+u);else if(r.set.sourceStream){if(!A.Support())return void a("不支持此浏览器从流中获取录音");g(t),r.Stream=r.set.sourceStream,r.Stream._call={};try{d(t)}catch(e){return void a("从流中打开录音失败:"+e.message)}o()}else{var l=function(e,t){try{y.top.a}catch(e){return void a('无权录音(跨域,请尝试给iframe添加麦克风访问策略,如allow="camera;microphone")')}/Permission|Allow/i.test(e)?a("用户拒绝了录音权限",!0):!1===y.isSecureContext?a("无权录音(需https)"):/Found/i.test(e)?a(t+",无可用麦克风"):a(t)};if(A.IsOpen())o();else if(A.Support()){var p=function(e){(A.Stream=e)._call={},f()||setTimeout(function(){f()||(A.IsOpen()?(d(),o()):a("录音功能无效:无音频流"))},100)},v=function(e){var t=e.name||e.message||e.code+":"+e;r.CLog("请求录音权限错误",1,e),l(t,"无法录音:"+t)},m=A.Scope.getUserMedia({audio:r.set.audioTrackSet||!0},p,v);m&&m.then&&m.then(p)[e&&"catch"](v)}else l("","此浏览器不支持录音")}},close:function(e){e=e||h;var t=this,n=t._streamStore();t._stop();var r=n.Sync;if(t._O=0,t._O_!=r.O)return t.CLog("close被忽略(因为同时open了多个rec,只有最后一个会真正close)",3),void e();r.C++,g(n),t.CLog("close"),e()},mock:function(e,t){var n=this;return n._stop(),n.isMock=1,n.mockEnvInfo=null,n.buffers=[e],n.recSize=e.length,n.srcSampleRate=t,n},envCheck:function(e){var t,n=this.set;return t||(this[n.type+"_envCheck"]?t=this[n.type+"_envCheck"](e,n):n.takeoffEncodeChunk&&(t=n.type+"类型不支持设置takeoffEncodeChunk")),t||""},envStart:function(e,t){var n=this,r=n.set;if(n.isMock=e?1:0,n.mockEnvInfo=e,n.buffers=[],n.recSize=0,n.envInLast=0,n.envInFirst=0,n.envInFix=0,n.envInFixTs=[],r.sampleRate=Math.min(t,r.sampleRate),n.srcSampleRate=t,n.engineCtx=0,n[r.type+"_start"]){var a=n.engineCtx=n[r.type+"_start"](r);a&&(a.pcmDatas=[],a.pcmSize=0)}},envResume:function(){this.envInFixTs=[]},envIn:function(e,t){var a=this,o=a.set,s=a.engineCtx,n=a.srcSampleRate,r=e.length,i=A.PowerLevel(t,r),c=a.buffers,f=c.length;c.push(e);var u=c,l=f,p=Date.now(),v=Math.round(r/n*1e3);a.envInLast=p,1==a.buffers.length&&(a.envInFirst=p-v);var m=a.envInFixTs;m.splice(0,0,{t:p,d:v});for(var h=p,d=0,g=0;g<m.length;g++){var S=m[g];if(3e3<p-S.t){m.length=g;break}h=S.t,d+=S.d}var _=m[1],C=p-h;if(C/3<C-d&&(_&&1e3<C||6<=m.length)){var y=p-_.t-v;if(v/5<y){var k=!o.disableEnvInFix;if(a.CLog("["+p+"]"+(k?"":"")+"补偿"+y+"ms",3),a.envInFix+=y,k){var I=new Int16Array(y*n/1e3);r+=I.length,c.push(I)}}}var M=a.recSize,x=r,b=M+x;if(a.recSize=b,s){var R=A.SampleData(c,n,o.sampleRate,s.chunkInfo);s.chunkInfo=R,b=(M=s.pcmSize)+(x=R.data.length),s.pcmSize=b,c=s.pcmDatas,f=c.length,c.push(R.data),n=R.sampleRate}var L=Math.round(b/n*1e3),w=c.length,T=u.length,z=function(){for(var e=O?0:-x,t=null==c[0],n=f;n<w;n++){var r=c[n];null==r?t=1:(e+=r.length,s&&r.length&&a[o.type+"_encode"](s,r))}if(t&&s)for(n=l,u[0]&&(n=0);n<T;n++)u[n]=null;t&&(e=O?x:0,c[0]=null),s?s.pcmSize+=e:a.recSize+=e},O=o.onProcess(c,i,L,n,f,z);if(!0===O){var D=0;for(g=f;g<w;g++)null==c[g]?D=1:c[g]=new Int16Array(0);D?a.CLog("未进入异步前不能清除buffers",3):s?s.pcmSize-=x:a.recSize-=x}else z()},start:function(){var e=this,t=A.Ctx,n=1;if(e.set.sourceStream?e.Stream||(n=0):A.IsOpen()||(n=0),n)if(e.CLog("开始录音"),e._stop(),e.state=0,e.envStart(null,t.sampleRate),e._SO&&e._SO+1!=e._S)e.CLog("start被中断",3);else{e._SO=0;var r=function(){e.state=1,e.resume()};"suspended"==t.state?(e.CLog("wait ctx resume..."),e.state=3,t.resume().then(function(){e.CLog("ctx resume"),3==e.state&&r()})):r()}else e.CLog("未open",1)},pause:function(){var e=this;e.state&&(e.state=2,e.CLog("pause"),delete e._streamStore().Stream._call[e.id])},resume:function(){var e,n=this;if(n.state){n.state=1,n.CLog("resume"),n.envResume();var t=n._streamStore();t.Stream._call[n.id]=function(e,t){1==n.state&&n.envIn(e,t)},(e=(t||A).Stream)._na&&e._na()}},_stop:function(e){var t=this,n=t.set;t.isMock||t._S++,t.state&&(t.pause(),t.state=0),!e&&t[n.type+"_stop"]&&(t[n.type+"_stop"](t.engineCtx),t.engineCtx=0)},stop:function(n,t,e){var r,a=this,o=a.set;a.CLog("stop "+(a.envInLast?a.envInLast-a.envInFirst+"ms 补"+a.envInFix+"ms":"-"));var s=function(){a._stop(),e&&a.close()},i=function(e){a.CLog("结束录音失败:"+e,1),t&&t(e),s()},c=function(e,t){if(a.CLog("结束录音 编码"+(Date.now()-r)+"ms 音频"+t+"ms/"+e.size+"b"),o.takeoffEncodeChunk)a.CLog("启用takeoffEncodeChunk后stop返回的blob长度为0不提供音频数据",3);else if(e.size<Math.max(100,t/2))return void i("生成的"+o.type+"无效");n&&n(e,t),s()};if(!a.isMock){var f=3==a.state;if(!a.state||f)return void i("未开始录音"+(f?",开始录音前无用户交互导致AudioContext未运行":""));a._stop(!0)}var u=a.recSize;if(u)if(a.buffers[0])if(a[o.type]){if(a.isMock){var l=a.envCheck(a.mockEnvInfo||{envName:"mock",canProcess:!1});if(l)return void i("录音错误:"+l)}var p=a.engineCtx;if(a[o.type+"_complete"]&&p){var v=Math.round(p.pcmSize/o.sampleRate*1e3);return r=Date.now(),void a[o.type+"_complete"](p,function(e){c(e,v)},i)}r=Date.now();var m=A.SampleData(a.buffers,a.srcSampleRate,o.sampleRate);o.sampleRate=m.sampleRate;var h=m.data;v=Math.round(h.length/o.sampleRate*1e3),a.CLog("采样"+u+"->"+h.length+" 花:"+(Date.now()-r)+"ms"),setTimeout(function(){r=Date.now(),a[o.type](h,function(e){c(e,v)},function(e){i(e)})})}else i("未加载"+o.type+"编码器");else i("音频buffers被释放");else i("未采集到录音")}},y.Recorder&&y.Recorder.Destroy(),(y.Recorder=A).LM="2022-03-05 11:53:19",A.TrafficImgUrl="//ia.51.la/go1?id=20469973&pvFlag=1",A.Traffic=function(){var e=A.TrafficImgUrl;if(e){var t=A.Traffic,n=location.href.replace(/#.*/,"");if(0==e.indexOf("//")&&(e=/^https:/i.test(n)?"https:"+e:"http:"+e),!t[n]){t[n]=1;var r=new Image;r.src=e,M("Traffic Analysis Image: Recorder.TrafficImgUrl="+A.TrafficImgUrl)}}}}(window),"function"==typeof define&&define.amd&&define(function(){return Recorder}),"object"==typeof module&&module.exports&&(module.exports=Recorder);
\ No newline at end of file
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>PaddleSpeech Serving-语音实时转写</title>
<link rel="shortcut icon" href="./static/paddle.ico">
<script src="../static/js/jquery-3.2.1.min.js"></script>
<script src="../static/js/recorder/recorder-core.js"></script>
<script src="../static/js/recorder/extensions/lib.fft.js"></script>
<script src="../static/js/recorder/extensions/frequency.histogram.view.js"></script>
<script src="../static/js/recorder/engine/pcm.js"></script>
<script src="../static/js/SoundRecognizer.js"></script>
<link rel="stylesheet" href="../static/css/style.css">
<link rel="stylesheet" href="../static/css/font-awesome.min.css">
</head>
<body>
<div class="asr-content">
<div class="audio-banner">
<div class="weaper">
<div class="text-content">
<p><span class="title">PaddleSpeech Serving简介</span></p>
<p class="con-container">
<span class="con">PaddleSpeech 是基于飞桨 PaddlePaddle 的语音方向的开源模型库,用于语音和音频中的各种关键任务的开发。PaddleSpeech Serving是基于python + fastapi 的语音算法模型的C/S类型后端服务,旨在统一paddle speech下的各语音算子来对外提供后端服务。</span>
</p>
</div>
<div class="img-con">
<img src="../static/image/PaddleSpeech_logo.png" alt="" />
</div>
</div>
</div>
<div class="audio-experience">
<div class="asr-box">
<h2>产品体验</h2>
<div id="client-word-recorder" style="position: relative;">
<div class="pd">
<div style="text-align:center;height:20px;width:100%;
border:0px solid #bcbcbc;color:#000;box-sizing: border-box;display:inline-block"
class="recwave">
</div>
</div>
</div>
<div class="voice-container">
<div class="voice-input">
<span>WebSocket URL:</span>
<input type="text" id="socketUrl" class="websocket-url" value="ws://127.0.0.1:8091/ws/asr"
placeholder="请输入服务器地址,如:ws://127.0.0.1:8091/ws/asr">
<div class="start-voice">
<button type="primary" id="beginBtn" class="voice-btn">
<span class="fa fa-microphone"> 开始识别</span>
</button>
<button type="primary" id="endBtn" class="voice-btn end">
<span class="fa fa-microphone-slash"> 结束识别</span>
</button>
<div id="timeBox" class="time-box flex-display-1">
<span class="total-time">识别中,<i id="timeCount"></i> 秒后自动停止识别</span>
</div>
</div>
</div>
<div class="voice">
<div class="result-text" id="resultPanel">此处显示识别结果</div>
</div>
</div>
</div>
</div>
</div>
<script>
var wenetWs = null
var timeLoop = null
var result = ""
$(document).ready(function () {
$('#beginBtn').on('click', startRecording)
$('#endBtn').on('click', stopRecording)
})
function openWebSocket(url) {
if ("WebSocket" in window) {
wenetWs = new WebSocket(url)
wenetWs.onopen = function () {
console.log("Websocket 连接成功,开始识别")
wenetWs.send(JSON.stringify({
"signal": "start"
}))
}
wenetWs.onmessage = function (_msg) { parseResult(_msg.data) }
wenetWs.onclose = function () {
console.log("WebSocket 连接断开")
}
wenetWs.onerror = function () { console.log("WebSocket 连接失败") }
}
}
function parseResult(data) {
var data = JSON.parse(data)
var result = data.asr_results
console.log(result)
$("#resultPanel").html(result)
}
function TransferUpload(number, blobOrNull, duration, blobRec, isClose) {
if (blobOrNull) {
var blob = blobOrNull
var encTime = blob.encTime
var reader = new FileReader()
reader.onloadend = function () { wenetWs.send(reader.result) }
reader.readAsArrayBuffer(blob)
}
}
function startRecording() {
// Check socket url
var socketUrl = $('#socketUrl').val()
if (!socketUrl.trim()) {
alert('请输入 WebSocket 服务器地址,如:ws://127.0.0.1:8091/ws/asr')
$('#socketUrl').focus()
return
}
// init recorder
SoundRecognizer.init({
soundType: 'pcm',
sampleRate: 16000,
recwaveElm: '.recwave',
translerCallBack: TransferUpload
})
openWebSocket(socketUrl)
// Change button state
$('#beginBtn').hide()
$('#endBtn, #timeBox').addClass('show')
// Start countdown
var seconds = 180
$('#timeCount').text(seconds)
timeLoop = setInterval(function () {
seconds--
$('#timeCount').text(seconds)
if (seconds === 0) {
stopRecording()
}
}, 1000)
}
function stopRecording() {
wenetWs.send(JSON.stringify({ "signal": "end" }))
SoundRecognizer.recordClose()
$('#endBtn').add($('#timeBox')).removeClass('show')
$('#beginBtn').show()
$('#timeCount').text('')
clearInterval(timeLoop)
}
</script>
</body>
</html>
\ No newline at end of file
......@@ -38,8 +38,6 @@ def get_predictor(args, filed='am'):
config.enable_use_gpu(100, 0)
elif args.device == "cpu":
config.disable_gpu()
# This line must be commented for fastspeech2, if not, it will OOM
if model_name != 'fastspeech2':
config.enable_memory_optim()
predictor = inference.create_predictor(config)
return predictor
......
......@@ -70,8 +70,15 @@ def ort_predict(args):
# am warmup
for T in [27, 38, 54]:
data = np.random.randint(1, 266, size=(T, ))
am_sess.run(None, {"text": data})
am_input_feed = {}
if am_name == 'fastspeech2':
phone_ids = np.random.randint(1, 266, size=(T, ))
am_input_feed.update({'text': phone_ids})
elif am_name == 'speedyspeech':
phone_ids = np.random.randint(1, 92, size=(T, ))
tone_ids = np.random.randint(1, 5, size=(T, ))
am_input_feed.update({'phones': phone_ids, 'tones': tone_ids})
am_sess.run(None, input_feed=am_input_feed)
# voc warmup
for T in [227, 308, 544]:
......@@ -81,14 +88,20 @@ def ort_predict(args):
N = 0
T = 0
am_input_feed = {}
for example in test_dataset:
utt_id = example['utt_id']
if am_name == 'fastspeech2':
phone_ids = example["text"]
am_input_feed.update({'text': phone_ids})
elif am_name == 'speedyspeech':
phone_ids = example["phones"]
tone_ids = example["tones"]
am_input_feed.update({'phones': phone_ids, 'tones': tone_ids})
with timer() as t:
mel = am_sess.run(output_names=None, input_feed={'text': phone_ids})
mel = am_sess.run(output_names=None, input_feed=am_input_feed)
mel = mel[0]
wav = voc_sess.run(output_names=None, input_feed={'logmel': mel})
N += len(wav[0])
T += t.elapse
speed = len(wav[0]) / t.elapse
......@@ -110,9 +123,7 @@ def parse_args():
'--am',
type=str,
default='fastspeech2_csmsc',
choices=[
'fastspeech2_csmsc',
],
choices=['fastspeech2_csmsc', 'speedyspeech_csmsc'],
help='Choose acoustic model type of tts task.')
# voc
......
......@@ -68,39 +68,58 @@ def ort_predict(args):
# vocoder
voc_sess = get_sess(args, filed='voc')
# frontend warmup
# Loading model cost 0.5+ seconds
if args.lang == 'zh':
frontend.get_input_ids("你好,欢迎使用飞桨框架进行深度学习研究!", merge_sentences=True)
else:
print("lang should in be 'zh' here!")
# am warmup
for T in [27, 38, 54]:
data = np.random.randint(1, 266, size=(T, ))
am_sess.run(None, {"text": data})
am_input_feed = {}
if am_name == 'fastspeech2':
phone_ids = np.random.randint(1, 266, size=(T, ))
am_input_feed.update({'text': phone_ids})
elif am_name == 'speedyspeech':
phone_ids = np.random.randint(1, 92, size=(T, ))
tone_ids = np.random.randint(1, 5, size=(T, ))
am_input_feed.update({'phones': phone_ids, 'tones': tone_ids})
am_sess.run(None, input_feed=am_input_feed)
# voc warmup
for T in [227, 308, 544]:
data = np.random.rand(T, 80).astype("float32")
voc_sess.run(None, {"logmel": data})
voc_sess.run(None, input_feed={"logmel": data})
print("warm up done!")
# frontend warmup
# Loading model cost 0.5+ seconds
if args.lang == 'zh':
frontend.get_input_ids("你好,欢迎使用飞桨框架进行深度学习研究!", merge_sentences=True)
else:
print("lang should in be 'zh' here!")
N = 0
T = 0
merge_sentences = True
get_tone_ids = False
am_input_feed = {}
if am_name == 'speedyspeech':
get_tone_ids = True
for utt_id, sentence in sentences:
with timer() as t:
if args.lang == 'zh':
input_ids = frontend.get_input_ids(
sentence, merge_sentences=merge_sentences)
sentence,
merge_sentences=merge_sentences,
get_tone_ids=get_tone_ids)
phone_ids = input_ids["phone_ids"]
if get_tone_ids:
tone_ids = input_ids["tone_ids"]
else:
print("lang should in be 'zh' here!")
# merge_sentences=True here, so we only use the first item of phone_ids
phone_ids = phone_ids[0].numpy()
mel = am_sess.run(output_names=None, input_feed={'text': phone_ids})
if am_name == 'fastspeech2':
am_input_feed.update({'text': phone_ids})
elif am_name == 'speedyspeech':
tone_ids = tone_ids[0].numpy()
am_input_feed.update({'phones': phone_ids, 'tones': tone_ids})
mel = am_sess.run(output_names=None, input_feed=am_input_feed)
mel = mel[0]
wav = voc_sess.run(output_names=None, input_feed={'logmel': mel})
......@@ -125,9 +144,7 @@ def parse_args():
'--am',
type=str,
default='fastspeech2_csmsc',
choices=[
'fastspeech2_csmsc',
],
choices=['fastspeech2_csmsc', 'speedyspeech_csmsc'],
help='Choose acoustic model type of tts task.')
parser.add_argument(
"--phones_dict", type=str, default=None, help="phone vocabulary file.")
......
......@@ -68,13 +68,15 @@ def evaluate(args):
# but still not stopping in the end (NOTE by yuantian01 Feb 9 2022)
if am_name == 'tacotron2':
merge_sentences = True
get_tone_ids = False
if am_name == 'speedyspeech':
get_tone_ids = True
N = 0
T = 0
for utt_id, sentence in sentences:
with timer() as t:
get_tone_ids = False
if am_name == 'speedyspeech':
get_tone_ids = True
if args.lang == 'zh':
input_ids = frontend.get_input_ids(
sentence,
......
......@@ -667,8 +667,8 @@ class FastSpeech2(nn.Layer):
use_teacher_forcing(bool, optional): Whether to use teacher forcing.
If true, groundtruth of duration, pitch and energy will be used.
spk_emb(Tensor, optional, optional): peaker embedding vector (spk_embed_dim,). (Default value = None)
spk_id(Tensor, optional(int64), optional): Batch of padded spk ids (1,). (Default value = None)
tone_id(Tensor, optional(int64), optional): Batch of padded tone ids (T,). (Default value = None)
spk_id(Tensor, optional(int64), optional): spk ids (1,). (Default value = None)
tone_id(Tensor, optional(int64), optional): tone ids (T,). (Default value = None)
Returns:
......@@ -751,7 +751,6 @@ class FastSpeech2(nn.Layer):
Returns:
"""
if self.tone_embed_integration_type == "add":
# apply projection and then add to hidden states
......
......@@ -11,17 +11,35 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List
import paddle
from paddle import nn
from paddlespeech.t2s.modules.nets_utils import initialize
from paddlespeech.t2s.modules.positional_encoding import sinusoid_position_encoding
from paddlespeech.t2s.modules.predictor.length_regulator import LengthRegulator
from paddlespeech.t2s.modules.transformer.embedding import ScaledPositionalEncoding
class ResidualBlock(nn.Layer):
def __init__(self, channels, kernel_size, dilation, n=2):
def __init__(self,
channels: int=128,
kernel_size: int=3,
dilation: int=3,
n: int=2):
"""SpeedySpeech encoder module.
Args:
channels (int, optional): Feature size of the residual output(and also the input).
kernel_size (int, optional): Kernel size of the 1D convolution.
dilation (int, optional): Dilation of the 1D convolution.
n (int): Number of blocks.
"""
super().__init__()
total_pad = (dilation * (kernel_size - 1))
begin = total_pad // 2
end = total_pad - begin
# remove padding='same' here, cause onnx don't support dilation + 'same' padding
blocks = [
nn.Sequential(
nn.Conv1D(
......@@ -29,14 +47,20 @@ class ResidualBlock(nn.Layer):
channels,
kernel_size,
dilation=dilation,
padding="same",
data_format="NLC"),
# make sure output T == input T
padding=((0, 0), (0, 0), (begin, end))),
nn.ReLU(),
nn.BatchNorm1D(channels, data_format="NLC"), ) for _ in range(n)
nn.BatchNorm1D(channels), ) for _ in range(n)
]
self.blocks = nn.Sequential(*blocks)
def forward(self, x):
def forward(self, x: paddle.Tensor):
"""Calculate forward propagation.
Args:
x(Tensor): Batch of input sequences (B, hidden_size, Tmax).
Returns:
Tensor: The residual output (B, hidden_size, Tmax).
"""
return x + self.blocks(x)
......@@ -62,7 +86,15 @@ class TextEmbedding(nn.Layer):
tone_vocab_size, tone_embedding_size, tone_padding_idx)
self.concat = concat
def forward(self, text, tone=None):
def forward(self, text: paddle.Tensor, tone: paddle.Tensor=None):
"""Calculate forward propagation.
Args:
text(Tensor(int64)): Batch of padded token ids (B, Tmax).
tones(Tensor, optional(int64)): Batch of padded tone ids (B, Tmax).
Returns:
Tensor: The residual output (B, Tmax, embedding_size).
"""
text_embed = self.text_embedding(text)
if tone is None:
return text_embed
......@@ -75,13 +107,24 @@ class TextEmbedding(nn.Layer):
class SpeedySpeechEncoder(nn.Layer):
"""SpeedySpeech encoder module.
Args:
vocab_size (int): Dimension of the inputs.
tone_size (Optional[int]): Number of tones.
hidden_size (int): Number of encoder hidden units.
kernel_size (int): Kernel size of encoder.
dilations (List[int]): Dilations of encoder.
spk_num (Optional[int]): Number of speakers.
"""
def __init__(self,
vocab_size,
tone_size,
hidden_size,
kernel_size,
dilations,
vocab_size: int,
tone_size: int,
hidden_size: int=128,
kernel_size: int=3,
dilations: List[int]=[1, 3, 9, 27, 1, 3, 9, 27, 1, 1],
spk_num=None):
super().__init__()
self.embedding = TextEmbedding(
vocab_size,
......@@ -109,34 +152,71 @@ class SpeedySpeechEncoder(nn.Layer):
self.postnet1 = nn.Sequential(nn.Linear(hidden_size, hidden_size))
self.postnet2 = nn.Sequential(
nn.ReLU(),
nn.BatchNorm1D(hidden_size, data_format="NLC"),
nn.Linear(hidden_size, hidden_size), )
def forward(self, text, tones, spk_id=None):
nn.BatchNorm1D(hidden_size), )
self.linear = nn.Linear(hidden_size, hidden_size)
def forward(self,
text: paddle.Tensor,
tones: paddle.Tensor,
spk_id: paddle.Tensor=None):
"""Encoder input sequence.
Args:
text(Tensor(int64)): Batch of padded token ids (B, Tmax).
tones(Tensor, optional(int64)): Batch of padded tone ids (B, Tmax).
spk_id(Tnesor, optional(int64)): Batch of speaker ids (B,)
Returns:
Tensor: Output tensor (B, Tmax, hidden_size).
"""
embedding = self.embedding(text, tones)
if self.spk_emb:
embedding += self.spk_emb(spk_id).unsqueeze(1)
embedding = self.prenet(embedding)
x = self.res_blocks(embedding)
x = self.res_blocks(embedding.transpose([0, 2, 1])).transpose([0, 2, 1])
# (B, T, dim)
x = embedding + self.postnet1(x)
x = self.postnet2(x)
x = self.postnet2(x.transpose([0, 2, 1])).transpose([0, 2, 1])
x = self.linear(x)
return x
class DurationPredictor(nn.Layer):
def __init__(self, hidden_size):
def __init__(self, hidden_size: int=128):
super().__init__()
self.layers = nn.Sequential(
ResidualBlock(hidden_size, 4, 1, n=1),
ResidualBlock(hidden_size, 3, 1, n=1),
ResidualBlock(hidden_size, 1, 1, n=1), nn.Linear(hidden_size, 1))
ResidualBlock(hidden_size, 1, 1, n=1), )
self.linear = nn.Linear(hidden_size, 1)
def forward(self, x):
return paddle.squeeze(self.layers(x), -1)
def forward(self, x: paddle.Tensor):
"""Calculate forward propagation.
Args:
x(Tensor): Batch of input sequences (B, Tmax, hidden_size).
Returns:
Tensor: Batch of predicted durations in log domain (B, Tmax).
"""
x = self.layers(x.transpose([0, 2, 1])).transpose([0, 2, 1])
x = self.linear(x)
return paddle.squeeze(x, -1)
class SpeedySpeechDecoder(nn.Layer):
def __init__(self, hidden_size, output_size, kernel_size, dilations):
def __init__(self,
hidden_size: int=128,
output_size: int=80,
kernel_size: int=3,
dilations: List[int]=[
1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 1
]):
"""SpeedySpeech decoder module.
Args:
hidden_size (int): Number of decoder hidden units.
kernel_size (int): Kernel size of decoder.
output_size (int): Dimension of the outputs.
dilations (List[int]): Dilations of decoder.
"""
super().__init__()
res_blocks = [
ResidualBlock(hidden_size, kernel_size, d, n=2) for d in dilations
......@@ -144,14 +224,21 @@ class SpeedySpeechDecoder(nn.Layer):
self.res_blocks = nn.Sequential(*res_blocks)
self.postnet1 = nn.Sequential(nn.Linear(hidden_size, hidden_size))
self.postnet2 = nn.Sequential(
ResidualBlock(hidden_size, kernel_size, 1, n=2),
nn.Linear(hidden_size, output_size))
self.postnet2 = ResidualBlock(hidden_size, kernel_size, 1, n=2)
self.linear = nn.Linear(hidden_size, output_size)
def forward(self, x):
xx = self.res_blocks(x)
"""Decoder input sequence.
Args:
x(Tensor): Input tensor (B, time, hidden_size).
Returns:
Tensor: Output tensor (B, time, output_size).
"""
xx = self.res_blocks(x.transpose([0, 2, 1])).transpose([0, 2, 1])
x = x + self.postnet1(xx)
x = self.postnet2(x)
x = self.postnet2(x.transpose([0, 2, 1])).transpose([0, 2, 1])
x = self.linear(x)
return x
......@@ -159,17 +246,35 @@ class SpeedySpeech(nn.Layer):
def __init__(
self,
vocab_size,
encoder_hidden_size,
encoder_kernel_size,
encoder_dilations,
duration_predictor_hidden_size,
decoder_hidden_size,
decoder_output_size,
decoder_kernel_size,
decoder_dilations,
tone_size=None,
spk_num=None,
init_type: str="xavier_uniform", ):
encoder_hidden_size: int=128,
encoder_kernel_size: int=3,
encoder_dilations: List[int]=[1, 3, 9, 27, 1, 3, 9, 27, 1, 1],
duration_predictor_hidden_size: int=128,
decoder_hidden_size: int=128,
decoder_output_size: int=80,
decoder_kernel_size: int=3,
decoder_dilations: List[
int]=[1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 1],
tone_size: int=None,
spk_num: int=None,
init_type: str="xavier_uniform",
positional_dropout_rate: int=0.1):
"""Initialize SpeedySpeech module.
Args:
vocab_size (int): Dimension of the inputs.
encoder_hidden_size (int): Number of encoder hidden units.
encoder_kernel_size (int): Kernel size of encoder.
encoder_dilations (List[int]): Dilations of encoder.
duration_predictor_hidden_size (int): Number of duration predictor hidden units.
decoder_hidden_size (int): Number of decoder hidden units.
decoder_kernel_size (int): Kernel size of decoder.
decoder_dilations (List[int]): Dilations of decoder.
decoder_output_size (int): Dimension of the outputs.
tone_size (Optional[int]): Number of tones.
spk_num (Optional[int]): Number of speakers.
init_type (str): How to initialize transformer parameters.
"""
super().__init__()
# initialize parameters
......@@ -181,6 +286,8 @@ class SpeedySpeech(nn.Layer):
duration_predictor = DurationPredictor(duration_predictor_hidden_size)
decoder = SpeedySpeechDecoder(decoder_hidden_size, decoder_output_size,
decoder_kernel_size, decoder_dilations)
self.position_enc = ScaledPositionalEncoding(encoder_hidden_size,
positional_dropout_rate)
self.encoder = encoder
self.duration_predictor = duration_predictor
......@@ -190,7 +297,22 @@ class SpeedySpeech(nn.Layer):
nn.initializer.set_global_initializer(None)
def forward(self, text, tones, durations, spk_id: paddle.Tensor=None):
def forward(self,
text: paddle.Tensor,
tones: paddle.Tensor,
durations: paddle.Tensor,
spk_id: paddle.Tensor=None):
"""Calculate forward propagation.
Args:
text(Tensor(int64)): Batch of padded token ids (B, Tmax).
durations(Tensor(int64)): Batch of padded durations (B, Tmax).
tones(Tensor, optional(int64)): Batch of padded tone ids (B, Tmax).
spk_id(Tnesor, optional(int64)): Batch of speaker ids (B,)
Returns:
Tensor: Output tensor (B, T_frames, decoder_output_size).
Tensor: Predicted durations (B, Tmax).
"""
# input of embedding must be int64
text = paddle.cast(text, 'int64')
tones = paddle.cast(tones, 'int64')
......@@ -198,23 +320,30 @@ class SpeedySpeech(nn.Layer):
spk_id = paddle.cast(spk_id, 'int64')
durations = paddle.cast(durations, 'int64')
encodings = self.encoder(text, tones, spk_id)
pred_durations = self.duration_predictor(encodings.detach())
# expand encodings
durations_to_expand = durations
encodings = self.length_regulator(encodings, durations_to_expand)
encodings = self.position_enc(encodings)
# decode
# remove positional encoding here
_, t_dec, feature_size = encodings.shape
encodings += sinusoid_position_encoding(t_dec, feature_size)
decoded = self.decoder(encodings)
return decoded, pred_durations
def inference(self, text, tones=None, durations=None, spk_id=None):
# text: [T]
# tones: [T]
def inference(self,
text: paddle.Tensor,
tones: paddle.Tensor=None,
durations: paddle.Tensor=None,
spk_id: paddle.Tensor=None):
"""Generate the sequence of features given the sequences of characters.
Args:
text(Tensor(int64)): Input sequence of characters (T,).
tones(Tensor, optional(int64)): Batch of padded tone ids (T, ).
durations(Tensor, optional (int64)): Groundtruth of duration (T,).
spk_id(Tensor, optional(int64), optional): spk ids (1,). (Default value = None)
Returns:
Tensor: logmel (T, decoder_output_size).
"""
# input of embedding must be int64
text = paddle.cast(text, 'int64')
text = text.unsqueeze(0)
......@@ -233,10 +362,7 @@ class SpeedySpeech(nn.Layer):
durations_to_expand = durations
encodings = self.length_regulator(
encodings, durations_to_expand, is_inference=True)
shape = paddle.shape(encodings)
t_dec, feature_size = shape[1], shape[2]
encodings += sinusoid_position_encoding(t_dec, feature_size)
encodings = self.position_enc(encodings)
decoded = self.decoder(encodings)
return decoded[0]
......
......@@ -86,7 +86,7 @@ class LengthRegulator(nn.Layer):
M[:, i] = m - init
init = m
M = paddle.reshape(M, shape=[t_dec_1, batch_size, t_enc])
M = M[1:, :, :]
M = M[1:t_dec_1, :, :]
M = paddle.transpose(M, (1, 0, 2))
encodings = paddle.matmul(M, encodings)
return encodings
......
......@@ -30,7 +30,7 @@ class WaveNetResidualBlock(nn.Layer):
Args:
kernel_size (int, optional): Kernel size of the 1D convolution, by default 3
residual_channels (int, optional): Feature size of the resiaudl output(and also the input), by default 64
residual_channels (int, optional): Feature size of the residual output(and also the input), by default 64
gate_channels (int, optional): Output feature size of the 1D convolution, by default 128
skip_channels (int, optional): Feature size of the skip output, by default 64
aux_channels (int, optional): Feature size of the auxiliary input (e.g. spectrogram), by default 80
......
......@@ -347,7 +347,7 @@ class TransformerEncoder(BaseEncoder):
encoder_type="transformer")
def forward(self, xs, masks):
"""Encode input sequence.
"""Encoder input sequence.
Args:
xs(Tensor): Input tensor (#batch, time, idim).
......@@ -355,7 +355,7 @@ class TransformerEncoder(BaseEncoder):
Returns:
Tensor: Output tensor (#batch, time, attention_dim).
Tensor:Mask tensor (#batch, 1, time).
Tensor: Mask tensor (#batch, 1, time).
"""
xs = self.embed(xs)
xs, masks = self.encoders(xs, masks)
......
......@@ -38,10 +38,10 @@ def compute_dataset_embedding(data_loader, model, mean_var_norm_emb, config,
"""compute the dataset embeddings
Args:
data_loader (_type_): _description_
model (_type_): _description_
mean_var_norm_emb (_type_): _description_
config (_type_): _description_
data_loader (paddle.io.Dataloader): the dataset loader to be compute the embedding
model (paddle.nn.Layer): the speaker verification model
mean_var_norm_emb : compute the embedding mean and std norm
config (yacs.config.CfgNode): the yaml config
"""
logger.info(
f'Computing embeddings on {data_loader.dataset.csv_path} dataset')
......@@ -65,6 +65,17 @@ def compute_dataset_embedding(data_loader, model, mean_var_norm_emb, config,
def compute_verification_scores(id2embedding, train_cohort, config):
"""Compute the verification trial scores
Args:
id2embedding (dict): the utterance embedding
train_cohort (paddle.tensor): the cohort dataset embedding
config (yacs.config.CfgNode): the yaml config
Returns:
the scores and the trial labels,
1 refers the target and 0 refers the nontarget in labels
"""
labels = []
enroll_ids = []
test_ids = []
......@@ -119,20 +130,32 @@ def compute_verification_scores(id2embedding, train_cohort, config):
def main(args, config):
"""The main process for test the speaker verification model
Args:
args (argparse.Namespace): the command line args namespace
config (yacs.config.CfgNode): the yaml config
"""
# stage0: set the training device, cpu or gpu
# if set the gpu, paddlespeech will select a gpu according the env CUDA_VISIBLE_DEVICES
paddle.set_device(args.device)
# set the random seed, it is a must for multiprocess training
# set the random seed, it is the necessary measures for multiprocess training
seed_everything(config.seed)
# stage1: build the dnn backbone model network
# we will extract the audio embedding from the backbone model
ecapa_tdnn = EcapaTdnn(**config.model)
# stage2: build the speaker verification eval instance with backbone model
# because the checkpoint dict name has the SpeakerIdetification prefix
# so we need to create the SpeakerIdetification instance
# but we acutally use the backbone model to extact the audio embedding
model = SpeakerIdetification(
backbone=ecapa_tdnn, num_class=config.num_speakers)
# stage3: load the pre-trained model
# we get the last model from the epoch and save_interval
# generally, we get the last model from the epoch
args.load_checkpoint = os.path.abspath(
os.path.expanduser(args.load_checkpoint))
......@@ -143,7 +166,8 @@ def main(args, config):
logger.info(f'Checkpoint loaded from {args.load_checkpoint}')
# stage4: construct the enroll and test dataloader
# Now, wo think the enroll dataset is in the {args.data_dir}/vox/csv/enroll.csv,
# and the test dataset is in the {args.data_dir}/vox/csv/test.csv
enroll_dataset = CSVDataset(
os.path.join(args.data_dir, "vox/csv/enroll.csv"),
feat_type='melspectrogram',
......@@ -152,14 +176,14 @@ def main(args, config):
window_size=config.window_size,
hop_length=config.hop_size)
enroll_sampler = BatchSampler(
enroll_dataset, batch_size=config.batch_size,
shuffle=False) # Shuffle to make embedding normalization more robust.
enroll_dataset, batch_size=config.batch_size, shuffle=False)
enroll_loader = DataLoader(enroll_dataset,
batch_sampler=enroll_sampler,
collate_fn=lambda x: batch_feature_normalize(
x, mean_norm=True, std_norm=False),
num_workers=config.num_workers,
return_list=True,)
test_dataset = CSVDataset(
os.path.join(args.data_dir, "vox/csv/test.csv"),
feat_type='melspectrogram',
......@@ -167,7 +191,6 @@ def main(args, config):
n_mels=config.n_mels,
window_size=config.window_size,
hop_length=config.hop_size)
test_sampler = BatchSampler(
test_dataset, batch_size=config.batch_size, shuffle=False)
test_loader = DataLoader(test_dataset,
......@@ -180,16 +203,17 @@ def main(args, config):
model.eval()
# stage6: global embedding norm to imporve the performance
# and we create the InputNormalization instance to process the embedding mean and std norm
logger.info(f"global embedding norm: {config.global_embedding_norm}")
# stage7: Compute embeddings of audios in enrol and test dataset from model.
if config.global_embedding_norm:
mean_var_norm_emb = InputNormalization(
norm_type="global",
mean_norm=config.embedding_mean_norm,
std_norm=config.embedding_std_norm)
# stage 7: score norm need the imposters dataset
# we select the train dataset as the idea imposters dataset
# and we select the config.n_train_snts utterance to as the final imposters dataset
if "score_norm" in config:
logger.info(f"we will do score norm: {config.score_norm}")
train_dataset = CSVDataset(
......@@ -209,6 +233,7 @@ def main(args, config):
num_workers=config.num_workers,
return_list=True,)
# stage 8: Compute embeddings of audios in enrol and test dataset from model.
id2embedding = {}
# Run multi times to make embedding normalization more stable.
logger.info("First loop for enroll and test dataset")
......@@ -225,7 +250,7 @@ def main(args, config):
mean_var_norm_emb.save(
os.path.join(args.load_checkpoint, "mean_var_norm_emb"))
# stage 8: Compute cosine scores.
# stage 9: Compute cosine scores.
train_cohort = None
if "score_norm" in config:
train_embeddings = {}
......@@ -234,11 +259,11 @@ def main(args, config):
train_embeddings)
train_cohort = paddle.stack(list(train_embeddings.values()))
# compute the scores
# stage 10: compute the scores
scores, labels = compute_verification_scores(id2embedding, train_cohort,
config)
# compute the EER and threshold
# stage 11: compute the EER and threshold
scores = paddle.to_tensor(scores)
EER, threshold = compute_eer(np.asarray(labels), scores.numpy())
logger.info(
......
......@@ -42,6 +42,12 @@ logger = Log(__name__).getlog()
def main(args, config):
"""The main process for test the speaker verification model
Args:
args (argparse.Namespace): the command line args namespace
config (yacs.config.CfgNode): the yaml config
"""
# stage0: set the training device, cpu or gpu
paddle.set_device(args.device)
......@@ -49,11 +55,11 @@ def main(args, config):
paddle.distributed.init_parallel_env()
nranks = paddle.distributed.get_world_size()
local_rank = paddle.distributed.get_rank()
# set the random seed, it is a must for multiprocess training
# set the random seed, it is the necessary measures for multiprocess training
seed_everything(config.seed)
# stage2: data prepare, such vox1 and vox2 data, and augment noise data and pipline
# note: some cmd must do in rank==0, so wo will refactor the data prepare code
# note: some operations must be done in rank==0
train_dataset = CSVDataset(
csv_path=os.path.join(args.data_dir, "vox/csv/train.csv"),
label2id_path=os.path.join(args.data_dir, "vox/meta/label2id.txt"))
......@@ -61,12 +67,14 @@ def main(args, config):
csv_path=os.path.join(args.data_dir, "vox/csv/dev.csv"),
label2id_path=os.path.join(args.data_dir, "vox/meta/label2id.txt"))
# we will build the augment pipeline process list
if config.augment:
augment_pipeline = build_augment_pipeline(target_dir=args.data_dir)
else:
augment_pipeline = []
# stage3: build the dnn backbone model network
# in speaker verification period, we use the backbone mode to extract the audio embedding
ecapa_tdnn = EcapaTdnn(**config.model)
# stage4: build the speaker verification train instance with backbone model
......@@ -77,13 +85,15 @@ def main(args, config):
# 140000 is single gpu steps
# so, in multi-gpu mode, wo reduce the step_size to 140000//nranks to enable CyclicLRScheduler
lr_schedule = CyclicLRScheduler(
base_lr=config.learning_rate, max_lr=1e-3, step_size=140000 // nranks)
base_lr=config.learning_rate,
max_lr=config.max_lr,
step_size=config.step_size // nranks)
optimizer = paddle.optimizer.AdamW(
learning_rate=lr_schedule, parameters=model.parameters())
# stage6: build the loss function, we now only support LogSoftmaxWrapper
criterion = LogSoftmaxWrapper(
loss_fn=AdditiveAngularMargin(margin=0.2, scale=30))
loss_fn=AdditiveAngularMargin(margin=config.margin, scale=config.scale))
# stage7: confirm training start epoch
# if pre-trained model exists, start epoch confirmed by the pre-trained model
......@@ -225,7 +235,7 @@ def main(args, config):
print_msg += ' avg_train_cost: {:.5f} sec,'.format(
train_run_cost / config.log_interval)
print_msg += ' lr={:.4E} step/sec={:.2f} ips:{:.5f}| ETA {}'.format(
print_msg += ' lr={:.4E} step/sec={:.2f} ips={:.5f}| ETA {}'.format(
lr, timer.timing, timer.ips, timer.eta)
logger.info(print_msg)
......
......@@ -57,14 +57,14 @@ class InputNormalization:
lengths (paddle.Tensor): A batch of tensors containing the relative length of each
sentence (e.g, [0.7, 0.9, 1.0]). It is used to avoid
computing stats on zero-padded steps.
spk_ids (_type_, optional): tensor containing the ids of each speaker (e.g, [0 10 6]).
spk_ids (paddle.Tensor, optional): tensor containing the ids of each speaker (e.g, [0 10 6]).
It is used to perform per-speaker normalization when
norm_type='speaker'. Defaults to paddle.to_tensor([], dtype="float32").
Returns:
paddle.Tensor: The normalized feature or embedding
"""
N_batches = x.shape[0]
# print(f"x shape: {x.shape[1]}")
current_means = []
current_stds = []
......@@ -75,6 +75,9 @@ class InputNormalization:
actual_size = paddle.round(lengths[snt_id] *
x.shape[1]).astype("int32")
# computing actual time data statistics
# we extract the snt_id embedding from the x
# and the target paddle.Tensor will reduce an 0-axis
# so we need unsqueeze operation to recover the all axis
current_mean, current_std = self._compute_current_stats(
x[snt_id, 0:actual_size, ...].unsqueeze(0))
current_means.append(current_mean)
......
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册