Commit 1a3c1f75 authored by TCChenLong

add tipc for models

Parent 0173b12c

.ipynb_checkpoints/**
*.ipynb
nohup.out
__pycache__/
*.wav
*.m4a
obsolete/**
repos:
- repo: local
  hooks:
  - id: yapf
    name: yapf
    entry: yapf
    language: system
    args: [-i, --style .style.yapf]
    files: \.py$
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: a11d9314b22d8f8c7556443875b731ef05965464
  hooks:
  - id: check-merge-conflict
  - id: check-symlinks
  - id: end-of-file-fixer
  - id: trailing-whitespace
  - id: detect-private-key
  - id: check-symlinks
  - id: check-added-large-files
- repo: https://github.com/pycqa/isort
  rev: 5.8.0
  hooks:
  - id: isort
    name: isort (python)
  - id: isort
    name: isort (cython)
    types: [cython]
  - id: isort
    name: isort (pyi)
    types: [pyi]
- repo: local
  hooks:
  - id: flake8
    name: flake8
    entry: flake8
    language: system
    args:
    - --count
    - --select=E9,F63,F7,F82
    - --show-source
    - --statistics
    files: \.py$
[style]
based_on_style = pep8
column_limit = 80
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# PaddleAudio: The audio library for PaddlePaddle
## Introduction
PaddleAudio is an audio toolkit that speeds up your audio research and development loop in PaddlePaddle. It currently provides a collection of audio datasets, feature-extraction functions, audio transforms, and state-of-the-art pre-trained models for sound tagging/classification and anomalous sound detection. More models and features are on the roadmap.
## Features
- Spectrogram and related features are compatible with librosa.
- State-of-the-art models for sound tagging on Audioset, sound classification on ESC-50, and more to come.
- Ready-to-use audio embeddings with a single line of code; sound embeddings are available now, with more on the roadmap.
- Data-loading support for common open-source audio datasets in multiple languages, including English, Mandarin and so on.
## Install
```
git clone https://github.com/PaddlePaddle/models
cd models/PaddleAudio
pip install .
```
## Quick start
### Audio loading and feature extraction
``` python
import paddleaudio
audio_file = 'test.flac'
wav, sr = paddleaudio.load(audio_file, sr=16000)
mel_feature = paddleaudio.melspectrogram(wav,
                                         sr=sr,
                                         window_size=320,
                                         hop_length=160,
                                         n_mels=80)
```
### Speech recognition using wav2vec 2.0
``` python
import paddle
import paddleaudio
from paddleaudio.models.wav2vec2 import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

model = Wav2Vec2ForCTC('wav2vec2-base-960h', pretrained=True)
tokenizer = Wav2Vec2Tokenizer()

# Load audio and normalize
wav, _ = paddleaudio.load('your_audio.wav', sr=16000, normal=True, norm_type='gaussian')

with paddle.no_grad():
    x = paddle.to_tensor(wav)
    logits = model(x.unsqueeze(0))
    # Get the token index prediction
    idx = paddle.argmax(logits, -1)
    # Decode prediction to text
    text = tokenizer.decode(idx[0])

print(text)
```
### Examples
We provide a set of examples to help you get started with PaddleAudio quickly.
- [Wav2vec 2.0 for speech recognition](./examples/wav2vec2)
- [PANNs: acoustic scene and events analysis using pre-trained models](./examples/panns)
- [Environmental sound classification on the ESC-50 dataset](./examples/sound_classification)
- [Training an audio-tagging network on Audioset](./examples/audioset_training)
Please refer to [example directory](./examples) for more details.
# Training Audio-tagging networks using PaddleAudio
In this example, we showcase how to train a typical CNN using PaddleAudio.
Different from [PANNS](https://github.com/qiuqiangkong/audioset_tagging_cnn)\[1\], which uses a customized CNN14, here we use [resnet50](https://arxiv.org/abs/1512.03385v1)\[2\], a network that is more commonly used in the deep-learning community and has far fewer parameters than the networks used in PANNS. We achieve a similar [Audioset](https://research.google.com/audioset/) mAP on our version of the balanced evaluation dataset.
## Introduction
Audio tagging is the task of generating tags for an input audio file. It is usually tackled as a multi-label classification problem. Different from the commonly-seen multi-class classification, which is a single-label problem of categorizing instances into precisely one of more than two classes\[3\], multi-label classification categorizes instances into one or more classes.
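As a minimal illustration of the multi-label setup, the target is a multi-hot vector and each class gets an independent sigmoid/BCE term (the class indices below are only illustrative; the real mapping comes from the Audioset label files):
```
import numpy as np
import paddle
import paddle.nn.functional as F

num_classes = 527
# A clip carrying two tags at once, e.g. speech with music in the background.
# (Indices are illustrative only.)
y = np.zeros((1, num_classes), dtype='float32')
y[0, [0, 137]] = 1.0

logits = paddle.randn([1, num_classes])  # raw, unnormalized model outputs
loss = F.binary_cross_entropy_with_logits(logits, paddle.to_tensor(y))
print(float(loss))
```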
To solve audio tagging, we can borrow ideas from the image-recognition community. For example, we can use residual networks trained on the ImageNet dataset but change the classification head from 1000 classes to the 527 classes of Audioset. We can also use a BCE loss for multi-label classification, avoiding the label competition introduced by softmax activation with cross-entropy loss (since we allow multiple labels or tags for the same audio instance). However, as mentioned in previous work\[1,4,5\], Audioset is weakly labelled (there is no time information for the exact location of the labels) and highly imbalanced (the training set is largely dominated by speech and music), so training is highly prone to over-fitting and we have to adapt the network and training strategy accordingly, as follows.
- Use of extra front-end feature extraction, i.e., converting the audio from waveform to mel-spectrogram. In image classification no such extra feature extraction is necessary (e.g., there is no need to convert images to the frequency domain).
- Use of weight averaging to reduce variance and improve generalization. This is necessary to reduce over-fitting. (In our example, weight averaging improved mAP from 0.375 to 0.401.)
- Use of resnet50 with 1-channel input, plus 4 dropout layers, one after each of the 4 convolution blocks of the residual network. This is motivated by PANNS.
- Use of mixup training. We set the mixup\[4\] gamma to 0.5 and verify that mixup training is quite useful in this scenario (see the sketch after this list).
- Use of pretrained weights from the ImageNet classification task. This approach is sometimes used in audio classification; in our example, it accelerates the training process in the very first epoch.
- Use of a learning-rate warmup heuristic. Learning-rate warmup is commonly used when training CNNs and transformers. We found it stabilizes training and improves the final mAP.
- Use of balanced sampling to make sure each class of Audioset is treated evenly.
- Use of random cropping, spectrogram permutation, and time-frequency masking for training augmentation.
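The sketch below illustrates the mixup recipe: mix a batch with a shuffled copy of itself and combine the BCE loss with the same mixing coefficient. The helper names here are hypothetical stand-ins for `mixup_data`/`MixUpLoss` in utils.py (not part of this diff), so treat it as a sketch of the idea rather than the exact implementation:
```
import numpy as np
import paddle
import paddle.nn.functional as F

def mixup_batch(x, y, alpha=1.0):
    """Standard mixup: blend each sample with a randomly paired one."""
    lam = float(np.random.beta(alpha, alpha)) if alpha > 0 else 1.0
    index = paddle.randperm(x.shape[0])
    mixed_x = lam * x + (1.0 - lam) * paddle.index_select(x, index)
    # Keep both label sets so the loss can be mixed with the same lambda.
    return mixed_x, (y, paddle.index_select(y, index), lam)

def mixup_bce_loss(logits, mixed_targets):
    y_a, y_b, lam = mixed_targets
    bce = F.binary_cross_entropy_with_logits
    return lam * bce(logits, y_a) + (1.0 - lam) * bce(logits, y_b)
```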
With the above strategies, we achieve results on par with the state of the art, while our network is much smaller in terms of FLOPs and number of parameters (see the following table). When saved to disk, the weights of this model take only 166 MiB, in contrast to 466 MiB for CNN14.
| Model | FLOPs | Params |
| :------------- | :----------: | -----------: |
| CNN14* | 1,658,961,018 | 80,769,871 |
| Resnet50(This example) | 1,327,513,088 | 28,831,567|
(* Note: we use ```paddle.flops()``` to calculate FLOPs and parameters, which gives slightly different results from the original paper.)
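A minimal sketch of how such numbers can be obtained is shown below; the input shape ```[1, 1, 480, 128]``` is an assumption matching the 480-frame crop and 128 mel bins used in this example:
```
import paddle
from model import resnet50  # the 1-channel ResNet50 defined in this example

model = resnet50(pretrained=False, num_classes=527, dropout=0.2)
# Count FLOPs for a single 480x128 mel-spectrogram input;
# pass print_detail=True for a per-layer breakdown.
flops = paddle.flops(model, [1, 1, 480, 128], print_detail=False)
print('FLOPs:', flops)
```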
## Features
The feature we use in this example is the mel-spectrogram, similar to that of [PANNS](https://github.com/qiuqiangkong/audioset_tagging_cnn). The feature parameters are listed in [config.yaml](config.yaml) and described below:
```
sample_rate: 32000
window_size: 1024
hop_size: 640
mel_bins: 128
fmin: 50
fmax: 16000
```
We provide a script, [wav2mel](./wav2mel.py), to preprocess audio files into mel-spectrograms and store them as multiple h5 files. You can use it for preprocessing if you have already downloaded Audioset.
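A minimal sketch of that preprocessing, assuming the ```paddleaudio.load```/```paddleaudio.melspectrogram``` API shown in the quick start and the feature parameters above (the real wav2mel.py, paths, and key naming may differ):
```
import glob
import h5py
import paddleaudio

# Hypothetical paths; point these at your own Audioset download.
wav_files = glob.glob('./audioset/balanced_train_wav/*.wav')

with h5py.File('./audioset/balanced_train.h5', 'w') as h5:
    for path in wav_files:
        wav, sr = paddleaudio.load(path, sr=32000)
        mel = paddleaudio.melspectrogram(wav,
                                         sr=sr,
                                         window_size=1024,
                                         hop_length=640,
                                         n_mels=128)
        # Key each feature by its youtube id so labels can be looked up later.
        ytid = path.split('/')[-1][:11]
        h5.create_dataset(ytid, data=mel)
```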
## Network
Since the input audio feature is a one-channel spectrogram, we modify resnet50 to accept 1-channel input by setting conv1 as
```
self.conv1 = nn.Conv2D(1, self.inplanes, kernel_size=7, stride=2, padding=3, bias_attr=False)
```
As mentioned, we add a dropout layer after each convolution block in resnet50.
We also add a dropout layer before the last fully-connected layer and set the dropout rate to ```0.2```.
The network is defined in [model.py](./model.py)
## Training
Everything about data, features, training controls, etc., can be found in the [config file](./assets/config.yaml).
In this section we describe the training data and the steps to run the training.
### Training data
The dataset used for both training and evaluation is [Audioset](https://research.google.com/audioset/). We manually download the video files from YouTube according to the youtube-ids listed in the dataset and convert the audio to 1-channel wav format with a 32K sample rate. We then extract the mel-spectrogram features as described above and store the features as numpy arrays in separate h5 files. Each h5 file contains features extracted from 10,000 audio files.
For this experiment we successfully downloaded ```1714174``` valid files for the unbalanced segment, ```16906``` for the balanced training segment, and ```17713``` for the balanced evaluation segment. The data statistics are summarized in the following table:
| | unbalanced | balanced train |Evaluation |
| :------------- | :----------: | :-----------: |-----------: |
| [Original](https://research.google.com/audioset/download.html) | 2,042,985 | 22,176 | 20,383 |
| [PANNS](https://arxiv.org/pdf/1912.10211.pdf) | 1,913,637 | 20,550 |18,887 |
| This example | 1,714,174 | 16,906 |17,713 |
Our version of the dataset contains fewer audio files than that of PANNs because videos gradually become private or are simply deleted by their authors. We use all of the audio files from the balanced and unbalanced segments for training and the evaluation segment for testing. This gives us 1,714,174 training files (unevenly) distributed across 527 labels. The label information can be found at [this location](https://research.google.com/audioset/ontology/index.html) and in the paper\[7\].
### Run the training
Set all necessary path and training configurations in the file [config.yaml](./config.yaml), then run
```
python train.py --device <device_number>
```
for single-GPU training. Training one epoch with the balanced-sampling strategy takes about 3 hours.
To restore from a checkpoint, run
```
python train.py --device <device_number> --restore <epoch_num>
```
For multi-gpu training, run
```
python -m paddle.distributed.launch --selected_gpus='0,1,2,3' ./train.py --distributed=1
```
### Training loss function
We use the mixup loss described in \[4\]. It works better than plain binary cross-entropy loss on the multi-label Audioset tagging problem.
### Training log
We use [VisualDL](https://github.com/PaddlePaddle/VisualDL.git) to record training loss and evaluation metrics.
![train_loss.png](./assets/train_loss.png)
## Evaluation
We evaluate audio-tagging performance using the same metrics as PANNS, namely mAP, AUC, and d-prime.
Since our version of the evaluation dataset differs from that of PANNs, we re-evaluate PANNS using their code and pre-trained weights. For TAL Net\[5\] and DeepRes\[6\], we directly use the results from the original papers.
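For reference, mAP and AUC come from scikit-learn, and d-prime is derived from AUC via the inverse normal CDF (the same convention as PANNS). The repository's own ```compute_dprime``` lives in utils.py, which is not shown in this diff, so the sketch below is just the standard formula:
```
import numpy as np
from scipy import stats
from sklearn.metrics import average_precision_score, roc_auc_score

def tagging_metrics(labels, preds):
    """labels, preds: arrays of shape [num_clips, num_classes]."""
    mAP = np.mean(average_precision_score(labels, preds, average=None))
    auc = np.mean(roc_auc_score(labels, preds, average=None))
    d_prime = stats.norm.ppf(auc) * np.sqrt(2)  # d' computed from AUC
    return mAP, auc, d_prime
```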
To get the statistics of our pre-trained model in this example, run
```
python evaluation.py
```
| Model |mAP |AUC |d-prime|
| :------------- | :----------: |:-----------: |-----------: |
| TAL Net \[5\]* | 0.362| 0.965 |2.56|
| DeepRes \[6\]* | 0.392 | 0.971|2.68|
| PANNS \[1\] | 0.420 ** | 0.970|2.66|
| This example | 0.416 | 0.968 |2.62|
(* indicates a different evaluation set from ours; ** stats differ from the paper because we re-evaluated on our version of the dataset)
## Inference
You can do inference by passing an input audio file to [inference.py](./inference.py)
```
python inference.py --wav_file <path-to-your-wav-file> --top_k 5
```
which will give you a result like this:
```
labels prob
------------
Speech: 0.744
Cat: 0.721
Meow: 0.681
Domestic animal: 0.627
Animal: 0.488
```
## Reference
- \[1\] Kong, Qiuqiang, et al. “PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition.” IEEE Transactions on Audio, Speech, and Language Processing, vol. 28, 2020, pp. 2880–2894.
- \[2\] He, Kaiming, et al. “Deep Residual Learning for Image Recognition.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- \[3\] https://en.wikipedia.org/wiki/Multi-label_classification
- \[4\] Zhang, Hongyi, et al. “Mixup: Beyond Empirical Risk Minimization.” International Conference on Learning Representations, 2017.
- \[5\] Kong, Qiuqiang, et al. “Audio Set Classification with Attention Model: A Probabilistic Perspective.” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 316–320.
- \[6\] Ford, Logan, et al. “A Deep Residual Network for Large-Scale Acoustic Scene Analysis.” Interspeech 2019, 2019, pp. 2568–2572.
- \[7\] Gemmeke, Jort F., et al. “Audio Set: An Ontology and Human-Labeled Dataset for Audio Events.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 776–780.
/m/09x0r
/m/05zppz
/m/02zsn
/m/0ytgt
/m/01h8n0
/m/02qldy
/m/0261r1
/m/0brhx
/m/07p6fty
/m/07q4ntr
/m/07rwj3x
/m/07sr1lc
/m/04gy_2
/t/dd00135
/m/03qc9zr
/m/02rtxlg
/m/01j3sz
/t/dd00001
/m/07r660_
/m/07s04w4
/m/07sq110
/m/07rgt08
/m/0463cq4
/t/dd00002
/m/07qz6j3
/m/07qw_06
/m/07plz5l
/m/015lz1
/m/0l14jd
/m/01swy6
/m/02bk07
/m/01c194
/t/dd00003
/t/dd00004
/t/dd00005
/t/dd00006
/m/06bxc
/m/02fxyj
/m/07s2xch
/m/07r4k75
/m/01w250
/m/0lyf6
/m/07mzm6
/m/01d3sd
/m/07s0dtb
/m/07pyy8b
/m/07q0yl5
/m/01b_21
/m/0dl9sf8
/m/01hsr_
/m/07ppn3j
/m/06h7j
/m/07qv_x_
/m/07pbtc8
/m/03cczk
/m/07pdhp0
/m/0939n_
/m/01g90h
/m/03q5_w
/m/02p3nc
/m/02_nn
/m/0k65p
/m/025_jnm
/m/0l15bq
/m/01jg02
/m/01jg1z
/m/053hz1
/m/028ght
/m/07rkbfh
/m/03qtwd
/m/07qfr4h
/t/dd00013
/m/0jbk
/m/068hy
/m/0bt9lr
/m/05tny_
/m/07r_k2n
/m/07qf0zm
/m/07rc7d9
/m/0ghcn6
/t/dd00136
/m/01yrx
/m/02yds9
/m/07qrkrw
/m/07rjwbb
/m/07r81j2
/m/0ch8v
/m/03k3r
/m/07rv9rh
/m/07q5rw0
/m/01xq0k1
/m/07rpkh9
/m/0239kh
/m/068zj
/t/dd00018
/m/03fwl
/m/07q0h5t
/m/07bgp
/m/025rv6n
/m/09b5t
/m/07st89h
/m/07qn5dc
/m/01rd7k
/m/07svc2k
/m/09ddx
/m/07qdb04
/m/0dbvp
/m/07qwf61
/m/01280g
/m/0cdnk
/m/04cvmfc
/m/015p6
/m/020bb7
/m/07pggtn
/m/07sx8x_
/m/0h0rv
/m/07r_25d
/m/04s8yn
/m/07r5c2p
/m/09d5_
/m/07r_80w
/m/05_wcq
/m/01z5f
/m/06hps
/m/04rmv
/m/07r4gkf
/m/03vt0
/m/09xqv
/m/09f96
/m/0h2mp
/m/07pjwq1
/m/01h3n
/m/09ld4
/m/07st88b
/m/078jl
/m/07qn4z3
/m/032n05
/m/04rlf
/m/04szw
/m/0fx80y
/m/0342h
/m/02sgy
/m/018vs
/m/042v_gx
/m/06w87
/m/01glhc
/m/07s0s5r
/m/018j2
/m/0jtg0
/m/04rzd
/m/01bns_
/m/07xzm
/m/05148p4
/m/05r5c
/m/01s0ps
/m/013y1f
/m/03xq_f
/m/03gvt
/m/0l14qv
/m/01v1d8
/m/03q5t
/m/0l14md
/m/02hnl
/m/0cfdd
/m/026t6
/m/06rvn
/m/03t3fj
/m/02k_mr
/m/0bm02
/m/011k_j
/m/01p970
/m/01qbl
/m/03qtq
/m/01sm1g
/m/07brj
/m/05r5wn
/m/0xzly
/m/0mbct
/m/016622
/m/0j45pbj
/m/0dwsp
/m/0dwtp
/m/0dwt5
/m/0l156b
/m/05pd6
/m/01kcd
/m/0319l
/m/07gql
/m/07c6l
/m/0l14_3
/m/02qmj0d
/m/07y_7
/m/0d8_n
/m/01xqw
/m/02fsn
/m/085jw
/m/0l14j_
/m/06ncr
/m/01wy6
/m/03m5k
/m/0395lw
/m/03w41f
/m/027m70_
/m/0gy1t2s
/m/07n_g
/m/0f8s22
/m/026fgl
/m/0150b9
/m/03qjg
/m/0mkg
/m/0192l
/m/02bxd
/m/0l14l2
/m/07kc_
/m/0l14t7
/m/01hgjl
/m/064t9
/m/0glt670
/m/02cz_7
/m/06by7
/m/03lty
/m/05r6t
/m/0dls3
/m/0dl5d
/m/07sbbz2
/m/05w3f
/m/06j6l
/m/0gywn
/m/06cqb
/m/01lyv
/m/015y_n
/m/0gg8l
/m/02x8m
/m/02w4v
/m/06j64v
/m/03_d0
/m/026z9
/m/0ggq0m
/m/05lls
/m/02lkt
/m/03mb9
/m/07gxw
/m/07s72n
/m/0283d
/m/0m0jc
/m/08cyft
/m/0fd3y
/m/07lnk
/m/0g293
/m/0ln16
/m/0326g
/m/0155w
/m/05fw6t
/m/02v2lh
/m/0y4f8
/m/0z9c
/m/0164x2
/m/0145m
/m/02mscn
/m/016cjb
/m/028sqc
/m/015vgc
/m/0dq0md
/m/06rqw
/m/02p0sh1
/m/05rwpb
/m/074ft
/m/025td0t
/m/02cjck
/m/03r5q_
/m/0l14gg
/m/07pkxdp
/m/01z7dr
/m/0140xf
/m/0ggx5q
/m/04wptg
/t/dd00031
/t/dd00032
/t/dd00033
/t/dd00034
/t/dd00035
/t/dd00036
/t/dd00037
/m/03m9d0z
/m/09t49
/t/dd00092
/m/0jb2l
/m/0ngt1
/m/0838f
/m/06mb1
/m/07r10fb
/t/dd00038
/m/0j6m2
/m/0j2kx
/m/05kq4
/m/034srq
/m/06wzb
/m/07swgks
/m/02_41
/m/07pzfmf
/m/07yv9
/m/019jd
/m/0hsrw
/m/056ks2
/m/02rlv9
/m/06q74
/m/012f08
/m/0k4j
/m/0912c9
/m/07qv_d5
/m/02mfyn
/m/04gxbd
/m/07rknqz
/m/0h9mv
/t/dd00134
/m/0ltv
/m/07r04
/m/0gvgw0
/m/05x_td
/m/02rhddq
/m/03cl9h
/m/01bjv
/m/03j1ly
/m/04qvtq
/m/012n7d
/m/012ndj
/m/04_sv
/m/0btp2
/m/06d_3
/m/07jdr
/m/04zmvq
/m/0284vy3
/m/01g50p
/t/dd00048
/m/0195fx
/m/0k5j
/m/014yck
/m/04229
/m/02l6bg
/m/09ct_
/m/0cmf2
/m/0199g
/m/06_fw
/m/02mk9
/t/dd00065
/m/08j51y
/m/01yg9g
/m/01j4z9
/t/dd00066
/t/dd00067
/m/01h82_
/t/dd00130
/m/07pb8fc
/m/07q2z82
/m/02dgv
/m/03wwcy
/m/07r67yg
/m/02y_763
/m/07rjzl8
/m/07r4wb8
/m/07qcpgn
/m/07q6cd_
/m/0642b4
/m/0fqfqc
/m/04brg2
/m/023pjk
/m/07pn_8q
/m/0dxrf
/m/0fx9l
/m/02pjr4
/m/02jz0l
/m/0130jx
/m/03dnzn
/m/03wvsk
/m/01jt3m
/m/012xff
/m/04fgwm
/m/0d31p
/m/01s0vc
/m/03v3yw
/m/0242l
/m/01lsmm
/m/02g901
/m/05rj2
/m/0316dw
/m/0c2wf
/m/01m2v
/m/081rb
/m/07pp_mv
/m/07cx4
/m/07pp8cl
/m/01hnzm
/m/02c8p
/m/015jpf
/m/01z47d
/m/046dlr
/m/03kmc9
/m/0dgbq
/m/030rvx
/m/01y3hg
/m/0c3f7m
/m/04fq5q
/m/0l156k
/m/06hck5
/t/dd00077
/m/02bm9n
/m/01x3z
/m/07qjznt
/m/07qjznl
/m/0l7xg
/m/05zc1
/m/0llzx
/m/02x984l
/m/025wky1
/m/024dl
/m/01m4t
/m/0dv5r
/m/07bjf
/m/07k1x
/m/03l9g
/m/03p19w
/m/01b82r
/m/02p01q
/m/023vsd
/m/0_ksk
/m/01d380
/m/014zdl
/m/032s66
/m/04zjc
/m/02z32qm
/m/0_1c
/m/073cg4
/m/0g6b5
/g/122z_qxw
/m/07qsvvw
/m/07pxg6y
/m/07qqyl4
/m/083vt
/m/07pczhz
/m/07pl1bw
/m/07qs1cx
/m/039jq
/m/07q7njn
/m/07rn7sz
/m/04k94
/m/07rrlb6
/m/07p6mqd
/m/07qlwh6
/m/07r5v4s
/m/07prgkl
/m/07pqc89
/t/dd00088
/m/07p7b8y
/m/07qlf79
/m/07ptzwd
/m/07ptfmf
/m/0dv3j
/m/0790c
/m/0dl83
/m/07rqsjt
/m/07qnq_y
/m/07rrh0c
/m/0b_fwt
/m/02rr_
/m/07m2kt
/m/018w8
/m/07pws3f
/m/07ryjzk
/m/07rdhzs
/m/07pjjrj
/m/07pc8lb
/m/07pqn27
/m/07rbp7_
/m/07pyf11
/m/07qb_dv
/m/07qv4k0
/m/07pdjhy
/m/07s8j8t
/m/07plct2
/t/dd00112
/m/07qcx4z
/m/02fs_r
/m/07qwdck
/m/07phxs1
/m/07rv4dm
/m/07s02z0
/m/07qh7jl
/m/07qwyj0
/m/07s34ls
/m/07qmpdm
/m/07p9k1k
/m/07qc9xj
/m/07rwm0c
/m/07phhsh
/m/07qyrcz
/m/07qfgpx
/m/07rcgpl
/m/07p78v5
/t/dd00121
/m/07s12q4
/m/028v0c
/m/01v_m0
/m/0b9m1
/m/0hdsk
/m/0c1dj
/m/07pt_g0
/t/dd00125
/t/dd00126
/t/dd00127
/t/dd00128
/t/dd00129
/m/01b9nn
/m/01jnbd
/m/096m7z
/m/06_y0by
/m/07rgkc5
/m/06xkwv
/m/0g12c5
/m/08p9q4
/m/07szfh9
/m/0chx_
/m/0cj0r
/m/07p_0gm
/m/01jwx6
/m/07c52
/m/06bz3
/m/07hvw1
Speech
Male speech, man speaking
Female speech, woman speaking
Child speech, kid speaking
Conversation
Narration, monologue
Babbling
Speech synthesizer
Shout
Bellow
Whoop
Yell
Battle cry
Children shouting
Screaming
Whispering
Laughter
Baby laughter
Giggle
Snicker
Belly laugh
Chuckle, chortle
Crying, sobbing
Baby cry, infant cry
Whimper
Wail, moan
Sigh
Singing
Choir
Yodeling
Chant
Mantra
Male singing
Female singing
Child singing
Synthetic singing
Rapping
Humming
Groan
Grunt
Whistling
Breathing
Wheeze
Snoring
Gasp
Pant
Snort
Cough
Throat clearing
Sneeze
Sniff
Run
Shuffle
Walk, footsteps
Chewing, mastication
Biting
Gargling
Stomach rumble
Burping, eructation
Hiccup
Fart
Hands
Finger snapping
Clapping
Heart sounds, heartbeat
Heart murmur
Cheering
Applause
Chatter
Crowd
Hubbub, speech noise, speech babble
Children playing
Animal
Domestic animals, pets
Dog
Bark
Yip
Howl
Bow-wow
Growling
Whimper (dog)
Cat
Purr
Meow
Hiss
Caterwaul
Livestock, farm animals, working animals
Horse
Clip-clop
Neigh, whinny
Cattle, bovinae
Moo
Cowbell
Pig
Oink
Goat
Bleat
Sheep
Fowl
Chicken, rooster
Cluck
Crowing, cock-a-doodle-doo
Turkey
Gobble
Duck
Quack
Goose
Honk
Wild animals
Roaring cats (lions, tigers)
Roar
Bird
Bird vocalization, bird call, bird song
Chirp, tweet
Squawk
Pigeon, dove
Coo
Crow
Caw
Owl
Hoot
Bird flight, flapping wings
Canidae, dogs, wolves
Rodents, rats, mice
Mouse
Patter
Insect
Cricket
Mosquito
Fly, housefly
Buzz
Bee, wasp, etc.
Frog
Croak
Snake
Rattle
Whale vocalization
Music
Musical instrument
Plucked string instrument
Guitar
Electric guitar
Bass guitar
Acoustic guitar
Steel guitar, slide guitar
Tapping (guitar technique)
Strum
Banjo
Sitar
Mandolin
Zither
Ukulele
Keyboard (musical)
Piano
Electric piano
Organ
Electronic organ
Hammond organ
Synthesizer
Sampler
Harpsichord
Percussion
Drum kit
Drum machine
Drum
Snare drum
Rimshot
Drum roll
Bass drum
Timpani
Tabla
Cymbal
Hi-hat
Wood block
Tambourine
Rattle (instrument)
Maraca
Gong
Tubular bells
Mallet percussion
Marimba, xylophone
Glockenspiel
Vibraphone
Steelpan
Orchestra
Brass instrument
French horn
Trumpet
Trombone
Bowed string instrument
String section
Violin, fiddle
Pizzicato
Cello
Double bass
Wind instrument, woodwind instrument
Flute
Saxophone
Clarinet
Harp
Bell
Church bell
Jingle bell
Bicycle bell
Tuning fork
Chime
Wind chime
Change ringing (campanology)
Harmonica
Accordion
Bagpipes
Didgeridoo
Shofar
Theremin
Singing bowl
Scratching (performance technique)
Pop music
Hip hop music
Beatboxing
Rock music
Heavy metal
Punk rock
Grunge
Progressive rock
Rock and roll
Psychedelic rock
Rhythm and blues
Soul music
Reggae
Country
Swing music
Bluegrass
Funk
Folk music
Middle Eastern music
Jazz
Disco
Classical music
Opera
Electronic music
House music
Techno
Dubstep
Drum and bass
Electronica
Electronic dance music
Ambient music
Trance music
Music of Latin America
Salsa music
Flamenco
Blues
Music for children
New-age music
Vocal music
A capella
Music of Africa
Afrobeat
Christian music
Gospel music
Music of Asia
Carnatic music
Music of Bollywood
Ska
Traditional music
Independent music
Song
Background music
Theme music
Jingle (music)
Soundtrack music
Lullaby
Video game music
Christmas music
Dance music
Wedding music
Happy music
Funny music
Sad music
Tender music
Exciting music
Angry music
Scary music
Wind
Rustling leaves
Wind noise (microphone)
Thunderstorm
Thunder
Water
Rain
Raindrop
Rain on surface
Stream
Waterfall
Ocean
Waves, surf
Steam
Gurgling
Fire
Crackle
Vehicle
Boat, Water vehicle
Sailboat, sailing ship
Rowboat, canoe, kayak
Motorboat, speedboat
Ship
Motor vehicle (road)
Car
Vehicle horn, car horn, honking
Toot
Car alarm
Power windows, electric windows
Skidding
Tire squeal
Car passing by
Race car, auto racing
Truck
Air brake
Air horn, truck horn
Reversing beeps
Ice cream truck, ice cream van
Bus
Emergency vehicle
Police car (siren)
Ambulance (siren)
Fire engine, fire truck (siren)
Motorcycle
Traffic noise, roadway noise
Rail transport
Train
Train whistle
Train horn
Railroad car, train wagon
Train wheels squealing
Subway, metro, underground
Aircraft
Aircraft engine
Jet engine
Propeller, airscrew
Helicopter
Fixed-wing aircraft, airplane
Bicycle
Skateboard
Engine
Light engine (high frequency)
Dental drill, dentist's drill
Lawn mower
Chainsaw
Medium engine (mid frequency)
Heavy engine (low frequency)
Engine knocking
Engine starting
Idling
Accelerating, revving, vroom
Door
Doorbell
Ding-dong
Sliding door
Slam
Knock
Tap
Squeak
Cupboard open or close
Drawer open or close
Dishes, pots, and pans
Cutlery, silverware
Chopping (food)
Frying (food)
Microwave oven
Blender
Water tap, faucet
Sink (filling or washing)
Bathtub (filling or washing)
Hair dryer
Toilet flush
Toothbrush
Electric toothbrush
Vacuum cleaner
Zipper (clothing)
Keys jangling
Coin (dropping)
Scissors
Electric shaver, electric razor
Shuffling cards
Typing
Typewriter
Computer keyboard
Writing
Alarm
Telephone
Telephone bell ringing
Ringtone
Telephone dialing, DTMF
Dial tone
Busy signal
Alarm clock
Siren
Civil defense siren
Buzzer
Smoke detector, smoke alarm
Fire alarm
Foghorn
Whistle
Steam whistle
Mechanisms
Ratchet, pawl
Clock
Tick
Tick-tock
Gears
Pulleys
Sewing machine
Mechanical fan
Air conditioning
Cash register
Printer
Camera
Single-lens reflex camera
Tools
Hammer
Jackhammer
Sawing
Filing (rasp)
Sanding
Power tool
Drill
Explosion
Gunshot, gunfire
Machine gun
Fusillade
Artillery fire
Cap gun
Fireworks
Firecracker
Burst, pop
Eruption
Boom
Wood
Chop
Splinter
Crack
Glass
Chink, clink
Shatter
Liquid
Splash, splatter
Slosh
Squish
Drip
Pour
Trickle, dribble
Gush
Fill (with liquid)
Spray
Pump (liquid)
Stir
Boiling
Sonar
Arrow
Whoosh, swoosh, swish
Thump, thud
Thunk
Electronic tuner
Effects unit
Chorus effect
Basketball bounce
Bang
Slap, smack
Whack, thwack
Smash, crash
Breaking
Bouncing
Whip
Flap
Scratch
Scrape
Rub
Roll
Crushing
Crumpling, crinkling
Tearing
Beep, bleep
Ping
Ding
Clang
Squeal
Creak
Rustle
Whir
Clatter
Sizzle
Clicking
Clickety-clack
Rumble
Plop
Jingle, tinkle
Hum
Zing
Boing
Crunch
Silence
Sine wave
Harmonic
Chirp tone
Sound effect
Pulse
Inside, small room
Inside, large room or hall
Inside, public space
Outside, urban or manmade
Outside, rural or natural
Reverberation
Echo
Noise
Environmental noise
Static
Mains hum
Distortion
Sidetone
Cacophony
White noise
Pink noise
Throbbing
Vibration
Television
Radio
Field recording
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
sample_rate: 32000
window_size: 1024
hop_size: 640
mel_bins: 128
fmin: 50
fmax: 16000
batch_size : 64
val_batch_size: 8
num_classes: 527
mixup : True
mixup_alpha : 1.0
max_mel_len : 501
mel_crop_len : 480 # used for augmentation
balanced_sampling : True
epoch_num : 500
# for training from scratch
start_lr : 0.0003
warm_steps: 1000
# for fine-tuning
#start_lr : 0.00001
# add dropout in resnet preceding the fc layer
dropout: 0.20
# set the data path accordingly
balanced_train_h5 : './audioset/balanced_train.h5'
unbalanced_train_h5 : './audioset/unbalanced/*.h5'
balanced_eval_h5 : './audioset/balanced_eval.h5'
audioset_label: './assets/audioset_labels.txt'
model_dir : './checkpoints/'
log_path : './log'
checkpoint_step : 5000
lr_dec_per_step : 60000
num_workers : 0
max_time_mask: 5
max_freq_mask: 5
max_time_mask_width: 60
max_freq_mask_width: 60
model_type : 'resnet50' # resnet18,resnet50,resnet101
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import glob
import json
import os
import subprocess
import time
import warnings
import h5py
import librosa
import numpy as np
import paddle
import paddleaudio
import yaml
from paddle.io import DataLoader, Dataset, IterableDataset
from paddleaudio.utils import augments
from utils import get_labels, get_ytid_clsidx_mapping
def spect_permute(spect, tempo_axis, nblocks):
"""spectrogram permutaion"""
assert spect.ndim == 2., 'only supports 2d tensor or numpy array'
if tempo_axis == 0:
nt, nf = spect.shape
else:
nf, nt = spect.shape
if nblocks <= 1:
return spect
block_width = nt // nblocks + 1
if tempo_axis == 1:
blocks = [
spect[:, block_width * i:(i + 1) * block_width]
for i in range(nblocks)
]
np.random.shuffle(blocks)
new_spect = np.concatenate(blocks, 1)
else:
blocks = [
spect[block_width * i:(i + 1) * block_width, :]
for i in range(nblocks)
]
np.random.shuffle(blocks)
new_spect = np.concatenate(blocks, 0)
return new_spect
def random_choice(a):
i = np.random.randint(0, high=len(a))
return a[int(i)]
def get_keys(file_pointers):
all_keys = []
key2file = {}
for fp in file_pointers:
all_keys += list(fp.keys())
key2file.update({k: fp for k in fp.keys()})
return all_keys, key2file
class H5AudioSet(Dataset):
"""
Dataset class for Audioset, with mel features stored in multiple hdf5 files.
The h5 files store mel-spectrogram features pre-extracted from wav files.
Use wav2mel.py to do feature extraction.
"""
def __init__(self,
h5_files,
config,
augment=True,
training=True,
balanced_sampling=True):
super(H5AudioSet, self).__init__()
self.h5_files = h5_files
self.config = config
self.file_pointers = [h5py.File(f) for f in h5_files]
self.all_keys, self.key2file = get_keys(self.file_pointers)
self.augment = augment
self.training = training
self.balanced_sampling = balanced_sampling
print(
f'{len(self.h5_files)} h5 files, totally {len(self.all_keys)} audio files listed'
)
self.ytid2clsidx, self.clsidx2ytid = get_ytid_clsidx_mapping()
def _process(self, x):
assert x.shape[0] == self.config[
'mel_bins'], 'the first dimension must be mel frequency'
target_len = self.config['max_mel_len']
if x.shape[1] <= target_len:
pad_width = (target_len - x.shape[1]) // 2 + 1
x = np.pad(x, ((0, 0), (pad_width, pad_width)))
x = x[:, :target_len]
if self.training and self.augment:
x = augments.random_crop2d(x,
self.config['mel_crop_len'],
tempo_axis=1)
x = spect_permute(x, tempo_axis=1, nblocks=random_choice([0, 2, 3]))
aug_level = random_choice([0.2, 0.1, 0])
x = augments.adaptive_spect_augment(x,
tempo_axis=1,
level=aug_level)
return x.T
def __getitem__(self, idx):
if self.balanced_sampling:
cls_id = int(np.random.randint(0, self.config['num_classes']))
keys = self.clsidx2ytid[cls_id]
k = random_choice(self.all_keys)
cls_ids = self.ytid2clsidx[k]
else:
idx = idx % len(self.all_keys)
k = self.all_keys[idx]
cls_ids = self.ytid2clsidx[k]
fp = self.key2file[k]
x = fp[k][:, :]
x = self._process(x)
y = np.zeros((self.config['num_classes'], ), 'float32')
y[cls_ids] = 1.0
return x, y
def __len__(self):
return len(self.all_keys)
def get_ytid2labels(segment_csv):
"""
compute the mapping (dict object) from youtube id to audioset labels.
"""
with open(segment_csv) as F:
lines = F.read().split('\n')
lines = [l for l in lines if len(l) > 0 and l[0] != '#']
ytid2labels = {l.split(',')[0]: l.split('"')[-2] for l in lines}
return ytid2labels
def worker_init(worker_id):
time.sleep(worker_id / 32)
np.random.seed(int(time.time()) % 100 + worker_id)
def get_train_loader(config):
train_h5_files = glob.glob(config['unbalanced_train_h5'])
train_h5_files += [config['balanced_train_h5']]
train_dataset = H5AudioSet(train_h5_files,
config,
balanced_sampling=config['balanced_sampling'],
augment=True,
training=True)
train_loader = DataLoader(train_dataset,
shuffle=True,
batch_size=config['batch_size'],
drop_last=True,
num_workers=config['num_workers'],
use_buffer_reader=True,
use_shared_memory=True,
worker_init_fn=worker_init)
return train_loader
def get_val_loader(config):
val_dataset = H5AudioSet([config['balanced_eval_h5']],
config,
balanced_sampling=False,
augment=False)
val_loader = DataLoader(val_dataset,
shuffle=False,
batch_size=config['val_batch_size'],
drop_last=False,
num_workers=config['num_workers'])
return val_loader
if __name__ == '__main__':
# do some testing here
with open('./assets/config.yaml') as f:
config = yaml.safe_load(f)
train_h5_files = glob.glob(config['unbalanced_train_h5'])
dataset = H5AudioSet(train_h5_files,
config,
balanced_sampling=True,
augment=True,
training=True)
x, y = dataset[1]
print(x.shape, y.shape)
dataset = H5AudioSet([config['balanced_eval_h5']],
config,
balanced_sampling=False,
augment=False)
x, y = dataset[0]
print(x.shape, y.shape)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import numpy as np
import paddle
import paddle.nn.functional as F
import paddleaudio as pa
import yaml
from dataset import get_val_loader
from model import resnet50
from paddle.utils import download
from sklearn.metrics import average_precision_score, roc_auc_score
from utils import compute_dprime, download_assets
checkpoint_url = 'https://bj.bcebos.com/paddleaudio/paddleaudio/resnet50_weight_averaging_mAP0.416.pdparams'
def evaluate(epoch, val_loader, model, loss_fn):
model.eval()
avg_loss = 0.0
all_labels = []
all_preds = []
for batch_id, (x, y) in enumerate(val_loader()):
x = x.unsqueeze((1))
label = y
logits = model(x)
loss_val = loss_fn(logits, label)
pred = F.sigmoid(logits)
all_labels += [label.numpy()]
all_preds += [pred.numpy()]
avg_loss = (avg_loss * batch_id + loss_val.numpy()[0]) / (1 + batch_id)
msg = f'eval epoch:{epoch}, batch:{batch_id}'
msg += f'|{len(val_loader)}'
msg += f',loss:{avg_loss:.3}'
if batch_id % 20 == 0:
print(msg)
all_preds = np.concatenate(all_preds, 0)
all_labels = np.concatenate(all_labels, 0)
mAP_score = np.mean(
average_precision_score(all_labels, all_preds, average=None))
auc_score = np.mean(roc_auc_score(all_labels, all_preds, average=None))
dprime = compute_dprime(auc_score)
return avg_loss, mAP_score, auc_score, dprime
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Audioset inference')
parser.add_argument('--config',
type=str,
required=False,
default='./assets/config.yaml')
parser.add_argument('--device',
help='set the gpu device number',
type=int,
required=False,
default=0)
parser.add_argument('--weight', type=str, required=False, default='')
args = parser.parse_args()
download_assets()
with open(args.config) as f:
c = yaml.safe_load(f)
paddle.set_device('gpu:{}'.format(args.device))
ModelClass = eval(c['model_type'])
model = ModelClass(pretrained=False,
num_classes=c['num_classes'],
dropout=c['dropout'])
if args.weight.strip() == '':
print(f'Using pretrained weight: {checkpoint_url}')
args.weight = download.get_weights_path_from_url(checkpoint_url)
model.load_dict(paddle.load(args.weight))
model.eval()
val_loader = get_val_loader(c)
print(f'Evaluating...')
avg_loss, mAP_score, auc_score, dprime = evaluate(
0, val_loader, model, F.binary_cross_entropy_with_logits)
print(f'mAP: {mAP_score:.3}')
print(f'auc: {auc_score:.3}')
print(f'd-prime: {dprime:.3}')
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import numpy as np
import paddle
import paddle.nn.functional as F
import paddleaudio as pa
import yaml
from model import resnet50
from paddle.utils import download
from paddleaudio.functional import melspectrogram
from utils import (download_assets, get_label_name_mapping, get_labels,
get_metrics)
download_assets()
checkpoint_url = 'https://bj.bcebos.com/paddleaudio/paddleaudio/resnet50_weight_averaging_mAP0.416.pdparams'
def load_and_extract_feature(file, c):
s, _ = pa.load(file, sr=c['sample_rate'])
x = melspectrogram(paddle.to_tensor(s),
sr=c['sample_rate'],
win_length=c['window_size'],
n_fft=c['window_size'],
hop_length=c['hop_size'],
n_mels=c['mel_bins'],
f_min=c['fmin'],
f_max=c['fmax'],
window='hann',
center=True,
pad_mode='reflect',
to_db=True,
amin=1e-3,
top_db=None)
x = x.transpose((0, 2, 1))
x = x.unsqueeze((0, ))
return x
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Audioset inference')
parser.add_argument('--device',
help='set the gpu device number',
type=int,
required=False,
default=0)
parser.add_argument('--config',
type=str,
required=False,
default='./assets/config.yaml')
parser.add_argument('--weight', type=str, required=False, default='')
parser.add_argument('--wav_file',
type=str,
required=False,
default='./wav/TKtNAJa-mbQ_11.000.wav')
parser.add_argument('--top_k', type=int, required=False, default=20)
args = parser.parse_args()
top_k = args.top_k
label2name, name2label = get_label_name_mapping()
with open(args.config) as f:
c = yaml.safe_load(f)
paddle.set_device('gpu:{}'.format(args.device))
ModelClass = eval(c['model_type'])
model = ModelClass(pretrained=False,
num_classes=c['num_classes'],
dropout=c['dropout'])
if args.weight.strip() == '':
args.weight = download.get_weights_path_from_url(checkpoint_url)
model.load_dict(paddle.load(args.weight))
model.eval()
x = load_and_extract_feature(args.wav_file, c)
labels = get_labels()
logits = model(x)
pred = F.sigmoid(logits)
pred = pred[0].cpu().numpy()
clsidx = np.argsort(pred)[-top_k:][::-1]
probs = np.sort(pred)[-top_k:][::-1]
for i, idx in enumerate(clsidx):
name = label2name[labels[idx]]
print(name, probs[i])
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division, print_function
import paddle
import paddle.nn as nn
from paddle.utils.download import get_weights_path_from_url
__all__ = [
'ResNet', 'resnet18', 'resnet34', 'resnet50', 'resnet101', 'resnet152'
]
model_urls = {
'resnet18': ('https://paddle-hapi.bj.bcebos.com/models/resnet18.pdparams',
'cf548f46534aa3560945be4b95cd11c4'),
'resnet34': ('https://paddle-hapi.bj.bcebos.com/models/resnet34.pdparams',
'8d2275cf8706028345f78ac0e1d31969'),
'resnet50': ('https://paddle-hapi.bj.bcebos.com/models/resnet50.pdparams',
'ca6f485ee1ab0492d38f323885b0ad80'),
'resnet101': ('https://paddle-hapi.bj.bcebos.com/models/resnet101.pdparams',
'02f35f034ca3858e1e54d4036443c92d'),
'resnet152': ('https://paddle-hapi.bj.bcebos.com/models/resnet152.pdparams',
'7ad16a2f1e7333859ff986138630fd7a'),
}
class BasicBlock(nn.Layer):
expansion = 1
def __init__(self,
inplanes,
planes,
stride=1,
downsample=None,
groups=1,
base_width=64,
dilation=1,
norm_layer=None):
super(BasicBlock, self).__init__()
if norm_layer is None:
norm_layer = nn.BatchNorm2D
if dilation > 1:
raise NotImplementedError(
"Dilation > 1 not supported in BasicBlock")
self.conv1 = nn.Conv2D(inplanes,
planes,
3,
padding=1,
stride=stride,
bias_attr=False)
self.bn1 = norm_layer(planes)
self.relu = nn.ReLU()
self.conv2 = nn.Conv2D(planes, planes, 3, padding=1, bias_attr=False)
self.bn2 = norm_layer(planes)
self.downsample = downsample
self.stride = stride
def forward(self, x):
identity = x
out = self.conv1(x)
out = self.bn1(out)
out = self.relu(out)
out = self.conv2(out)
out = self.bn2(out)
if self.downsample is not None:
identity = self.downsample(x)
out += identity
out = self.relu(out)
return out
class BottleneckBlock(nn.Layer):
expansion = 4
def __init__(self,
inplanes,
planes,
stride=1,
downsample=None,
groups=1,
base_width=64,
dilation=1,
norm_layer=None):
super(BottleneckBlock, self).__init__()
if norm_layer is None:
norm_layer = nn.BatchNorm2D
width = int(planes * (base_width / 64.)) * groups
self.conv1 = nn.Conv2D(inplanes, width, 1, bias_attr=False)
self.bn1 = norm_layer(width)
self.conv2 = nn.Conv2D(width,
width,
3,
padding=dilation,
stride=stride,
groups=groups,
dilation=dilation,
bias_attr=False)
self.bn2 = norm_layer(width)
self.conv3 = nn.Conv2D(width,
planes * self.expansion,
1,
bias_attr=False)
self.bn3 = norm_layer(planes * self.expansion)
self.relu = nn.ReLU()
self.downsample = downsample
self.stride = stride
def forward(self, x):
identity = x
out = self.conv1(x)
out = self.bn1(out)
out = self.relu(out)
out = self.conv2(out)
out = self.bn2(out)
out = self.relu(out)
out = self.conv3(out)
out = self.bn3(out)
if self.downsample is not None:
identity = self.downsample(x)
out += identity
out = self.relu(out)
return out
class ResNet(nn.Layer):
"""ResNet model from
`"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_
Args:
Block (BasicBlock|BottleneckBlock): block module of model.
depth (int): layers of resnet, default: 50.
num_classes (int): output dim of last fc layer. If num_classes <=0, last fc layer
will not be defined. Default: 1000.
with_pool (bool): use pool before the last fc layer or not. Default: True.
Examples:
.. code-block:: python
from paddle.vision.models import ResNet
from paddle.vision.models.resnet import BottleneckBlock, BasicBlock
resnet50 = ResNet(BottleneckBlock, 50)
resnet18 = ResNet(BasicBlock, 18)
"""
def __init__(self,
block,
depth,
num_classes=1000,
with_pool=True,
dropout=0.5):
super(ResNet, self).__init__()
layer_cfg = {
18: [2, 2, 2, 2],
34: [3, 4, 6, 3],
50: [3, 4, 6, 3],
101: [3, 4, 23, 3],
152: [3, 8, 36, 3]
}
layers = layer_cfg[depth]
self.num_classes = num_classes
self.with_pool = with_pool
self._norm_layer = nn.BatchNorm2D
self.inplanes = 64
self.dilation = 1
self.bn0 = nn.BatchNorm2D(128)
self.conv1 = nn.Conv2D(1,
self.inplanes,
kernel_size=7,
stride=2,
padding=3,
bias_attr=False)
self.bn1 = self._norm_layer(self.inplanes)
self.relu = nn.ReLU()
self.relu2 = nn.ReLU()
self.maxpool = nn.MaxPool2D(kernel_size=3, stride=2, padding=1)
self.layer1 = self._make_layer(block, 64, layers[0])
self.drop1 = nn.Dropout2D(dropout)
self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
self.drop2 = nn.Dropout2D(dropout)
self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
self.drop3 = nn.Dropout2D(dropout)
self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
self.drop4 = nn.Dropout2D(dropout)
self.drop = nn.Dropout(dropout)
self.extra_fc = nn.Linear(512 * block.expansion, 1024 * 2)
if with_pool:
self.avgpool = nn.AdaptiveAvgPool2D((1, 1))
if num_classes > 0:
self.fc = nn.Linear(1024 * 2, num_classes)
def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
norm_layer = self._norm_layer
downsample = None
previous_dilation = self.dilation
if dilate:
self.dilation *= stride
stride = 1
if stride != 1 or self.inplanes != planes * block.expansion:
downsample = nn.Sequential(
nn.Conv2D(self.inplanes,
planes * block.expansion,
1,
stride=stride,
bias_attr=False),
norm_layer(planes * block.expansion),
)
layers = []
layers.append(
block(self.inplanes, planes, stride, downsample, 1, 64,
previous_dilation, norm_layer))
self.inplanes = planes * block.expansion
for _ in range(1, blocks):
layers.append(block(self.inplanes, planes, norm_layer=norm_layer))
return nn.Sequential(*layers)
def forward(self, x):
x = x.transpose([0, 3, 2, 1])
x = self.bn0(x)
x = x.transpose([0, 3, 2, 1])
x = self.conv1(x)
x = self.bn1(x)
x = self.relu(x)
x = self.maxpool(x)
x = self.layer1(x)
x = self.drop1(x)
x = self.layer2(x)
x = self.drop2(x)
x = self.layer3(x)
x = self.drop3(x)
x = self.layer4(x)
x = self.drop4(x)
if self.with_pool:
x = self.avgpool(x)
if self.num_classes > 0:
x = paddle.flatten(x, 1)
x = self.drop(x)
x = self.extra_fc(x)
x = self.relu2(x)
x = self.fc(x)
return x
def _resnet(arch, Block, depth, pretrained, **kwargs):
model = ResNet(Block, depth, **kwargs)
if pretrained:
assert arch in model_urls, "{} model do not have a pretrained model now, you should set pretrained=False".format(
arch)
weight_path = get_weights_path_from_url(model_urls[arch][0],
model_urls[arch][1])
param = paddle.load(weight_path)
model.set_dict(param)
return model
def resnet18(pretrained=False, **kwargs):
"""ResNet 18-layer model
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
Examples:
.. code-block:: python
from paddle.vision.models import resnet18
# build model
model = resnet18()
# build model and load imagenet pretrained weight
# model = resnet18(pretrained=True)
"""
return _resnet('resnet18', BasicBlock, 18, pretrained, **kwargs)
def resnet34(pretrained=False, **kwargs):
"""ResNet 34-layer model
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
Examples:
.. code-block:: python
from paddle.vision.models import resnet34
# build model
model = resnet34()
# build model and load imagenet pretrained weight
# model = resnet34(pretrained=True)
"""
return _resnet('resnet34', BasicBlock, 34, pretrained, **kwargs)
def resnet50(pretrained=False, **kwargs):
"""ResNet 50-layer model
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
Examples:
.. code-block:: python
from paddle.vision.models import resnet50
# build model
model = resnet50()
# build model and load imagenet pretrained weight
# model = resnet50(pretrained=True)
"""
return _resnet('resnet50', BottleneckBlock, 50, pretrained, **kwargs)
def resnet101(pretrained=False, **kwargs):
"""ResNet 101-layer model
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
Examples:
.. code-block:: python
from paddle.vision.models import resnet101
# build model
model = resnet101()
# build model and load imagenet pretrained weight
# model = resnet101(pretrained=True)
"""
return _resnet('resnet101', BottleneckBlock, 101, pretrained, **kwargs)
def resnet152(pretrained=False, **kwargs):
"""ResNet 152-layer model
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
Examples:
.. code-block:: python
from paddle.vision.models import resnet152
# build model
model = resnet152()
# build model and load imagenet pretrained weight
# model = resnet152(pretrained=True)
"""
return _resnet('resnet152', BottleneckBlock, 152, pretrained, **kwargs)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import glob
import os
import time
import numpy as np
import paddle
import paddle.distributed as dist
import paddle.nn.functional as F
import yaml
from dataset import get_train_loader, get_val_loader
from evaluate import evaluate
from model import resnet18, resnet50, resnet101
from paddle.io import DataLoader, Dataset, IterableDataset
from paddle.optimizer import Adam
from utils import (MixUpLoss, get_metrics, load_checkpoint, mixup_data,
save_checkpoint)
from visualdl import LogWriter
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Audioset training')
parser.add_argument('--device', type=int, required=False, default=1)
parser.add_argument('--restore', type=int, required=False, default=-1)
parser.add_argument('--config',
type=str,
required=False,
default='./assets/config.yaml')
parser.add_argument('--distributed', type=int, required=False, default=0)
args = parser.parse_args()
with open(args.config) as f:
c = yaml.safe_load(f)
log_writer = LogWriter(logdir=c['log_path'])
prefix = 'mixup_{}'.format(c['model_type'])
if args.distributed != 0:
dist.init_parallel_env()
local_rank = dist.get_rank()
else:
paddle.set_device('gpu:{}'.format(args.device))
local_rank = 0
print('using ' + c['model_type'])
ModelClass = eval(c['model_type'])
#define loss
bce_loss = F.binary_cross_entropy_with_logits
loss_fn = MixUpLoss(bce_loss)
warm_steps = c['warm_steps']
lrs = np.linspace(1e-10, c['start_lr'], warm_steps)
# restore checkpoint
if args.restore != -1:
model = ModelClass(pretrained=False,
num_classes=c['num_classes'],
dropout=c['dropout'])
model_dict, optim_dict = load_checkpoint(c['model_dir'], args.restore,
prefix)
model.load_dict(model_dict)
optimizer = Adam(learning_rate=c['start_lr'],
parameters=model.parameters())
optimizer.set_state_dict(optim_dict)
start_epoch = args.restore
else:
model = ModelClass(pretrained=True,
num_classes=c['num_classes'],
dropout=c['dropout']) # use imagenet pretrained
optimizer = Adam(learning_rate=c['start_lr'],
parameters=model.parameters())
start_epoch = 0
#for name,p in list(model.named_parameters())[:-2]:
# print(name,p.stop_gradient)
# p.stop_gradient = True
os.makedirs(c['model_dir'], exist_ok=True)
if args.distributed != 0:
model = paddle.DataParallel(model)
train_loader = get_train_loader(c)
val_loader = get_val_loader(c)
epoch_num = c['epoch_num']
if args.restore != -1:
avg_loss, mAP_score, auc_score, dprime = evaluate(
args.restore, val_loader, model, bce_loss)
print(f'average map at epoch {args.restore} is {mAP_score}')
print(f'auc_score: {auc_score}')
print(f'd-prime: {dprime}')
best_mAP = mAP_score
log_writer.add_scalar(tag="eval mAP",
step=args.restore,
value=mAP_score)
log_writer.add_scalar(tag="eval auc",
step=args.restore,
value=auc_score)
log_writer.add_scalar(tag="eval dprime",
step=args.restore,
value=dprime)
else:
best_mAP = 0.0
step = 0
for epoch in range(start_epoch, epoch_num):
avg_loss = 0.0
avg_preci = 0.0
avg_recall = 0.0
model.train()
model.clear_gradients()
t0 = time.time()
for batch_id, (x, y) in enumerate(train_loader()):
if step < warm_steps:
optimizer.set_lr(lrs[step])
x.stop_gradient = False
if c['balanced_sampling']:
x = x.squeeze()
y = y.squeeze()
x = x.unsqueeze((1))
if c['mixup']:
mixed_x, mixed_y = mixup_data(x, y, c['mixup_alpha'])
logits = model(mixed_x)
loss_val = loss_fn(logits, mixed_y)
loss_val.backward()
else:
logits = model(x)
loss_val = bce_loss(logits, y)
loss_val.backward()
optimizer.step()
model.clear_gradients()
pred = F.sigmoid(logits)
preci, recall = get_metrics(y.squeeze().numpy(), pred.numpy())
avg_loss = (avg_loss * batch_id + loss_val.numpy()[0]) / (1 +
batch_id)
avg_preci = (avg_preci * batch_id + preci) / (1 + batch_id)
avg_recall = (avg_recall * batch_id + recall) / (1 + batch_id)
elapsed = (time.time() - t0) / 3600
remain = elapsed / (1 + batch_id) * (len(train_loader) - batch_id)
msg = f'epoch:{epoch}, batch:{batch_id}'
msg += f'|{len(train_loader)}'
msg += f',loss:{avg_loss:.3}'
msg += f',recall:{avg_recall:.3}'
msg += f',preci:{avg_preci:.3}'
msg += f',elapsed:{elapsed:.1}h'
msg += f',remained:{remain:.1}h'
if batch_id % 20 == 0 and local_rank == 0:
print(msg)
log_writer.add_scalar(tag="train loss",
step=step,
value=avg_loss)
log_writer.add_scalar(tag="train preci",
step=step,
value=avg_preci)
log_writer.add_scalar(tag="train recall",
step=step,
value=avg_recall)
step += 1
if step % c['checkpoint_step'] == 0 and local_rank == 0:
save_checkpoint(c['model_dir'], epoch, model, optimizer, prefix)
avg_loss, avg_map, auc_score, dprime = evaluate(
epoch, val_loader, model, bce_loss)
print(f'average map at epoch {epoch} is {avg_map}')
print(f'auc: {auc_score}')
print(f'd-prime: {dprime}')
log_writer.add_scalar(tag="eval mAP", step=epoch, value=avg_map)
log_writer.add_scalar(tag="eval auc",
step=epoch,
value=auc_score)
log_writer.add_scalar(tag="eval dprime",
step=epoch,
value=dprime)
model.train()
model.clear_gradients()
if avg_map > best_mAP:
print('mAP improved from {} to {}'.format(
best_mAP, avg_map))
best_mAP = avg_map
fn = os.path.join(
c['model_dir'],
f'{prefix}_epoch{epoch}_mAP{avg_map:.3}.pdparams')
paddle.save(model.state_dict(), fn)
else:
print(f'mAP {avg_map} did not improve from {best_mAP}')
if step % c['lr_dec_per_step'] == 0 and step != 0:
if optimizer.get_lr() <= 1e-6:
factor = 0.95
else:
factor = 0.8
optimizer.set_lr(optimizer.get_lr() * factor)
print('decreased lr to {}'.format(optimizer.get_lr()))
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
import numpy as np
import paddle
import paddle.nn.functional as F
from scipy import stats
from sklearn.metrics import average_precision_score
__all__ = [
'save_checkpoint', 'load_checkpoint', 'get_labels', 'random_choice',
'get_label_name_mapping'
]
def random_choice(a):
i = np.random.randint(0, high=len(a), size=(1, ))
return a[int(i)]
def get_labels():
with open('./assets/audioset_labels.txt') as F:
labels = F.read().split('\n')
return labels
def get_ytid_clsidx_mapping():
"""
Compute the mapping between youtube id and class index.
The class index range from 0 to 527, correspoding to the labels stored in audioset_labels.txt file
"""
labels = get_labels()
label2clsidx = {l: i for i, l in enumerate(labels)}
lines = open('./assets/unbalanced_train_segments.csv').read().split('\n')
lines += open('./assets/balanced_train_segments.csv').read().split('\n')
lines += open('./assets/eval_segments.csv').read().split('\n')
lines = [l for l in lines if len(l) > 0 and l[0] != '#']
ytid2clsidx = {}
for l in lines:
ytid = l.split(',')[0]
labels = l.split(',')[3:]
cls_idx = []
for label in labels:
label = label.replace('"', '').strip()
cls_idx.append(label2clsidx[label])
ytid2clsidx.update({ytid: cls_idx})
clsidx2ytid = {i: [] for i in range(527)}
for k in ytid2clsidx.keys():
for v in ytid2clsidx[k]:
clsidx2ytid[v] += [k]
return ytid2clsidx, clsidx2ytid
def get_metrics(label, pred):
a = label
b = (pred > 0.5).astype('int32')
eps = 1e-8
tp = np.sum(b[a == 1])
fp = np.sum(b[a == 0])
precision = tp / (fp + tp + eps)
fn = np.sum(b[a == 1] == 0)
recall = tp / (tp + fn + eps)
return precision, recall
def compute_dprime(auc):
"""Compute d_prime metric.
Reference:
J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 776–780.
"""
dp = stats.norm().ppf(auc) * np.sqrt(2.0)
return dp
def save_checkpoint(model_dir, step, model, optimizer, prefix):
print(f'checkpointing at step {step}')
paddle.save(model.state_dict(),
model_dir + '/{}_checkpoint{}.pdparams'.format(prefix, step))
paddle.save(optimizer.state_dict(),
model_dir + '/{}_checkpoint{}.pdopt'.format(prefix, step))
def load_checkpoint(model_dir, epoch, prefix):
file = model_dir + '/{}_checkpoint{}'.format(prefix, epoch)
print('loading checkpoint ' + file)
model_dict = paddle.load(file + '.pdparams')
optim_dict = paddle.load(file + '.pdopt')
return model_dict, optim_dict
def get_label_name_mapping():
with open('./assets/ontology.json') as F:
ontology = json.load(F)
label2name = {o['id']: o['name'] for o in ontology}
name2label = {o['name']: o['id'] for o in ontology}
return label2name, name2label
def download_assets():
os.makedirs('./assets/', exist_ok=True)
urls = [
'https://raw.githubusercontent.com/audioset/ontology/master/ontology.json',
'http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/eval_segments.csv',
'http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/balanced_train_segments.csv',
'http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/unbalanced_train_segments.csv'
]
for url in urls:
fname = './assets/' + url.split('/')[-1]
if os.path.exists(fname):
continue
cmd = 'wget ' + url + ' -O ' + fname
print(cmd)
os.system(cmd)
class MixUpLoss(paddle.nn.Layer):
"""Define the mixup loss used in training audioset.
Reference:
Zhang, Hongyi, et al. “Mixup: Beyond Empirical Risk Minimization.” International Conference on Learning Representations, 2017.
"""
def __init__(self, criterion):
super(MixUpLoss, self).__init__()
self.criterion = criterion
def forward(self, pred, mixup_target):
assert isinstance(mixup_target, (tuple, list)) and len(
mixup_target) == 3, 'mixup_target should be a tuple or list of (y_a, y_b, lam)'
ya, yb, lamda = mixup_target
return lamda * self.criterion(pred, ya) \
+ (1 - lamda) * self.criterion(pred, yb)
def extra_repr(self):
return 'MixUpLoss with {}'.format(self.criterion)
def mixup_data(x, y, alpha=1.0):
"""Mix the input data and label using mixup strategy, returns mixed inputs,
pairs of targets, and lambda
Reference:
Zhang, Hongyi, et al. “Mixup: Beyond Empirical Risk Minimization.” International Conference on Learning Representations, 2017.
"""
if alpha > 0:
lam = np.random.beta(alpha, alpha)
else:
lam = 1
batch_size = x.shape[0]
index = paddle.randperm(batch_size)
mixed_x = lam * x + (1 - lam) * paddle.index_select(x, index)
y_a, y_b = y, paddle.index_select(y, index)
mixed_target = (y_a, y_b, lam)
return mixed_x, mixed_target
import argparse
import glob
import os
import h5py
import numpy as np
import paddle
import paddleaudio as pa
import tqdm
from paddleaudio.functional import melspectrogram
parser = argparse.ArgumentParser(description='wave2mel')
parser.add_argument('--wav_file', type=str, required=False, default='')
parser.add_argument('--wav_list', type=str, required=False, default='')
parser.add_argument('--wav_h5_file', type=str, required=False, default='')
parser.add_argument('--wav_h5_list', type=str, required=False, default='')
parser.add_argument('--output_folder', type=str, required=False, default='./')
parser.add_argument('--output_h5', type=bool, required=False, default=True)
parser.add_argument('--dst_h5_file', type=str, required=False, default='')
parser.add_argument('--sample_rate', type=int, required=False, default=32000)
parser.add_argument('--window_size', type=int, required=False, default=1024)
parser.add_argument('--mel_bins', type=int, required=False, default=128)
parser.add_argument('--hop_length', type=int, required=False,
default=640) #20ms
parser.add_argument('--fmin', type=int, required=False, default=50)
parser.add_argument('--fmax', type=int, required=False, default=16000)
parser.add_argument('--skip_existed', type=int, required=False, default=1)
args = parser.parse_args()
assert not (args.wav_h5_file == '' and args.wav_h5_list == ''
and args.wav_list == '' and args.wav_file == ''), \
'one of wav_file, wav_list, wav_h5_file, wav_h5_list needs to be specified'
h5_files = []
wav_files = []
if args.wav_h5_file != '':
h5_files = [args.wav_h5_file]
elif args.wav_h5_list != '':
h5_files = open(args.wav_h5_list).read().split('\n')
h5_files = [h for h in h5_files if len(h.strip()) != 0]
elif args.wav_list != '':
wav_files = open(args.wav_list).read().split('\n')
wav_files = [h for h in wav_files if len(h.strip()) != 0]
elif args.wav_file != '':
wav_files = [args.wav_file]
dst_folder = args.output_folder
if len(h5_files) > 0:
print(f'{len(h5_files)} h5 files listed')
for f in h5_files:
print(f'processing {f}')
dst_file = os.path.join(dst_folder, f.split('/')[-1])
print(f'target file {dst_file}')
if args.skip_existed != 0 and os.path.exists(dst_file):
print(f'skipped file {f}')
continue
assert not os.path.exists(dst_file), f'target file {dst_file} already exists'
src_h5 = h5py.File(f, 'r')
dst_h5 = h5py.File(dst_file, "w")
for key in tqdm.tqdm(src_h5.keys()):
s = src_h5[key][:]
s = pa.depth_convert(s, 'float32')
# s = pa.resample(s,32000,args.sample_rate)
x = melspectrogram(paddle.to_tensor(s),
sr=args.sample_rate,
win_length=args.window_size,
n_fft=args.window_size,
hop_length=args.hop_length,
n_mels=args.mel_bins,
f_min=args.fmin,
f_max=args.fmax,
window='hann',
center=True,
pad_mode='reflect',
to_db=True,
amin=1e-3,
top_db=None)
dst_h5.create_dataset(key, data=x[0].numpy())
src_h5.close()
dst_h5.close()
if len(wav_files) > 0:
assert args.dst_h5_file != '', 'when using wav_file or wav_list, dst_h5_file must be specified'
dst_file = args.dst_h5_file
assert not os.path.exists(dst_file), f'target file {dst_file} already exists'
dst_h5 = h5py.File(dst_file, "w")
print(f'{len(wav_files)} wav files listed')
for f in tqdm.tqdm(wav_files):
s, _ = pa.load(f, sr=args.sample_rate)
x = melspectrogram(paddle.to_tensor(s),
sr=args.sample_rate,
win_length=args.window_size,
n_fft=args.window_size,
hop_length=args.hop_length,
n_mels=args.mel_bins,
f_min=args.fmin,
f_max=args.fmax,
window='hann',
center=True,
pad_mode='reflect',
to_db=True,
amin=1e-3,
top_db=None)
# figure(figsize=(8,8))
# imshow(x)
# show()
# print(x.shape)
key = f.split('/')[-1][:11]
dst_h5.create_dataset(key, data=x[0].numpy())
dst_h5.close()
# Audio Tagging
The sound classification task itself is single-label, but a piece of audio can naturally carry multiple labels. For example, a recording made in an ordinary office may contain people speaking, keyboard typing, mouse clicks, and other background sounds of the room. For general-purpose sound recognition and sound detection, predicting multiple labels for a piece of audio is therefore highly practical.
At IEEE ICASSP 2017, Google released the large-scale audio dataset [Audioset](https://research.google.com/audioset/). The dataset covers 632 audio classes and contains 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. It currently provides about 2.1 million annotated videos and 5,800 hours of audio, and the labeled sound samples use 527 label classes.
`PANNs` ([PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition](https://arxiv.org/pdf/1912.10211.pdf)) are sound classification/recognition models trained on the Audioset dataset. Since their pretraining task is multi-label sound recognition, they can be used for real-time audio tagging.
This example uses a pretrained `PANNs` model to tag the input audio in real time against the Audioset label set, and finally outputs, as text, the top k classes and their scores at each time step.
## Model Introduction
PaddleAudio provides pretrained CNN14, CNN10 and CNN6 models of PANNs for users to choose from:
- CNN14: mainly 12 convolutional layers and 2 fully connected layers, 79.6M parameters, embedding size 2048.
- CNN10: mainly 8 convolutional layers and 2 fully connected layers, 4.9M parameters, embedding size 512.
- CNN6: mainly 4 convolutional layers and 2 fully connected layers, 4.5M parameters, embedding size 512.
## Quick Start
### Model Prediction
```shell
export CUDA_VISIBLE_DEVICES=0
python audio_tag.py --device gpu --wav ./cat.wav --sample_duration 2 --hop_duration 0.3 --output_dir ./output_dir
```
Configurable arguments:
- `device`: device to run on, cpu or gpu; defaults to gpu. When using gpu, select the card via `CUDA_VISIBLE_DEVICES` as in the command above.
- `wav`: audio file to run prediction on.
- `sample_duration`: length (in seconds) of each audio segment the model predicts on; defaults to 2s.
- `hop_duration`: time interval (in seconds) between two consecutive segments; defaults to 0.3s.
- `output_dir`: directory where the prediction results are saved; defaults to `./output_dir`.
The example code uses the `CNN14` pretrained model. To switch to another pretrained model, use the following:
```python
from paddleaudio.models.panns import cnn14, cnn10, cnn6
# CNN14
model = cnn14(pretrained=True, extract_embedding=False)
# CNN10
model = cnn10(pretrained=True, extract_embedding=False)
# CNN6
model = cnn6(pretrained=True, extract_embedding=False)
```
Output:
```
[2021-04-30 19:15:41,025] [ INFO] - Saved tagging results to ./output_dir/audioset_tagging_sr_44100.npz
```
After execution, the scores are saved to an `.npz` file in `output_dir`.
### Generating the tagging label text
```shell
python parse_result.py --tagging_file ./output_dir/audioset_tagging_sr_44100.npz --top_k 10 --smooth True --smooth_size 5 --label_file ./assets/audioset_labels.txt --output_dir ./output_dir
```
Configurable arguments:
- `tagging_file`: file containing the model prediction results.
- `top_k`: number of highest-scoring labels to keep from the predictions; defaults to 10.
- `smooth`: whether to apply posterior smoothing to the predicted probabilities; defaults to True (smoothing applied). A minimal sketch of the smoothing is shown after this list.
- `smooth_size`: number of samples in the smoothing window; defaults to 5.
- `label_file`: text file of the Audioset classes corresponding to the model outputs.
- `output_dir`: directory where the label text is saved; defaults to `./output_dir`.
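For reference, here is a minimal NumPy sketch of the posterior smoothing idea used by `parse_result.py`: each frame's scores are replaced by the average over a trailing window of at most `smooth_size` frames. The `(num_frames, num_classes)` shape and the helper name `smooth_scores` are illustrative assumptions, not part of the script.
```python
import numpy as np

def smooth_scores(scores: np.ndarray, win_size: int) -> np.ndarray:
    """Trailing moving average over the time axis of a (num_frames, num_classes) array."""
    smoothed = np.empty_like(scores)
    for i in range(len(scores)):
        left = max(0, i + 1 - win_size)        # window is shorter near the beginning
        smoothed[i] = scores[left:i + 1].mean(axis=0)
    return smoothed

# Toy usage: 100 frames x 527 Audioset classes of random scores.
fake_scores = np.random.rand(100, 527).astype('float32')
print(smooth_scores(fake_scores, win_size=5).shape)   # (100, 527)
```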
Output:
```
[2021-04-30 19:26:58,743] [ INFO] - Posterior smoothing...
[2021-04-30 19:26:58,746] [ INFO] - Saved tagging labels to ./output_dir/audioset_tagging_sr_44100.txt
```
After execution, the text results are saved to a `.txt` file in `output_dir`.
## Tagging label text
The final text output looks like the following.
The top k results for each time range are separated by a blank line. Within each block, the first line carries the time information: the number marks the starting point of the tagging result as the ratio between the current time `t` and the total audio length `T`; the following k lines are the corresponding labels and scores. A small parsing sketch is given after the example.
```
0.0
Cat: 0.9144676923751831
Animal: 0.8855036497116089
Domestic animals, pets: 0.804577112197876
Meow: 0.7422927021980286
Music: 0.19959309697151184
Inside, small room: 0.12550437450408936
Caterwaul: 0.021584441885352135
Purr: 0.020247288048267365
Speech: 0.018197158351540565
Vehicle: 0.007446660194545984
0.059197544398158296
Cat: 0.9250872135162354
Animal: 0.8957151174545288
Domestic animals, pets: 0.8228275775909424
Meow: 0.7650775909423828
Music: 0.20210561156272888
Inside, small room: 0.12290887534618378
Caterwaul: 0.029371455311775208
Purr: 0.018731823191046715
Speech: 0.017130598425865173
Vehicle: 0.007748497650027275
0.11839508879631659
Cat: 0.9336574673652649
Animal: 0.9111202359199524
Domestic animals, pets: 0.8349071145057678
Meow: 0.7761964797973633
Music: 0.20467285811901093
Inside, small room: 0.10709915310144424
Caterwaul: 0.05370649695396423
Purr: 0.018830426037311554
Speech: 0.017361722886562347
Vehicle: 0.006929398979991674
...
...
```
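For convenience, here is a minimal parsing sketch (an illustration, not one of the example scripts) that reads the text output above back into Python structures; the function name and path are assumptions:
```python
from typing import Dict, List, Tuple

def parse_tagging_text(path: str) -> List[Tuple[float, Dict[str, float]]]:
    """Parse blank-line-separated blocks of '<time ratio>' followed by 'label: score' lines."""
    results = []
    with open(path) as f:
        blocks = f.read().strip().split('\n\n')
    for block in blocks:
        lines = block.strip().split('\n')
        time_ratio = float(lines[0])
        scores = {}
        for line in lines[1:]:
            label, score = line.rsplit(':', 1)  # labels may contain commas but not colons
            scores[label.strip()] = float(score)
        results.append((time_ratio, scores))
    return results

# Usage (path is hypothetical):
# for t, scores in parse_tagging_text('./output_dir/audioset_tagging_sr_44100.txt'):
#     print(t, max(scores, key=scores.get))
```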
The following [Demo](https://bj.bcebos.com/paddleaudio/media/audio_tagging_demo.mp4) shows an example of rendering the tagging labels onto a video, performing multi-label prediction on the audio in real time.
![](https://bj.bcebos.com/paddleaudio/media/audio_tagging_demo.gif)
Speech
Male speech, man speaking
Female speech, woman speaking
Child speech, kid speaking
Conversation
Narration, monologue
Babbling
Speech synthesizer
Shout
Bellow
Whoop
Yell
Battle cry
Children shouting
Screaming
Whispering
Laughter
Baby laughter
Giggle
Snicker
Belly laugh
Chuckle, chortle
Crying, sobbing
Baby cry, infant cry
Whimper
Wail, moan
Sigh
Singing
Choir
Yodeling
Chant
Mantra
Male singing
Female singing
Child singing
Synthetic singing
Rapping
Humming
Groan
Grunt
Whistling
Breathing
Wheeze
Snoring
Gasp
Pant
Snort
Cough
Throat clearing
Sneeze
Sniff
Run
Shuffle
Walk, footsteps
Chewing, mastication
Biting
Gargling
Stomach rumble
Burping, eructation
Hiccup
Fart
Hands
Finger snapping
Clapping
Heart sounds, heartbeat
Heart murmur
Cheering
Applause
Chatter
Crowd
Hubbub, speech noise, speech babble
Children playing
Animal
Domestic animals, pets
Dog
Bark
Yip
Howl
Bow-wow
Growling
Whimper (dog)
Cat
Purr
Meow
Hiss
Caterwaul
Livestock, farm animals, working animals
Horse
Clip-clop
Neigh, whinny
Cattle, bovinae
Moo
Cowbell
Pig
Oink
Goat
Bleat
Sheep
Fowl
Chicken, rooster
Cluck
Crowing, cock-a-doodle-doo
Turkey
Gobble
Duck
Quack
Goose
Honk
Wild animals
Roaring cats (lions, tigers)
Roar
Bird
Bird vocalization, bird call, bird song
Chirp, tweet
Squawk
Pigeon, dove
Coo
Crow
Caw
Owl
Hoot
Bird flight, flapping wings
Canidae, dogs, wolves
Rodents, rats, mice
Mouse
Patter
Insect
Cricket
Mosquito
Fly, housefly
Buzz
Bee, wasp, etc.
Frog
Croak
Snake
Rattle
Whale vocalization
Music
Musical instrument
Plucked string instrument
Guitar
Electric guitar
Bass guitar
Acoustic guitar
Steel guitar, slide guitar
Tapping (guitar technique)
Strum
Banjo
Sitar
Mandolin
Zither
Ukulele
Keyboard (musical)
Piano
Electric piano
Organ
Electronic organ
Hammond organ
Synthesizer
Sampler
Harpsichord
Percussion
Drum kit
Drum machine
Drum
Snare drum
Rimshot
Drum roll
Bass drum
Timpani
Tabla
Cymbal
Hi-hat
Wood block
Tambourine
Rattle (instrument)
Maraca
Gong
Tubular bells
Mallet percussion
Marimba, xylophone
Glockenspiel
Vibraphone
Steelpan
Orchestra
Brass instrument
French horn
Trumpet
Trombone
Bowed string instrument
String section
Violin, fiddle
Pizzicato
Cello
Double bass
Wind instrument, woodwind instrument
Flute
Saxophone
Clarinet
Harp
Bell
Church bell
Jingle bell
Bicycle bell
Tuning fork
Chime
Wind chime
Change ringing (campanology)
Harmonica
Accordion
Bagpipes
Didgeridoo
Shofar
Theremin
Singing bowl
Scratching (performance technique)
Pop music
Hip hop music
Beatboxing
Rock music
Heavy metal
Punk rock
Grunge
Progressive rock
Rock and roll
Psychedelic rock
Rhythm and blues
Soul music
Reggae
Country
Swing music
Bluegrass
Funk
Folk music
Middle Eastern music
Jazz
Disco
Classical music
Opera
Electronic music
House music
Techno
Dubstep
Drum and bass
Electronica
Electronic dance music
Ambient music
Trance music
Music of Latin America
Salsa music
Flamenco
Blues
Music for children
New-age music
Vocal music
A capella
Music of Africa
Afrobeat
Christian music
Gospel music
Music of Asia
Carnatic music
Music of Bollywood
Ska
Traditional music
Independent music
Song
Background music
Theme music
Jingle (music)
Soundtrack music
Lullaby
Video game music
Christmas music
Dance music
Wedding music
Happy music
Funny music
Sad music
Tender music
Exciting music
Angry music
Scary music
Wind
Rustling leaves
Wind noise (microphone)
Thunderstorm
Thunder
Water
Rain
Raindrop
Rain on surface
Stream
Waterfall
Ocean
Waves, surf
Steam
Gurgling
Fire
Crackle
Vehicle
Boat, Water vehicle
Sailboat, sailing ship
Rowboat, canoe, kayak
Motorboat, speedboat
Ship
Motor vehicle (road)
Car
Vehicle horn, car horn, honking
Toot
Car alarm
Power windows, electric windows
Skidding
Tire squeal
Car passing by
Race car, auto racing
Truck
Air brake
Air horn, truck horn
Reversing beeps
Ice cream truck, ice cream van
Bus
Emergency vehicle
Police car (siren)
Ambulance (siren)
Fire engine, fire truck (siren)
Motorcycle
Traffic noise, roadway noise
Rail transport
Train
Train whistle
Train horn
Railroad car, train wagon
Train wheels squealing
Subway, metro, underground
Aircraft
Aircraft engine
Jet engine
Propeller, airscrew
Helicopter
Fixed-wing aircraft, airplane
Bicycle
Skateboard
Engine
Light engine (high frequency)
Dental drill, dentist's drill
Lawn mower
Chainsaw
Medium engine (mid frequency)
Heavy engine (low frequency)
Engine knocking
Engine starting
Idling
Accelerating, revving, vroom
Door
Doorbell
Ding-dong
Sliding door
Slam
Knock
Tap
Squeak
Cupboard open or close
Drawer open or close
Dishes, pots, and pans
Cutlery, silverware
Chopping (food)
Frying (food)
Microwave oven
Blender
Water tap, faucet
Sink (filling or washing)
Bathtub (filling or washing)
Hair dryer
Toilet flush
Toothbrush
Electric toothbrush
Vacuum cleaner
Zipper (clothing)
Keys jangling
Coin (dropping)
Scissors
Electric shaver, electric razor
Shuffling cards
Typing
Typewriter
Computer keyboard
Writing
Alarm
Telephone
Telephone bell ringing
Ringtone
Telephone dialing, DTMF
Dial tone
Busy signal
Alarm clock
Siren
Civil defense siren
Buzzer
Smoke detector, smoke alarm
Fire alarm
Foghorn
Whistle
Steam whistle
Mechanisms
Ratchet, pawl
Clock
Tick
Tick-tock
Gears
Pulleys
Sewing machine
Mechanical fan
Air conditioning
Cash register
Printer
Camera
Single-lens reflex camera
Tools
Hammer
Jackhammer
Sawing
Filing (rasp)
Sanding
Power tool
Drill
Explosion
Gunshot, gunfire
Machine gun
Fusillade
Artillery fire
Cap gun
Fireworks
Firecracker
Burst, pop
Eruption
Boom
Wood
Chop
Splinter
Crack
Glass
Chink, clink
Shatter
Liquid
Splash, splatter
Slosh
Squish
Drip
Pour
Trickle, dribble
Gush
Fill (with liquid)
Spray
Pump (liquid)
Stir
Boiling
Sonar
Arrow
Whoosh, swoosh, swish
Thump, thud
Thunk
Electronic tuner
Effects unit
Chorus effect
Basketball bounce
Bang
Slap, smack
Whack, thwack
Smash, crash
Breaking
Bouncing
Whip
Flap
Scratch
Scrape
Rub
Roll
Crushing
Crumpling, crinkling
Tearing
Beep, bleep
Ping
Ding
Clang
Squeal
Creak
Rustle
Whir
Clatter
Sizzle
Clicking
Clickety-clack
Rumble
Plop
Jingle, tinkle
Hum
Zing
Boing
Crunch
Silence
Sine wave
Harmonic
Chirp tone
Sound effect
Pulse
Inside, small room
Inside, large room or hall
Inside, public space
Outside, urban or manmade
Outside, rural or natural
Reverberation
Echo
Noise
Environmental noise
Static
Mains hum
Distortion
Sidetone
Cacophony
White noise
Pink noise
Throbbing
Vibration
Television
Radio
Field recording
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import ast
import os
from typing import List
import numpy as np
import paddle
from paddleaudio.backends import load as load_audio
from paddleaudio.models.panns import cnn14
from paddleaudio.transforms import LogMelSpectrogram
from paddleaudio.utils import Timer, get_logger
logger = get_logger()
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument('--device', choices=['cpu', 'gpu'], default='gpu', help='Select which device to predict, defaults to gpu.')
parser.add_argument('--wav', type=str, required=True, help='Audio file to infer.')
parser.add_argument('--sample_duration', type=float, default=2.0, help='Duration(in seconds) of tagging samples to predict.')
parser.add_argument('--hop_duration', type=float, default=0.3, help='Duration(in seconds) between two samples.')
parser.add_argument('--output_dir', type=str, default='./output_dir', help='Directory to save tagging result.')
args = parser.parse_args()
# yapf: enable
def split(waveform: np.ndarray, win_size: int, hop_size: int):
"""
Split into N waveforms.
N is decided by win_size and hop_size.
"""
assert isinstance(waveform, np.ndarray)
time = []
data = []
for i in range(0, len(waveform), hop_size):
segment = waveform[i:i + win_size]
if len(segment) < win_size:
segment = np.pad(segment, (0, win_size - len(segment)))
data.append(segment)
time.append(i / len(waveform))
return time, data
def batchify(data: List[List[float]], sample_rate: int, batch_size: int,
**kwargs):
"""
Extract features from waveforms and create batches.
"""
examples = data
# Separate the data into batches.
one_batch = []
for example in examples:
one_batch.append(example)
if len(one_batch) == batch_size:
yield one_batch
one_batch = []
if one_batch:
yield one_batch
def predict(model,
feature_extractor,
data: List[List[float]],
sample_rate: int,
batch_size: int = 1):
"""
Use pretrained model to make predictions.
"""
batches = batchify(data, sample_rate, batch_size)
results = None
model.eval()
for waveforms in batches:
feats = feature_extractor(paddle.to_tensor(waveforms))
audioset_scores = model(feats)
if results is None:
results = audioset_scores.numpy()
else:
results = np.concatenate((results, audioset_scores.numpy()))
return results
if __name__ == '__main__':
paddle.set_device(args.device)
feature_extractor = LogMelSpectrogram(
sr=16000, n_fft=512, hop_length=320, n_mels=64, f_min=50)
model = cnn14(pretrained=True, extract_embedding=False)
waveform, sr = load_audio(args.wav, sr=None)
time, data = split(waveform, int(args.sample_duration * sr),
int(args.hop_duration * sr))
results = predict(model, feature_extractor, data, sr, batch_size=8)
if not os.path.exists(args.output_dir):
os.makedirs(args.output_dir)
time = np.arange(0, 1, int(args.hop_duration * sr) / len(waveform))
output_file = os.path.join(args.output_dir, f'audioset_tagging_sr_{sr}.npz')
np.savez(output_file, time=time, scores=results)
logger.info(f'Saved tagging results to {output_file}')
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import ast
import os
from typing import Dict, List
import numpy as np
from paddleaudio.utils import logger
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument('--tagging_file', type=str, required=True, help='')
parser.add_argument('--top_k', type=int, default=10, help='Get top k predicted results of audioset labels.')
parser.add_argument('--smooth', type=ast.literal_eval, default=True, help='Set "True" to apply posterior smoothing.')
parser.add_argument('--smooth_size', type=int, default=5, help='Window size of posterior smoothing.')
parser.add_argument('--label_file', type=str, default='./assets/audioset_labels.txt', help='File of audioset labels.')
parser.add_argument('--output_dir', type=str, default='./output_dir', help='Directory to save tagging labels.')
args = parser.parse_args()
# yapf: enable
def smooth(results: np.ndarray, win_size: int):
"""
Execute posterior smoothing in-place.
"""
for i in range(len(results) - 1, -1, -1):
if i < win_size - 1:
left = 0
else:
left = i + 1 - win_size
results[i] = np.sum(results[left:i + 1], axis=0) / (i - left + 1)
def generate_topk_label(k: int, label_map: Dict, result: np.ndarray):
"""
Return top k result.
"""
result = np.asarray(result)
topk_idx = (-result).argsort()[:k]
ret = ''
for idx in topk_idx:
label, score = label_map[idx], result[idx]
ret += f'{label}: {score}\n'
return ret
if __name__ == "__main__":
label_map = {}
with open(args.label_file, 'r') as f:
for i, l in enumerate(f.readlines()):
label_map[i] = l.strip()
results = np.load(args.tagging_file, allow_pickle=True)
times, scores = results['time'], results['scores']
if args.smooth:
logger.info('Posterior smoothing...')
smooth(scores, win_size=args.smooth_size)
if not os.path.exists(args.output_dir):
os.makedirs(args.output_dir)
output_file = os.path.join(
args.output_dir,
os.path.basename(args.tagging_file).split('.')[0] + '.txt')
with open(output_file, 'w') as f:
for time, score in zip(times, scores):
f.write(f'{time}\n')
f.write(generate_topk_label(args.top_k, label_map, score) + '\n')
logger.info(f'Saved tagging labels to {output_file}')
# Sound Classification
Sound classification and detection is an active research direction in audio processing.
For sound classification, a common traditional machine-learning approach is to hand-craft a variety of time-domain and frequency-domain audio features, apply feature selection, combination and transformation, and then classify with an SVM or a decision tree. End-to-end deep learning, in contrast, typically uses deep networks such as RNNs or CNNs to perform representation learning and classification directly on the waveform or on time-frequency features.
At IEEE ICASSP 2017, Google released the large-scale audio dataset [Audioset](https://research.google.com/audioset/). The dataset covers 632 audio classes and contains 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. It currently provides about 2.1 million annotated videos and 5,800 hours of audio, and the labeled sound samples use 527 label classes.
`PANNs` ([PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition](https://arxiv.org/pdf/1912.10211.pdf)) are sound classification/recognition models trained on the Audioset dataset. After pretraining, the models can be used to extract audio embeddings. This example fine-tunes a pretrained `PANNs` model to perform sound classification.
## Model Introduction
PaddleAudio provides pretrained CNN14, CNN10 and CNN6 models of PANNs for users to choose from:
- CNN14: mainly 12 convolutional layers and 2 fully connected layers, 79.6M parameters, embedding size 2048.
- CNN10: mainly 8 convolutional layers and 2 fully connected layers, 4.9M parameters, embedding size 512.
- CNN6: mainly 4 convolutional layers and 2 fully connected layers, 4.5M parameters, embedding size 512.
## Quick Start
### Model Training
Using the environmental sound classification dataset `ESC50` as an example, run the command below to fine-tune the model on its training set. Both single-card training and multi-card distributed training on a single machine are supported. For how to launch distributed training with `paddle.distributed.launch`, see [PaddlePaddle 2.0 distributed training](https://www.paddlepaddle.org.cn/documentation/docs/zh/tutorial/quick_start/high_level_api/high_level_api.html#danjiduoka).
```shell
$ unset CUDA_VISIBLE_DEVICES
$ python -m paddle.distributed.launch --gpus "0" train.py --device gpu --epochs 50 --batch_size 16 --num_worker 16 --checkpoint_dir ./checkpoint --save_freq 10
```
Configurable arguments:
- `device`: device used for training, cpu or gpu; defaults to gpu. When training on gpu, the GPU card ids are specified by the `--gpus` argument of `paddle.distributed.launch`.
- `epochs`: number of training epochs; defaults to 50.
- `learning_rate`: learning rate for fine-tuning; defaults to 5e-5.
- `batch_size`: batch size; tune it to the available GPU memory and lower it if you run out of memory; defaults to 16.
- `num_workers`: number of subprocesses the DataLoader uses to fetch data; defaults to 0, in which case data loading runs in the main process.
- `checkpoint_dir`: directory where model and optimizer parameter files are saved; defaults to `./checkpoint`.
- `save_freq`: how often (in epochs) the model is saved during training; defaults to 10.
- `log_freq`: how often (in steps) training information is printed; defaults to 10.
The example code uses the `CNN14` pretrained model. To switch to another pretrained model, use the following:
```python
from model import SoundClassifier
from paddleaudio.datasets import ESC50
from paddleaudio.models.panns import cnn14, cnn10, cnn6
# CNN14
backbone = cnn14(pretrained=True, extract_embedding=True)
model = SoundClassifier(backbone, num_class=len(ESC50.label_list))
# CNN10
backbone = cnn10(pretrained=True, extract_embedding=True)
model = SoundClassifier(backbone, num_class=len(ESC50.label_list))
# CNN6
backbone = cnn6(pretrained=True, extract_embedding=True)
model = SoundClassifier(backbone, num_class=len(ESC50.label_list))
```
### Model Prediction
```shell
export CUDA_VISIBLE_DEVICES=0
python -u predict.py --device gpu --wav ./dog.wav --top_k 3 --checkpoint ./checkpoint/epoch_50/model.pdparams
```
Configurable arguments:
- `device`: device used for prediction, cpu or gpu; defaults to gpu. When using gpu, select the card via `CUDA_VISIBLE_DEVICES` as in the command above.
- `wav`: audio file to run prediction on.
- `top_k`: number of top-scoring labels (with scores) to show; defaults to 1.
- `checkpoint`: checkpoint file of the model parameters.
The prediction output looks like:
```
[/audio/dog.wav]
Dog: 0.9999538660049438
Clock tick: 1.3341237718123011e-05
Cat: 6.579841738130199e-06
```
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
class SoundClassifier(nn.Layer):
"""
Model for sound classification which uses panns pretrained models to extract
embeddings from audio files.
"""
def __init__(self, backbone, num_class, dropout=0.1):
super(SoundClassifier, self).__init__()
self.backbone = backbone
self.dropout = nn.Dropout(dropout)
self.fc = nn.Linear(self.backbone.emb_size, num_class)
def forward(self, x):
# x: (batch_size, num_frames, num_melbins) -> (batch_size, 1, num_frames, num_melbins)
x = x.unsqueeze(1)
x = self.backbone(x)
x = self.dropout(x)
logits = self.fc(x)
return logits
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import ast
import os
import numpy as np
import paddle
import paddle.nn.functional as F
from model import SoundClassifier
from paddleaudio.backends import load as load_audio
from paddleaudio.datasets import ESC50
from paddleaudio.models.panns import cnn14
from paddleaudio.transforms import LogMelSpectrogram
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to predict, defaults to gpu.")
parser.add_argument("--wav", type=str, required=True, help="Audio file to infer.")
parser.add_argument("--top_k", type=int, default=1, help="Show top k predicted results")
parser.add_argument("--checkpoint", type=str, required=True, help="Checkpoint of model.")
args = parser.parse_args()
# yapf: enable
if __name__ == '__main__':
paddle.set_device(args.device)
feature_extractor = LogMelSpectrogram(
sr=16000, n_fft=512, hop_length=320, n_mels=64, f_min=50)
model = SoundClassifier(
backbone=cnn14(pretrained=False, extract_embedding=True),
num_class=len(ESC50.label_list))
model.set_state_dict(paddle.load(args.checkpoint))
model.eval()
waveform, sr = load_audio(args.wav)
feats = feature_extractor(paddle.to_tensor(waveform).unsqueeze(0))
logits = model(feats)
probs = F.softmax(logits, axis=1).numpy()
sorted_indices = (-probs[0]).argsort()
msg = f'[{args.wav}]\n'
for idx in sorted_indices[:args.top_k]:
msg += f'{ESC50.label_list[idx]}: {probs[0][idx]}\n'
print(msg)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import ast
import os
import paddle
import paddle.nn.functional as F
from model import SoundClassifier
from paddleaudio.datasets import ESC50
from paddleaudio.models.panns import cnn14
from paddleaudio.transforms import LogMelSpectrogram
from paddleaudio.utils import Timer, get_logger
logger = get_logger()
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.")
parser.add_argument("--epochs", type=int, default=50, help="Number of epoches for fine-tuning.")
parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train with warmup.")
parser.add_argument("--batch_size", type=int, default=16, help="Total examples' number in batch for training.")
parser.add_argument("--num_workers", type=int, default=0, help="Number of workers in dataloader.")
parser.add_argument("--checkpoint_dir", type=str, default='./checkpoint', help="Directory to save model checkpoints.")
parser.add_argument("--save_freq", type=int, default=10, help="Save checkpoint every n epoch.")
parser.add_argument("--log_freq", type=int, default=10, help="Log the training infomation every n steps.")
args = parser.parse_args()
# yapf: enable
if __name__ == "__main__":
paddle.set_device(args.device)
nranks = paddle.distributed.get_world_size()
local_rank = paddle.distributed.get_rank()
feature_extractor = LogMelSpectrogram(
sr=16000, n_fft=512, hop_length=320, n_mels=64, f_min=50)
backbone = cnn14(pretrained=True, extract_embedding=True)
model = SoundClassifier(backbone, num_class=len(ESC50.label_list))
optimizer = paddle.optimizer.Adam(
learning_rate=args.learning_rate, parameters=model.parameters())
criterion = paddle.nn.loss.CrossEntropyLoss()
train_ds = ESC50(mode='train')
dev_ds = ESC50(mode='dev')
train_sampler = paddle.io.DistributedBatchSampler(
train_ds, batch_size=args.batch_size, shuffle=True, drop_last=False)
train_loader = paddle.io.DataLoader(
train_ds,
batch_sampler=train_sampler,
num_workers=args.num_workers,
return_list=True,
use_buffer_reader=True,
)
steps_per_epoch = len(train_sampler)
timer = Timer(steps_per_epoch * args.epochs)
timer.start()
for epoch in range(1, args.epochs + 1):
model.train()
avg_loss = 0
num_corrects = 0
num_samples = 0
for batch_idx, batch in enumerate(train_loader):
waveforms, labels = batch
feats = feature_extractor(waveforms)
logits = model(feats)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
if isinstance(optimizer._learning_rate,
paddle.optimizer.lr.LRScheduler):
optimizer._learning_rate.step()
optimizer.clear_grad()
# Calculate loss
avg_loss += loss.numpy()[0]
# Calculate metrics
preds = paddle.argmax(logits, axis=1)
num_corrects += (preds == labels).numpy().sum()
num_samples += feats.shape[0]
timer.count()
if (batch_idx + 1) % args.log_freq == 0 and local_rank == 0:
lr = optimizer.get_lr()
avg_loss /= args.log_freq
avg_acc = num_corrects / num_samples
print_msg = 'Epoch={}/{}, Step={}/{}'.format(
epoch, args.epochs, batch_idx + 1, steps_per_epoch)
print_msg += ' loss={:.4f}'.format(avg_loss)
print_msg += ' acc={:.4f}'.format(avg_acc)
print_msg += ' lr={:.6f} step/sec={:.2f} | ETA {}'.format(
lr, timer.timing, timer.eta)
logger.train(print_msg)
avg_loss = 0
num_corrects = 0
num_samples = 0
if epoch % args.save_freq == 0 and batch_idx + 1 == steps_per_epoch and local_rank == 0:
dev_sampler = paddle.io.BatchSampler(
dev_ds,
batch_size=args.batch_size,
shuffle=False,
drop_last=False)
dev_loader = paddle.io.DataLoader(
dev_ds,
batch_sampler=dev_sampler,
num_workers=args.num_workers,
return_list=True,
)
model.eval()
num_corrects = 0
num_samples = 0
with logger.processing('Evaluation on validation dataset'):
for batch_idx, batch in enumerate(dev_loader):
waveforms, labels = batch
feats = feature_extractor(waveforms)
logits = model(feats)
preds = paddle.argmax(logits, axis=1)
num_corrects += (preds == labels).numpy().sum()
num_samples += feats.shape[0]
print_msg = '[Evaluation result]'
print_msg += ' dev_acc={:.4f}'.format(num_corrects / num_samples)
logger.eval(print_msg)
# Save model
save_dir = os.path.join(args.checkpoint_dir,
'epoch_{}'.format(epoch))
logger.info('Saving model checkpoint to {}'.format(save_dir))
paddle.save(model.state_dict(),
os.path.join(save_dir, 'model.pdparams'))
paddle.save(optimizer.state_dict(),
os.path.join(save_dir, 'model.pdopt'))
repos:
- repo: https://github.com/PaddlePaddle/mirrors-yapf.git
rev: 0d79c0c469bab64f7229c9aca2b1186ef47f0e37
hooks:
- id: yapf
files: \.py$
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: a11d9314b22d8f8c7556443875b731ef05965464
hooks:
- id: check-merge-conflict
- id: check-symlinks
- id: detect-private-key
files: (?!.*paddle)^.*$
- id: end-of-file-fixer
files: \.md$
- id: trailing-whitespace
files: \.md$
- repo: https://github.com/Lucas-C/pre-commit-hooks
rev: v1.0.1
hooks:
- id: forbid-crlf
files: \.md$
- id: remove-crlf
files: \.md$
- id: forbid-tabs
files: \.md$
- id: remove-tabs
files: \.md$
- repo: local
hooks:
- id: clang-format
name: clang-format
description: Format files with ClangFormat
entry: bash .clang_format.hook -i
language: system
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|cuh|proto)$
[style]
based_on_style = pep8
column_limit = 80
# Speaker verification using ResnetSE and ECAPA-TDNN
## Introduction
In this example, we demonstrate how to use PaddleAudio to train two types of networks for speaker verification.
The networks supported here are
- Resnet34 with Squeeze-and-excite block \[1\] to adaptively re-weight the feature maps.
- ECAPA-TDNN \[2\]
## Requirements
Install the requirements via
```
# install paddleaudio
git clone https://github.com/PaddlePaddle/models.git
cd models/PaddleAudio
pip install -e .
```
Then install additional requirements by
```
cd examples/speaker
pip install -r requirements.txt
```
## Training
### Training datasets
As in other open-source speaker verification recipes, we use the dev split of [VoxCeleb 1](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html), which consists of `1,211` speakers, and the dev split of [VoxCeleb 2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html), which consists of `5,994` speakers, for training. In total there are `7,205` speakers in our training set.
Please download the two datasets from the [official website](https://www.robots.ox.ac.uk/~vgg/data/voxceleb) and unzip all audio into a folder, e.g., `./data/voxceleb/`. Make sure there are `7,205` speaker subfolders (named `id*****`) under that folder. You don't need to further process the data, because all data processing such as adding noise / reverberation / speed perturbation will be done on the fly. However, to speed up audio decoding, you can manually convert the m4a files in VoxCeleb 2 to wav format, at the expense of using more storage; a conversion sketch follows.
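If you decide to pre-convert the files, a minimal sketch (an assumption, not part of this recipe) using the `ffmpeg` command-line tool could look like the following; the folder path and target sample rate are assumptions:
``` python
import glob
import os
import subprocess

def convert_m4a_to_wav(root_dir: str, sample_rate: int = 16000) -> None:
    """Convert every .m4a under root_dir to a mono wav file stored next to it."""
    for m4a in glob.glob(os.path.join(root_dir, '**', '*.m4a'), recursive=True):
        wav = os.path.splitext(m4a)[0] + '.wav'
        if os.path.exists(wav):
            continue                      # skip files converted in a previous run
        subprocess.run(
            ['ffmpeg', '-y', '-i', m4a, '-ar', str(sample_rate), '-ac', '1', wav],
            check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

# Usage (path is an assumption):
# convert_m4a_to_wav('./data/voxceleb/')
```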
Finally, create a txt file that contains the list of audios for training by
```
cd ./data/voxceleb/
find `pwd`/ -type f > vox_files.txt
```
### Augmentation datasets
The following datasets are required for dataset augmentation
- [Room Impulse Response and Noise Database](https://openslr.org/28/)
- [MUSAN](https://openslr.org/17/)
For the RIR dataset, you must list all audio files under the folder `RIRS_NOISES/simulated_rirs/` into a text file, e.g., `data/rir.list`, and set it as `rir_path` in the `config.yaml` file.
Likewise, you have to configure the following fields in the config file for noise augmentation (a sketch for generating these file lists is given after the YAML block):
``` yaml
muse_speech: <musan_split/speech.list> #replace with your actual path
muse_speech_srn_high: 15.0
muse_speech_srn_low: 12.0
muse_music: <musan_split/music.list> #replace with your actual path
muse_music_srn_high: 15.0
muse_music_srn_low: 5.0
muse_noise: <musan_split/noise.list> #replace with your actual path
muse_noise_srn_high: 15
muse_noise_srn_low: 5.0
```
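As a convenience, here is a minimal sketch (an assumption, not part of this recipe) for producing such list files, e.g. `data/rir.list` or the MUSAN split lists above; the folder layout is an assumption:
``` python
import glob
import os

def write_file_list(root_dir: str, list_file: str, ext: str = '.wav') -> None:
    """Recursively collect audio files under root_dir and write their absolute paths, one per line."""
    files = sorted(glob.glob(os.path.join(root_dir, '**', '*' + ext), recursive=True))
    with open(list_file, 'w') as f:
        f.write('\n'.join(os.path.abspath(p) for p in files))
    print(f'wrote {len(files)} paths to {list_file}')

# Example (paths are assumptions; adjust to your layout):
# write_file_list('data/RIRS_NOISES/simulated_rirs', 'data/rir.list')
# write_file_list('data/musan_split/noise', 'data/musan_split/noise.list')
```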
To train your model from scratch, first create a folder (workspace) by
``` bash
cd egs
mkdir <your_example>
cd <your_example>
cp ../resnet/config.yaml . #Copy an example config to your workspace
```
Then adjust the config file so that all audio files can be correctly located (including the files used for data augmentation). You can also change the training and model hyper-parameters to suit your needs.
Finally, start training by
``` bash
python ../../train.py -c config.yaml -d gpu:0
```
## Testing
### <a name="test_dataset"></a>Testing datasets
The test split of VoxCeleb 1 is used to measure speaker verification performance during training and after training completes. You will need to download the data and unzip it into a folder, e.g., `./data/voxceleb/test/`.
Then download the text file that lists the utterance pairs to compare, together with the true labels indicating whether the utterances come from the same speaker. There are multiple trial lists; we use [veri_test2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt).
To start testing, first download the checkpoints for resnet or ecapa-tdnn,
| checkpoint |size| eer |
| --------------- | --------------- | --------------- |
| [ResnetSE34 + SAP + CMSoftmax](https://bj.bcebos.com/paddleaudio/models/speaker/resnetse34_epoch92_eer0.00931.pdparams) |26MB | 0.93%|
| [ecapa-tdnn + AAMSoftmax ](https://bj.bcebos.com/paddleaudio/models/speaker/tdnn_amsoftmax_epoch51_eer0.011.pdparams)| 80MB |1.10%|
Then prepare the test dataset as described in [Testing datasets](#test_dataset), and set the following paths in the config file:
``` yaml
mean_std_file: ../../data/stat.pd
test_list: ../../data/veri_test2.txt
test_folder: ../../data/voxceleb1/
```
To compute the EER with the ResNet model, run:
``` bash
cd egs/resnet/
python ../../test.py -w <checkpoint path> -c config.yaml -d gpu:0
```
which should give an EER of 0.00931 (0.93%).
For ECAPA-TDNN, run:
``` bash
cd egs/ecapa-tdnn/
python ../../test.py -w <checkpoint path> -c config.yaml -d gpu:0
```
which gives an EER of about 0.0105. A minimal sketch of how EER can be computed from verification scores is shown below.
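For reference, here is a minimal sketch (an assumption about the scoring pipeline, not the project's `test.py`) of how EER can be computed from similarity scores and the 0/1 labels of a trial list such as `veri_test2.txt`, using scikit-learn's ROC curve:
``` python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER is the error rate at the threshold where false-accept and false-reject rates are equal."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))     # point where the two error rates cross
    return float((fpr[idx] + fnr[idx]) / 2)

# Toy usage with synthetic scores (real scores come from comparing speaker embeddings):
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = labels + rng.normal(scale=0.8, size=1000)   # same-speaker pairs score higher
print(f'EER: {compute_eer(labels, scores):.4f}')
```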
## Results
We compare our results with [voxceleb_trainer](https://github.com/clovaai/voxceleb_trainer).
### Pretrained model of voxceleb_trainer
The test list is veri_test2.txt, which can be downloaded from [VoxCeleb1 (cleaned)](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt).
| model |config|checkpoint |eval frames| eer |
| --------------- | --------------- | --------------- |--------------- |--------------- |
| ResnetSE34 + ASP + softmaxproto| - | [baseline_v2_ap](http://www.robots.ox.ac.uk/~joon/data/baseline_v2_ap.model)|400|1.06%|
| ResnetSE34 + ASP + softmaxproto| - | [baseline_v2_ap](http://www.robots.ox.ac.uk/~joon/data/baseline_v2_ap.model)|all|1.18%|
### This example
| model |config|checkpoint |eval frames| eer |
| --------------- | --------------- | --------------- |--------------- |--------------- |
| ResnetSE34 + SAP + CMSoftmax| [config.yaml](./egs/resnet/config.yaml) |[checkpoint](https://bj.bcebos.com/paddleaudio/models/speaker/resnetse34_epoch92_eer0.00931.pdparams) | all|0.93%|
| ECAPA-TDNN + AAMSoftmax | [config.yaml](./egs/ecapa-tdnn/config.yaml) | [checkpoint](https://bj.bcebos.com/paddleaudio/models/speaker/tdnn_amsoftmax_epoch51_eer0.011.pdparams) | all|1.10%|
## References
[1] Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 7132-7141
[2] Desplanques B, Thienpondt J, Demuynck K. Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification[J]. arXiv preprint arXiv:2005.07143, 2020.
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import glob
import json
import os
import pickle
import random
import subprocess
import time
import warnings
import numpy as np
import paddle
import paddleaudio
import yaml
#from paddle.io import DataLoader, Dataset, IterableDataset
from paddle.utils import download
from paddleaudio.utils import augments, get_logger
logger = get_logger(__file__)
def random_choice(a):
i = np.random.randint(0, high=len(a))
return a[int(i)]
def read_scp(file):
lines = open(file).read().split('\n')
keys = [l.split()[0] for l in lines if l.startswith('id')]
speakers = [l.split()[0].split('-')[0] for l in lines if l.startswith('id')]
files = [l.split()[1] for l in lines if l.startswith('id')]
return keys, speakers, files
def read_list(file):
lines = open(file).read().split('\n')
keys = [
'-'.join(l.split('/')[-3:]).split('.')[0] for l in lines
if l.startswith('/')
]
speakers = [k.split('-')[0] for k in keys]
files = [l for l in lines if l.startswith('/')]
return keys, speakers, files
class Dataset(paddle.io.Dataset):
"""
Dataset class for Audioset, with mel features stored in multiple hdf5 files.
The h5 files store mel-spectrogram features pre-extracted from wav files.
Use wav2mel.py to do feature extraction.
"""
def __init__(self,
scp,
keys=None,
sample_rate=16000,
duration=None,
augment=True,
speaker_set=None,
augment_prob=0.5,
training=True,
balanced_sampling=False):
super(Dataset, self).__init__()
self.keys, self.speakers, self.files = read_list(scp)
self.key2file = {k: f for k, f in zip(self.keys, self.files)}
self.n_files = len(self.files)
if speaker_set:
if isinstance(speaker_set, str):
with open(speaker_set) as f:
self.speaker_set = f.read().split('\n')
print(self.speaker_set[:10])
else:
self.speaker_set = speaker_set
else:
self.speaker_set = list(set(self.speakers))
self.speaker_set.sort()
self.spk2cls = {s: i for i, s in enumerate(self.speaker_set)}
self.n_class = len(self.speaker_set)
logger.info(f'speaker size: {self.n_class}')
logger.info(f'file size: {self.n_files}')
self.augment = augment
self.augment_prob = augment_prob
self.training = training
self.sample_rate = sample_rate
self.balanced_sampling = balanced_sampling
self.duration = duration
if augment:
assert duration, 'if augment is True, duration must not be None'
if self.duration:
self.duration = int(self.sample_rate * self.duration)
if keys is not None:
if isinstance(keys, list):
self.keys = keys
elif isinstance(keys, str):
with open(keys) as f:
self.keys = f.read().split('\n')
self.keys = [k for k in self.keys if k.startswith('id')]
logger.info(f'using {len(self.keys)} keys')
def __getitem__(self, idx):
idx = idx % len(self.keys)
key = self.keys[idx]
spk = key.split('-')[0]
cls_idx = self.spk2cls[spk]
file = self.key2file[key]
file_duration = None
if not self.augment and self.duration:
file_duration = self.duration
while True:
try:
wav, sr = paddleaudio.load(file,
sr=self.sample_rate,
duration=file_duration)
break
except Exception:
# loading failed; retry with another randomly chosen utterance
print(f'error loading file {file}')
idx = np.random.randint(0, len(self.keys))
key = self.keys[idx]
spk = key.split('-')[0]
cls_idx = self.spk2cls[spk]
file = self.key2file[key]
speed = random.choice([0, 1, 2])
if speed == 1:
wav = paddleaudio.resample(wav, 16000, 16000 * 0.9)
cls_idx = cls_idx * 3 + 1
elif speed == 2:
wav = paddleaudio.resample(wav, 16000, 16000 * 1.1)
cls_idx = cls_idx * 3 + 2
else:
cls_idx = cls_idx * 3
if self.augment:
wav = augments.random_crop_or_pad1d(wav, self.duration)
elif self.duration:
wav = augments.center_crop_or_pad1d(wav, self.duration)
return wav, cls_idx
def __len__(self):
return len(self.keys)
def worker_init(worker_id):
time.sleep(worker_id / 32)
seed = int(time.time()) % 10000 + worker_id
np.random.seed(seed)
random.seed(seed)
paddle.seed(seed)
def get_train_loader(config):
dataset = Dataset(config['spk_scp'],
keys=config['train_keys'],
speaker_set=config['speaker_set'],
augment=True,
duration=config['duration'])
train_loader = paddle.io.DataLoader(dataset,
shuffle=True,
batch_size=config['batch_size'],
drop_last=True,
num_workers=config['num_workers'],
use_buffer_reader=True,
use_shared_memory=True,
worker_init_fn=worker_init)
return train_loader
def get_val_loader(config):
dataset = Dataset(config['spk_scp'],
keys=config['val_keys'],
speaker_set=config['speaker_set'],
augment=False,
duration=config['duration'])
val_loader = paddle.io.DataLoader(dataset,
shuffle=False,
batch_size=config['val_batch_size'],
drop_last=False,
num_workers=config['num_workers'])
return val_loader
if __name__ == '__main__':
# do some testing here
with open('config.yaml') as f:
config = yaml.safe_load(f)
train_loader = get_train_loader(config)
# val_loader = get_val_loader(config)
for i, (x, y) in enumerate(train_loader()):
print(x, y)
break
# for i, (x, y) in enumerate(val_loader()):
# print(x, y)
# break
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
fbank:
sr: 16000 # sample rate
n_fft: 512
win_length: 400 #25ms
hop_length: 160 #10ms
n_mels: 80
f_min: 20
f_max: 7600
window: hann
amin: !!float 1e-5
top_db: 75
augment_with_sox: False
augment_mel: False
augment_wav: True
rir_path: ../../data/RIRS_NOISES/rir.list
muse_speech: ../../data/musan_split/speech.list
muse_speech_srn_high: 15.0
muse_speech_srn_low: 12.0
muse_music: ../../data/musan_split/music.list
muse_music_srn_high: 10.0
muse_music_srn_low: 5.0
muse_noise: ../../data/musan_split/noise.list
muse_noise_srn_high: 10
muse_noise_srn_low: 5.0
batch_size: 64
val_batch_size: 16
num_workers: 16
num_classes: 7205
duration: 5
balanced_sampling: False
epoch_num: 500
max_lr: !!float 1e-05
base_lr: !!float 1e-04
reverse_lr: True
half_cycle: 10 # epoch
# set the data path accordingly
spk_scp: ../../data/voxceleb1and2_list.txt
mean_std_file: ../../data/stat.pd
#for testing
test_list: ../../data/veri_test2.txt
test_folder: ../../data/voxceleb1/
speaker_set: ../../data/speaker_set_vox12.txt
train_keys: ~
model_dir : ./checkpoints/
model_prefix: 'tdnn'
log_dir : ./log/
log_file : ./log.txt
log_step: 10
checkpoint_step : 5000
eval_step: 10000
max_time_mask: 2
max_freq_mask: 1
max_time_mask_width: 20
max_freq_mask_width: 10
model:
name: EcapaTDNN
params:
input_size: 80 # should be the same as in the fbank config
normalize: True
loss:
name: AdditiveAngularMargin
params:
margin: 0.3
scale: 30.0
easy_margin: False
feature_dim: 192
n_classes: 22506
# loss:
# name: CMSoftmax
# params:
# margin: 0.10
# margin2: 0.10
# scale: 30.0
# feature_dim: 256
# n_classes: 22506
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
fbank:
sr: 16000 # sample rate
n_fft: 512
win_length: 400 #25ms
hop_length: 160 #10ms
n_mels: 80
f_min: 20
f_max: 7600
window: hann
amin: !!float 1e-5
top_db: 75
augment_with_sox: False
augment_mel: False
augment_wav: True
rir_path: ../../data/RIRS_NOISES/rir.list
muse_speech: ../../data/musan_split/speech.list
muse_speech_srn_high: 15.0
muse_speech_srn_low: 12.0
muse_music: ../../data/musan_split/music.list
muse_music_srn_high: 15.0
muse_music_srn_low: 5.0
muse_noise: ../../data/musan_split/noise.list
muse_noise_srn_high: 15
muse_noise_srn_low: 5.0
freeze_param: False
freezed_layers: -8
batch_size: 64
val_batch_size: 16
num_workers: 32
num_classes: 7205
duration: 5
balanced_sampling: False
epoch_num: 500
max_lr: !!float 1e-06
base_lr: !!float 1e-04
reverse_lr: False
half_cycle: 10 #epoch
# set the data path accordingly
spk_scp: ../../data/voxceleb1and2_list.txt
speaker_set: ../../data/speaker_set.txt
train_keys: ~
# for testing
mean_std_file: ../../data/stat.pd
test_list: ../../data/veri_test2.txt
test_folder: ../../data/voxceleb1/
model_dir : ./checkpoints/
model_prefix: 'resnet'
log_dir : ./log/
log_file : ./log.txt
log_step: 10
checkpoint_step : 5000
eval_step: 600
max_time_mask: 2
max_freq_mask: 1
max_time_mask_width: 20
max_freq_mask_width: 10
model:
name: ResNetSE34V2 # or ResNetSE34
params:
feature_dim: 256
scale_factor: 1
encoder_type: SAP # alternative: ASP (attentive statistics pooling)
n_mels: 80 # should be the same as in the fbank config
normalize: True
# loss:
# name: AdditiveAngularMargin
# params:
# margin: 0.35
# scale: 30.0
# easy_margin: False
# feature_dim: 192
# n_classes: 22506
loss:
name: CMSoftmax
params:
margin: 0.10
margin2: 0.10
scale: 30.0
feature_dim: 256
n_classes: 22506
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import numpy as np
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
__all__ = ['ProtoTypical', 'AMSoftmaxLoss', 'CMSoftmax']
class AMSoftmaxLoss(nn.Layer):
"""Additive margin softmax loss.
Additive margin softmax loss is useful for training neural networks for speaker recognition/verification.
Notes:
The loss itself contains parameters that need to be passed to the optimizer for gradient descent.
References:
Wang, Feng, et al. “Additive Margin Softmax for Face Verification.”
IEEE Signal Processing Letters, vol. 25, no. 7, 2018, pp. 926–930.
"""
def __init__(self,
feature_dim: int,
n_classes: int,
eps: float = 1e-5,
margin: float = 0.3,
scale: float = 30.0):
super(AMSoftmaxLoss, self).__init__()
self.w = paddle.create_parameter((feature_dim, n_classes), 'float32')
self.eps = eps
self.scale = scale
self.margin = margin
self.nll_loss = nn.NLLLoss()
self.n_classes = n_classes
def forward(self, logits, label):
logits = F.normalize(logits, p=2, axis=1, epsilon=self.eps)
wn = F.normalize(self.w, p=2, axis=0, epsilon=self.eps)
cosine = paddle.matmul(logits, wn)
# subtract the margin from the target-class cosine only, then scale
y = F.one_hot(label, self.n_classes) * self.margin
pred = F.log_softmax((cosine - y) * self.scale, -1)
return self.nll_loss(pred, label), pred
class ProtoTypical(nn.Layer):
"""Proto-typical loss as described in [1].
Reference:
[1] Chung, Joon Son, et al. “In Defence of Metric Learning for Speaker Recognition.”
Interspeech 2020, 2020, pp. 2977–2981.
"""
def __init__(self, s=20.0, eps=1e-8):
super(ProtoTypical, self).__init__()
self.nll_loss = nn.NLLLoss()
self.eps = eps
self.s = s
def forward(self, logits):
assert logits.ndim == 3, (
'the input logits must be a 3d tensor of shape '
f'[n_spk, n_uttns, emb_dim], but received logits.ndim = {logits.ndim}')
logits = F.normalize(logits, p=2, axis=-1, epsilon=self.eps)
proto = paddle.mean(logits[:, 1:, :], axis=1, keepdim=False).transpose(
(1, 0)) # [emb_dim, n_spk]
query = logits[:, 0, :] # [n_spk, emb_dim]
similarity = paddle.matmul(query, proto) * self.s #[n_spk,n_spk]
label = paddle.arange(0, similarity.shape[0])
log_sim = F.log_softmax(similarity, -1)
return self.nll_loss(log_sim, label), log_sim
class AngularMargin(nn.Layer):
def __init__(self, margin=0.0, scale=1.0):
super(AngularMargin, self).__init__()
self.margin = margin
self.scale = scale
def forward(self, outputs, targets):
outputs = outputs - self.margin * targets
return self.scale * outputs
class LogSoftmaxWrapper(nn.Layer):
def __init__(self, loss_fn):
super(LogSoftmaxWrapper, self).__init__()
self.loss_fn = loss_fn
self.criterion = paddle.nn.KLDivLoss(reduction="sum")
def forward(self, outputs, targets, length=None):
targets = F.one_hot(targets, outputs.shape[1])
try:
predictions = self.loss_fn(outputs, targets)
except TypeError:
predictions = self.loss_fn(outputs)
predictions = F.log_softmax(predictions, axis=1)
loss = self.criterion(predictions, targets) / targets.sum()
return loss
class AdditiveAngularMargin(AngularMargin):
def __init__(self,
margin=0.0,
scale=1.0,
feature_dim=256,
n_classes=1000,
easy_margin=False):
super(AdditiveAngularMargin, self).__init__(margin, scale)
self.easy_margin = easy_margin
self.w = paddle.create_parameter((feature_dim, n_classes), 'float32')
self.cos_m = math.cos(self.margin)
self.sin_m = math.sin(self.margin)
self.th = math.cos(math.pi - self.margin)
self.mm = math.sin(math.pi - self.margin) * self.margin
self.nll_loss = nn.NLLLoss()
self.n_classes = n_classes
def forward(self, logits, targets):
logits = F.normalize(logits, p=2, axis=1, epsilon=1e-8)
wn = F.normalize(self.w, p=2, axis=0, epsilon=1e-8)
cosine = logits @ wn
# clip to avoid sqrt of a small negative value caused by floating-point error
sine = paddle.sqrt(paddle.clip(1.0 - paddle.square(cosine), min=0.0))
phi = cosine * self.cos_m - sine * self.sin_m # cos(theta + m)
if self.easy_margin:
phi = paddle.where(cosine > 0, phi, cosine)
else:
phi = paddle.where(cosine > self.th, phi, cosine - self.mm)
target_one_hot = F.one_hot(targets, self.n_classes)
outputs = (target_one_hot * phi) + ((1.0 - target_one_hot) * cosine)
outputs = self.scale * outputs
pred = F.log_softmax(outputs, axis=-1)
return self.nll_loss(pred, targets), pred
class CMSoftmax(AngularMargin):
def __init__(self,
margin=0.0,
margin2=0.0,
scale=1.0,
feature_dim=256,
n_classes=1000,
easy_margin=False):
super(CMSoftmax, self).__init__(margin, scale)
self.easy_margin = easy_margin
self.w = paddle.create_parameter((feature_dim, n_classes), 'float32')
self.cos_m = math.cos(self.margin)
self.sin_m = math.sin(self.margin)
self.th = math.cos(math.pi - self.margin)
self.mm = math.sin(math.pi - self.margin) * self.margin
self.nll_loss = nn.NLLLoss()
self.n_classes = n_classes
self.margin2 = margin2
def forward(self, logits, targets):
logits = F.normalize(logits, p=2, axis=1, epsilon=1e-8)
wn = F.normalize(self.w, p=2, axis=0, epsilon=1e-8)
cosine = logits @ wn
sine = paddle.sqrt(paddle.clip(1.0 - paddle.square(cosine), min=0.0))
phi = cosine * self.cos_m - sine * self.sin_m # cos(theta + m)
if self.easy_margin:
phi = paddle.where(cosine > 0, phi, cosine)
else:
phi = paddle.where(cosine > self.th, phi, cosine - self.mm)
target_one_hot = F.one_hot(targets, self.n_classes)
outputs = (target_one_hot * phi) + (
(1.0 - target_one_hot) * cosine) - target_one_hot * self.margin2
outputs = self.scale * outputs
pred = F.log_softmax(outputs, axis=-1)
return self.nll_loss(pred, targets), pred
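A hedged usage sketch of the AdditiveAngularMargin loss defined above, sized to match the loss section of the EcapaTDNN config (feature_dim 192, n_classes 22506, margin 0.3, scale 30.0). It is an illustrative check appended under a main guard, not the repository's training loop:

if __name__ == '__main__':
    # Illustrative check with random data; relies on the paddle imports at the top of this file.
    paddle.seed(0)
    batch_size, feature_dim, n_classes = 8, 192, 22506
    loss_fn = AdditiveAngularMargin(margin=0.3,
                                    scale=30.0,
                                    feature_dim=feature_dim,
                                    n_classes=n_classes,
                                    easy_margin=False)
    embeddings = paddle.randn((batch_size, feature_dim))
    labels = paddle.randint(0, n_classes, (batch_size, ), dtype='int64')
    loss, pred = loss_fn(embeddings, labels)
    # loss is a scalar tensor; pred holds log-probabilities of shape [batch_size, n_classes]
    print(float(loss), pred.shape)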
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from collections import namedtuple
from typing import List, Tuple, Union
import numpy as np
def compute_eer(
scores: Union[np.ndarray, List[float]], labels: Union[np.ndarray, List[int]]
) -> Tuple[float, float, np.ndarray, np.ndarray]:
"""Compute equal error rate(EER) given matching scores and corresponding labels
Parameters:
scores(np.ndarray,list): the cosine similarity between two speaker embeddings.
labels(np.ndarray,list): the labels of the speaker pairs, with value 1 indicates same speaker and 0 otherwise.
Returns:
eer(float): the equal error rate.
thresh_for_eer(float): the thresh value at which false acceptance rate equals to false rejection rate.
fr_rate(np.ndarray): the false rejection rate as a function of increasing thresholds.
fa_rate(np.ndarray): the false acceptance rate as a function of increasing thresholds.
"""
if isinstance(labels, list):
labels = np.array(labels)
if isinstance(scores, list):
scores = np.array(scores)
label_set = list(np.unique(labels))
assert len(
label_set
) == 2, f'the input labels must contain exactly two distinct values, but received set(labels) = {label_set}'
label_set.sort()
assert label_set == [
0, 1
], 'the input labels must contain only 0 and 1, denoting different and identical speakers respectively.'
eps = 1e-8
#assert np.min(scores) >= -1.0 - eps and np.max(
# scores
# ) < 1.0 + eps, 'the score must be in the range between -1.0 and 1.0'
same_id_scores = scores[labels == 1]
diff_id_scores = scores[labels == 0]
thresh = np.linspace(np.min(diff_id_scores), np.max(same_id_scores), 1000)
thresh = np.expand_dims(thresh, 1)
fr_matrix = same_id_scores < thresh
fa_matrix = diff_id_scores >= thresh
fr_rate = np.mean(fr_matrix, 1)
fa_rate = np.mean(fa_matrix, 1)
thresh_idx = np.argmin(np.abs(fa_rate - fr_rate))
# return an instance of the namedtuple rather than mutating the class itself
Result = namedtuple('speaker', ('eer', 'thresh', 'fa', 'fr'))
return Result(eer=(fr_rate[thresh_idx] + fa_rate[thresh_idx]) / 2,
thresh=thresh[thresh_idx, 0],
fa=fa_rate,
fr=fr_rate)
def compute_min_dcf(fr_rate, fa_rate, p_target=0.05, c_miss=1.0, c_fa=1.0):
""" Compute normalized minimum detection cost function (minDCF) given
the costs for false accepts and false rejects as well as a priori
probability for target speakers
Parameters:
fr_rate(np.ndarray): the false rejection rate as a function of increasing thresholds.
fa_rate(np.ndarray): the false acceptance rate as a function of increasing thresholds.
p_target(float): the prior probability of being a target.
c_miss(float): cost of a missed detection (false reject).
c_fa(float): cost of a false acceptance (false accept).
Returns:
min_dcf(float): the normalized minimum detection cost function (minDCF).
"""
dcf = c_miss * fr_rate * p_target + c_fa * fa_rate * (1 - p_target)
c_det = np.min(dcf)
c_def = min(c_miss * p_target, c_fa * (1 - p_target))
min_dcf = c_det / c_def
return min_dcf
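A hedged sketch of how compute_eer and compute_min_dcf compose; the synthetic score distributions below are invented purely for illustration:

if __name__ == '__main__':
    # Synthetic sanity check: same-speaker scores are drawn higher than
    # different-speaker scores, so the resulting EER should be small.
    np.random.seed(0)
    same = np.random.normal(0.7, 0.1, 1000)   # label 1: identical speakers
    diff = np.random.normal(0.1, 0.1, 1000)   # label 0: different speakers
    scores = np.concatenate([same, diff])
    labels = np.concatenate([np.ones(1000, dtype=int), np.zeros(1000, dtype=int)])
    result = compute_eer(scores, labels)
    min_dcf = compute_min_dcf(result.fr, result.fa, p_target=0.05)
    print(f'EER={result.eer:.4f} at threshold {float(result.thresh):.4f}, minDCF={min_dcf:.4f}')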
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .ecapa_tdnn import EcapaTDNN
from .resnet_se34 import ResNetSE34
from .resnet_se34v2 import ResNetSE34V2
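The configs above select the network by name ('EcapaTDNN' or 'ResNetSE34V2') together with a params mapping, and this __init__ exposes the three model classes. A hedged dispatch sketch follows; build_model, the registry, and the assumption that params map directly onto constructor keyword arguments are illustrative, not the repository's training code:

# Hypothetical helper (illustrative only): construct a model from the `model:`
# section of one of the configs above.
from models import EcapaTDNN, ResNetSE34, ResNetSE34V2  # the 'models' package path is an assumption

_MODEL_REGISTRY = {
    'EcapaTDNN': EcapaTDNN,
    'ResNetSE34': ResNetSE34,
    'ResNetSE34V2': ResNetSE34V2,
}

def build_model(model_cfg: dict):
    # model_cfg is config['model'], e.g. {'name': 'EcapaTDNN', 'params': {'input_size': 80, ...}};
    # the params mapping is assumed to match the chosen class's constructor.
    cls = _MODEL_REGISTRY[model_cfg['name']]
    return cls(**model_cfg['params'])

# Example usage (assuming `config` was loaded with yaml.safe_load as shown earlier):
#   model = build_model(config['model'])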
# ESC: Environmental Sound Classification
numpy >= 1.15.0
scipy >= 1.0.0
resampy >= 0.2.2
soundfile >= 0.9.0
import paddleaudio
if __name__ == '__main__':
# smoke test: load an m4a file with the paddleaudio backend
paddleaudio.load('./test_audio.m4a')