# <p align=center>`UNIMO`</p>
Code for the ACL 2021 main-conference long paper [UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning](https://arxiv.org/pdf/2012.15409.pdf)
## Abstract
Existing pre-training methods focus on either single-modal tasks or multi-modal tasks, and cannot effectively adapt to each other.
They can only utilize single-modal data (i.e., text or image) or limited multi-modal data (i.e., image-text pairs).
In this work, we propose a UNIfied-MOdal pre-training architecture, namely `UNIMO`, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks.
Large-scale free text corpora and image collections are utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align textual and visual information into a unified semantic space over a corpus of image-text pairs augmented with related images and texts.
With the help of rich non-paired single-modal data, our model is able to learn more generalizable representations by allowing textual knowledge and visual knowledge to enhance each other in the unified semantic space.
The experimental results show that `UNIMO` greatly improves the performance of several single-modal and multi-modal downstream tasks.
![UNIMO](images/framework.png#pic_center)
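At its core, CMCL is an InfoNCE-style objective: matched image-text pairs are pulled together in the unified semantic space while unmatched in-batch pairs are pushed apart. The snippet below is a minimal NumPy sketch of that objective, not the repository's implementation; UNIMO additionally augments each pair with retrieved related images and rewritten texts as extra positives and negatives, and the name `cmcl_loss` is ours.
```
import numpy as np

def cmcl_loss(text_emb, image_emb, temperature=0.07):
    """Minimal InfoNCE-style cross-modal contrastive loss (illustrative only).

    Assumes row i of text_emb and row i of image_emb form a positive
    image-text pair; all other in-batch pairs serve as negatives.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature                      # [batch, batch] similarities
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))           # -log p(matched pair)

# toy usage
rng = np.random.default_rng(0)
loss = cmcl_loss(rng.standard_normal((8, 768)), rng.standard_normal((8, 768)))
```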
## Performance
Results on multi-modal understanding and generation tasks:
![UNIMO](images/multiple.png#pic_center)
Results on single-modal understanding and generation tasks:
![UNIMO](images/single.png#pic_center)
---
## TODOs
- [ ] Add all downstream tasks
- [ ] Add unimo large model
## Dependencies
python 3.7.4\
paddlepaddle-gpu==1.8.4.post107\
pyrouge==0.1.3
## Pre-trained Models
`UNIMO` adopts large-scale text corpus, image collections and image-text aligned datasets as the pre-training data.
We currently provide the pre-trained `UNIMO` model at one scale:
[UNIMO base](https://unimo.bj.bcebos.com/model/unimo_base_en.tar.gz) (lowercased | 12 layers)
```
MODEL_SIZE=base
cd /path/to/model_files
wget --no-check-certificate -q https://unimo.bj.bcebos.com/model/unimo_${MODEL_SIZE}_en.tar.gz
tar -zxf unimo_${MODEL_SIZE}_en.tar.gz
```
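The fine-tuning scripts below expect the model files under `./model_files` (for example `config_path="./model_files/config/unimo_base_en.json"` in the task configs). A minimal sketch to sanity-check the downloaded configuration, assuming it sits at that path:
```
import json

with open("./model_files/config/unimo_base_en.json") as f:
    cfg = json.load(f)

# the base model is 12 layers with hidden size 768 (see the config file below)
print(cfg["num_hidden_layers"], cfg["hidden_size"], cfg["vocab_size"])
```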
## Experiments
Our fine-tuning experiments are carried out on V100 GPUs. The table below summarizes the fine-tuning setup for each task:
<table>
<tr>
<td><strong><center>Task Type</strong></td>
<td><strong><center>Dataset</strong></td>
<td><strong><center>Pre-trained Models</strong></td>
<td><strong><center>Start Command</strong></td>
<td><strong><center>V100 GPU Cards</strong></td>
<td><strong><center>Running Time</strong></td>
</tr>
<tr>
<td rowspan="1"><center>Text Understanding<center></td>
<td rowspan="1"><center>SST-2<center></td>
<td><center>UNIMO base</td>
<td><center>sh ./script/classification/SST-2/run.sh</td>
<td><center>8</td>
<td><center>9h</td>
</tr>
<tr>
<td rowspan="1"><center>Text Generation<center></td>
<td rowspan="1"><center>CoQA<center></td>
<td><center>UNIMO base</td>
<td><center>sh ./script/seq2seq/coqa/run.sh</td>
<td><center>4</td>
<td><center>7h</td>
</tr>
<tr>
<td rowspan="1"><center>Multi-Modal Understanding<center></td>
<td rowspan="1"><center>Flickr30k<center></td>
<td><center>UNIMO base</td>
<td><center>sh ./script/retrieval/Flickr30k/run.sh</td>
<td><center>16</td>
<td><center>3d</td>
</tr>
</table>
---
## Text Understanding Tasks
### (1) Sentiment Classification
#### Download SST-2 dataset:
```
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/SST-2.tar.gz
tar -zxf SST-2.tar.gz
```
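The run script reads `train.tsv`, `dev.tsv` and `test.tsv` from `./data/SST-2` (see the `--train_set`/`--dev_set`/`--test_set` flags in `run.sh`). A minimal reader sketch, assuming the standard GLUE `sentence<TAB>label` layout with a header row:
```
import csv

with open("./data/SST-2/train.tsv") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)  # skip the header row
    for sentence, label in reader:
        pass  # feed (sentence, label) to the tokenizer / model
```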
#### Run the following command to train and evaluate on the SST-2 dataset:
For base model:
```
bash ./script/classification/SST-2/run.sh
```
#### Evaluation Results:
<table>
<tr>
<td><strong><center>Model</strong></td>
<td><strong><center>Acc</strong></td>
</tr>
<tr>
<td><center>UNIMO-base</td>
<td><center>95.1</td>
</tr>
</table>
## Text Generation Tasks
### (1) Conversation Question Answering
#### Download CoQA dataset:
```
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/coqa.tar.gz
tar -zxf coqa.tar.gz
```
#### Download evaluation script:
```
cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/coqa.tar.gz
tar -zxf coqa.tar.gz
```
#### Run the following command to train and evaluate on the CoQA dataset:
For base model:
```
bash ./script/seq2seq/coqa/run.sh
```
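CoQA is scored with token-overlap F1 (the script config sets `eval_mertrics="f1"` and delegates to the downloaded eval script). For intuition, here is a minimal sketch of per-answer token F1; the official CoQA scorer additionally normalizes text and averages over multiple references:
```
from collections import Counter

def token_f1(pred, ref):
    """Token-overlap F1 between a predicted and a reference answer (sketch)."""
    pred_toks, ref_toks = pred.split(), ref.split()
    common = Counter(pred_toks) & Counter(ref_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(ref_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("in the garden", "the garden"))  # 0.8
```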
#### Evaluation Results:
<table>
<tr>
<td><strong><center>Model</strong></td>
<td><strong><center>F1</strong></td>
</tr>
<tr>
<td><center>UNIMO-base</td>
<td><center>80.2</td>
</tr>
</table>
## Multi-Modal Understanding Tasks
### (1) Image-Text Retrieval
#### Download Flickr30k dataset:
##### Note: Visual features are extracted by [bottom-up-attention](https://github.com/peteanderson80/bottom-up-attention)
```
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/Flickr30k.tar.gz # occupies about 37G disk space
tar -zxf Flickr30k.tar.gz
```
#### Run the following command to train and evaluate on the Flickr30k dataset:
For base model:
```
bash ./script/retrieval/Flickr30k/run.sh
```
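Retrieval is evaluated with Recall@K (`eval_mertrics="recall@k"` in the run script): the fraction of queries whose ground-truth match appears among the top K ranked candidates. A minimal NumPy sketch, assuming query i matches candidate i:
```
import numpy as np

def recall_at_k(sim, k):
    # sim[i, j] = similarity between query i and candidate j
    topk = np.argsort(-sim, axis=1)[:, :k]
    gold = np.arange(sim.shape[0])[:, None]
    return float((topk == gold).any(axis=1).mean())

sim = np.array([[0.9, 0.2, 0.1],
                [0.3, 0.1, 0.8],
                [0.2, 0.7, 0.4]])
print(recall_at_k(sim, 1))  # only query 0 ranks its match first -> 1/3
```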
#### Evaluation Results:
Results of the Image Retrieval task on the Flickr30k dataset:
<table>
<tr>
<td><strong><center>Model</strong></td>
<td><strong><center>R@1</strong></td>
<td><strong><center>R@5</strong></td>
<td><strong><center>R@10</strong></td>
</tr>
<tr>
<td><center>UNIMO-base</td>
<td><center>74.66</td>
<td><center>93.40</td>
<td><center>96.08</td>
</tr>
</table>

Results of the Text Retrieval task on the Flickr30k dataset:
<table>
<tr>
<td><strong><center>Model</strong></td>
<td><strong><center>R@1</strong></td>
<td><strong><center>R@5</strong></td>
<td><strong><center>R@10</strong></td>
</tr>
<tr>
<td><center>UNIMO-base</td>
<td><center>89.70</td>
<td><center>98.40</td>
<td><center>99.10</td>
</tr>
</table>
---
## Citation
If you find our paper and code useful, please cite the following paper:
```
@article{li2020unimo,
title={UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning},
author={Li, Wei and Gao, Can and Niu, Guocheng and Xiao, Xinyan and Liu, Hao and Liu, Jiachen and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2012.15409},
year={2020}
}
```
## Contact Information
For help or issues using `UNIMO`, please submit a GitHub issue.
For personal communication related to `UNIMO`, please contact Wei Li (liwei85@baidu.com), Guocheng Niu (niuguocheng@baidu.com), or Can Gao (gaocan01@baidu.com).
data_name=SST-2
data_tar=${data_name}.tar.gz
bos_url=https://unimo.bj.bcebos.com/data/SST-2.tar.gz
rm -rf $data_name
wget --no-check-certificate -q $bos_url
if [[ $? -ne 0 ]]; then
echo "url link: $bos_url"
echo "download data failed"
exit 1
fi
tar zxf $data_tar
rm -f $data_tar
exit 0
#!/usr/bin/env bash
set -x
# add CUDA, cuDNN and NCCL to environment variable
# export LD_LIBRARY_PATH=/home/work/cuda-10.0/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
# export LD_LIBRARY_PATH=/home/work/cuda-10.0/extras/CUPTI/lib64:$LD_LIBRARY_PATH
# export LD_LIBRARY_PATH=/home/work/cudnn/cudnn_v7.6/cuda/lib64:$LD_LIBRARY_PATH
# export LD_LIBRARY_PATH=/home/work/nccl/nccl2.4.2_cuda10.1/lib:$LD_LIBRARY_PATH
export FLAGS_sync_nccl_allreduce=1
export FLAGS_fraction_of_gpu_memory_to_use=1
export FLAGS_eager_delete_tensor_gb=1.0
export FLAGS_fast_eager_deletion_mode=1
export FLAGS_memory_fraction_of_eager_deletion=1
export iplist=`hostname -i`
unset http_proxy
unset https_proxy
set +x
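Model configuration for the `UNIMO` base model (the `./model_files/config/unimo_base_en.json` referenced by the fine-tuning scripts):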
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "max_position_embeddings": 514,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 0,
  "sent_type_vocab_size": 0,
  "task_type_vocab_size": 0,
  "vocab_size": 50265,
  "max_img_len": 37,
  "max_obj_len": 50,
  "image_class_size": 1601,
  "image_attr_size": 401,
  "image_embedding_size": 2048,
  "image_predict_feature": true,
  "image_predict_class": true,
  "image_use_attr": false,
  "image_use_soft_label": true,
  "use_neg_lm_loss": false,
  "fusion_method": "mul",
  "similarity_method": "softmax",
  "txt_mask_ratio": 0.15,
  "vl_mask_ratio": 0.15,
  "scenegraph_mask_ratio": 0.3,
  "overlap_ratio": 0.4,
  "num_labels": 2,
  "max_pixel_len": 256,
  "max_pixel_position_embeddings": 196
}
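A second configuration with 24 hidden layers and hidden size 1024, presumably for the `UNIMO` large model listed in the TODOs: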
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "max_position_embeddings": 514,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "type_vocab_size": 0,
  "sent_type_vocab_size": 0,
  "task_type_vocab_size": 0,
  "vocab_size": 50265,
  "max_img_len": 101,
  "max_obj_len": 100,
  "image_class_size": 1601,
  "image_attr_size": 401,
  "image_embedding_size": 2048,
  "image_predict_feature": true,
  "image_predict_class": true,
  "image_use_attr": false,
  "image_use_soft_label": true,
  "use_neg_lm_loss": false,
  "fusion_method": "mul",
  "txt_mask_ratio": 0.15,
  "vl_mask_ratio": 0.15,
  "scenegraph_mask_ratio": 0.4,
  "overlap_ratio": 0.3,
  "num_labels": 2,
  "max_pixel_len": 256,
  "max_pixel_position_embeddings": 196
}
(Three files omitted here: their source diffs are too large to display.)
data_name=unimo_base_en
data_tar=${data_name}.tar.gz
bos_url=https://unimo.bj.bcebos.com/model/$data_tar
rm -rf $data_name
wget --no-check-certificate -q $bos_url
if [[ $? -ne 0 ]]; then
echo "url link: $bos_url"
echo "download data failed"
exit 1
fi
tar zxf $data_tar
rm -f $data_tar
exit 0
paddlepaddle-gpu==1.8.4.post107
pyrouge==0.1.3
regex==2020.7.14
output_name="classification"
task=SST-2
## hyper param
use_fp16="False"
do_train="True"
do_val="True"
do_test="False"
do_pred="True"
num_labels=2
weight_decay=0
max_len=512
warmup_ratio=0.06
save_checkpoints="False"
save_steps=2000
validation_steps=2000
skip_steps=10
eval_mertrics=simple_accuracy
EPOCH=("10")
BATCH_SIZE=("16" "32")
LR_RATE=("1e-5" "2e-5" "3e-5")
DD_RAND_SEED=("1" "2" "3" "4" "5")
init_model="./model_files/unimo_base_en"
config_path="./model_files/config/unimo_base_en.json"
vocab_file="./model_files/dict/unimo_en.vocab.txt"
bpe_json="./model_files/dict/unimo_en.encoder.json"
bpe_file="./model_files/dict/unimo_en.vocab.bpe"
#!/usr/bin/env bash
set -eux
R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
cd ${MYDIR}/../../../
# config env
source ${MYDIR}/model_conf
source ./env.sh
source ./utils.sh
check_iplist
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
output_dir=./output/${task}
log_dir=${output_dir}/log
save_model_base_dir=$output_dir/save_model
mkdir -p $output_dir $log_dir $save_model_base_dir
if [[ ${do_pred} == "True" ]]; then
pred_save_prefix="${output_dir}/predict"
mkdir -p $pred_save_prefix
fi
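# grid search over the random seeds, epochs, learning rates and batch sizes defined in model_conf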
for seed in "${DD_RAND_SEED[@]}"; do
echo "seed "$seed
for epoch in "${EPOCH[@]}"; do
echo "epoch "$epoch
for lr in "${LR_RATE[@]}"; do
echo "learning rate "$lr
for bs in "${BATCH_SIZE[@]}"; do
echo "batch_size "$bs
log_prefix=$seed"_"$epoch"_"$lr"_"$bs"."
if [[ ${do_pred} == "True" ]]; then
pred_save="${pred_save_prefix}/test.${seed}.${epoch}.${lr}.${bs}"
fi
if [[ ${save_checkpoints} == "True" ]]; then
save_model_dir="${save_model_base_dir}/params.${seed}.${epoch}.${lr}.${bs}"
mkdir -p $save_model_dir
fi
if [[ ${bs} == "32" ]]; then
validation_steps=1000
fi
python -u ./src/run_classifier.py --use_cuda "True" \
--is_distributed ${is_distributed:-"False"} \
--weight_sharing ${weight_sharing:-"True"} \
--use_fast_executor ${e_executor:-"true"} \
--use_fp16 ${use_fp16:-"false"} \
--nccl_comm_num ${nccl_comm_num:-1} \
--use_hierarchical_allreduce ${use_hierarchical_allreduce:-"False"} \
--in_tokens ${in_tokens:-"false"} \
--use_dynamic_loss_scaling ${use_fp16} \
--init_loss_scaling ${loss_scaling:-12800} \
--beta1 ${beta1:-0.9} \
--beta2 ${beta2:-0.98} \
--epsilon ${epsilon:-1e-06} \
--verbose true \
--do_train ${do_train:-"True"} \
--do_val ${do_val:-"True"} \
--do_test ${do_test:-"True"} \
--do_pred ${do_pred:-"True"} \
--pred_save ${pred_save:-"./output/predict/test"} \
--batch_size ${bs:-16} \
--init_pretraining_params ${init_model:-""} \
--train_set ./data/SST-2/train.tsv \
--dev_set ./data/SST-2/dev.tsv \
--test_set ./data/SST-2/test.tsv \
--checkpoints ${save_model_dir:-""} \
--save_checkpoints ${save_checkpoints:-"True"} \
--save_steps ${save_steps:-1000} \
--weight_decay ${weight_decay:-"0.1"} \
--warmup_proportion ${warmup_ratio:-"0.06"} \
--validation_steps ${validation_steps:-"100"} \
--epoch $epoch \
--max_seq_len ${max_len:-512} \
--learning_rate ${lr:-"5e-5"} \
--lr_scheduler ${lr_scheduler:-"linear_warmup_decay"} \
--skip_steps ${skip_steps:-"10"} \
--num_iteration_per_drop_scope 10 \
--num_labels ${num_labels:-2} \
--unimo_vocab_file ${vocab_file} \
--encoder_json_file ${bpe_json} \
--vocab_bpe_file ${bpe_file} \
--unimo_config_path ${config_path} \
--eval_mertrics ${eval_mertrics:-"simple_accuracy"} \
--random_seed ${seed:-1} >> $log_dir/${log_prefix}lanch.log 2>&1
done
done
done
done
if [[ $? -ne 0 ]]; then
echo "run failed"
exit 1
fi
python ./src/utils/stat_res.py --log_dir=$log_dir
exit 0
output_name="retrieval"
task=Flickr30k
## hyper param
epoch=40
do_train="True"
do_val="True"
do_test="True"
save_checkpoints="False"
save_steps=10000
validation_steps=10000
samples_num=20
bbox="bbox100"
max_img_len=101
seed=1
batch_size=4
test_batch_size=128
lr=5e-6
learning_rate_scale=0.1
learning_rate_decay_epoch1=24
learning_rate_decay_epoch2=32
init_model="./model_files/unimo_base_en"
config_path="./model_files/config/unimo_base_en.json"
vocab_file="./model_files/dict/unimo_en.vocab.txt"
bpe_json="./model_files/dict/unimo_en.encoder.json"
bpe_file="./model_files/dict/unimo_en.vocab.bpe"
#!/usr/bin/env bash
set -eux
R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
cd ${MYDIR}/../../../
# config env
source ${MYDIR}/model_conf
source ./env.sh
source ./utils.sh
check_iplist
export FLAGS_fuse_parameter_memory_size=64
set -eu
output_dir=./output/${task}
log_dir=${output_dir}/log
save_model_base_dir=$output_dir/save_model
mkdir -p $output_dir $log_dir $save_model_base_dir
log_prefix=$seed"_"$epoch"_"$lr"_"$batch_size"."
eval_dir="${output_dir}/tmp/params.${seed}.${epoch}.${lr}.${batch_size}"
mkdir -p $eval_dir
if [[ ${save_checkpoints} == "True" ]]; then
save_model_dir="${save_model_base_dir}/params.${seed}.${epoch}.${lr}.${batch_size}"
mkdir -p $save_model_dir
fi
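# multi-node / multi-GPU launch arguments consumed by src/launch.py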
distributed_args="--node_ips ${PADDLE_TRAINERS} \
--node_id ${PADDLE_TRAINER_ID} \
--current_node_ip ${POD_IP} \
--selected_gpus 0,1,2,3,4,5,6,7 \
--split_log_path $log_dir \
--log_prefix $log_prefix \
--nproc_per_node 8"
lanch_start=" -u ./src/launch.py ${distributed_args} "
python $lanch_start ./src/run_retrieval.py \
--use_cuda "True" \
--is_distributed ${is_distributed:-"True"} \
--weight_sharing ${weight_sharing:-"True"} \
--use_fuse ${use_fuse:-"True"} \
--use_fast_executor ${e_executor:-"true"} \
--use_fp16 ${use_fp16:-"false"} \
--nccl_comm_num ${nccl_comm_num:-2} \
--use_hierarchical_allreduce ${use_hierarchical_allreduce:-"False"} \
--use_dynamic_loss_scaling ${use_fp16:-"False"} \
--use_sigmoid ${use_sigmoid:-"False"} \
--init_loss_scaling ${loss_scaling:-12800} \
--beta1 ${beta1:-0.9} \
--beta2 ${beta2:-0.98} \
--epsilon ${epsilon:-1e-06} \
--scale_circle ${scale_circle:-1.0} \
--margin ${margin:-0.2} \
--verbose true \
--samples_num ${samples_num:-20} \
--run_random ${run_random:-"False"} \
--do_train ${do_train:-"True"} \
--do_val ${do_val:-"True"} \
--do_test ${do_test:-"True"} \
--batch_size ${batch_size:-16} \
--test_batch_size ${test_batch_size:-96} \
--init_pretraining_params ${init_model:-""} \
--train_image_caption ./data/Flickr30k/flickr30k-textids/train.ids \
--train_image_feature_dir ./data/Flickr30k/flickr30k-features/$bbox/train \
--dev_image_caption ./data/Flickr30k/flickr30k-textids/val.all.ids \
--dev_image_feature_dir ./data/Flickr30k/flickr30k-features/$bbox/dev \
--test_image_caption ./data/Flickr30k/flickr30k-textids/test.all.ids \
--test_image_feature_dir ./data/Flickr30k/flickr30k-features/$bbox/test \
--img_id_path ./data/Flickr30k/flickr30k-textids/dataset_flickr30k_name_id.txt \
--checkpoints ${save_model_dir:-""} \
--save_checkpoints ${save_checkpoints:-"True"} \
--save_steps ${save_steps:-1000} \
--weight_decay ${weight_decay:-"0.1"} \
--warmup_step ${warmup_step:-"1"} \
--validation_steps ${validation_steps:-"100"} \
--epoch $epoch \
--max_seq_len ${max_len:-512} \
--max_img_len ${max_img_len:-37} \
--learning_rate ${lr:-"5e-6"} \
--learning_rate_scale ${learning_rate_scale:-0.1} \
--learning_rate_decay_epoch1 ${learning_rate_decay_epoch1:-24} \
--learning_rate_decay_epoch2 ${learning_rate_decay_epoch2:-32} \
--lr_scheduler ${lr_scheduler:-"scale_by_epoch_decay"} \
--skip_steps ${skip_steps:-"50"} \
--num_iteration_per_drop_scope 10 \
--unimo_vocab_file ${vocab_file} \
--encoder_json_file ${bpe_json} \
--vocab_bpe_file ${bpe_file} \
--unimo_config_path ${config_path} \
--eval_mertrics ${eval_mertrics:-"recall@k"} \
--eval_dir $eval_dir \
--random_seed ${seed:-1} \
>> $log_dir/${log_prefix}lanch.log 2>&1
if [[ $? -ne 0 ]]; then
echo "run failed"
exit 1
fi
exit 0
output_name="seq2seq"
init_model="./model_files/unimo_base_en"
data_path='./data/coqa'
# hyper param
lr_scheduler="linear_warmup_decay"
use_fp16="False"
# fuse the per-layer AllReduce operations into larger ones
use_fuse="True"
use_hierarchical_allreduce="True"
loss_scaling=12800
skip_steps=100
save_steps=10000
validation_steps=10000
label_smooth=0.1
weight_decay=0.01
max_seq_len=512
random_seed=666
#for multi-turn dialog/qa
task_type="dialog"
role_type_size=3
turn_type_size=16
#decoding params
do_decode="true"
max_src_len=480
max_tgt_len=32
max_out_len=30
min_out_len=0
beam_size=3
length_penalty=0.0
block_trigram="False"
use_multi_gpu_test="True"
#adam optimizer
beta1=0.9
beta2=0.98
epsilon=1e-06
#data
tokenized_input="True"
continuous_position="False"
#dataset
train_set="train.tsv"
dev_set="dev.tsv"
test_set="dev.tsv"
do_train="true"
do_val="true"
do_test="false"
do_pred="false"
#evaluate
eval_script="bash ./src/eval/tasks/coqa/eval.sh"
eval_mertrics="f1"
## tuning params
in_tokens="False"
pred_batch_size=4
epoch=20
BATCH_SIZE=("8")
LR_RATE=("1e-5")
DD_RAND_SEED=("1")
WARMUP_PROP=("0.06")
config_path="./model_files/config/unimo_base_en.json"
vocab_file="./model_files/dict/unimo_en.vocab.txt"
bpe_json="./model_files/dict/unimo_en.encoder.json"
bpe_file="./model_files/dict/unimo_en.vocab.bpe"
#!/usr/bin/env bash
set -eux
R_DIR=`dirname $0`; MYDIR=`cd $R_DIR;pwd`
cd ${MYDIR}/../../../
# config env
source ${MYDIR}/model_conf
source ./env.sh
source ./utils.sh
# check
check_iplist
set -eu
output_dir=../output-coqa
log_dir=../log-coqa
mkdir -p $output_dir $log_dir
e_executor=$(echo ${use_experimental_executor-'True'} | tr '[A-Z]' '[a-z]')
use_fuse=$(echo ${use_fuse-'False'} | tr '[A-Z]' '[a-z]')
if [[ ${use_fuse} == "true" ]]; then
# fuse buffer size in MB
export FLAGS_fuse_parameter_memory_size=64
fi
export DEV_PREFIX=`echo ${dev_set:-"dev.tsv"} | sed 's/\.tsv$//'`
export TEST_PREFIX=`echo ${test_set:-"test.tsv"} | sed 's/\.tsv$//'`
export PRED_PREFIX=`echo ${pred_set:-"pred.tsv"} | sed 's/\.tsv$//'`
export EVAL_SCRIPT_LOG=${MYDIR}/../../../${output_dir}/eval.log
export TASK_DATA_PATH=${data_path}
distributed_args="--node_ips ${PADDLE_TRAINERS} \
--node_id ${PADDLE_TRAINER_ID} \
--current_node_ip ${POD_IP} \
--selected_gpus 4,5,6,7 \
--split_log_path $log_dir \
--nproc_per_node 4"
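# grid search over the seeds, batch sizes, warmup proportions and learning rates defined in model_conf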
for random_seed in "${DD_RAND_SEED[@]}"; do
echo "random_seed "${random_seed}
for batch_size in "${BATCH_SIZE[@]}"; do
echo "batch_size "${batch_size}
for warmup_proportion in "${WARMUP_PROP[@]}"; do
echo "warmup_proportion "${warmup_proportion}
for learning_rate in "${LR_RATE[@]}"; do
echo "learning rate "${learning_rate}
python -u ./src/launch.py ${distributed_args} \
./src/run_seq2seq.py --use_cuda "True" \
--is_distributed "True" \
--use_multi_gpu_test ${use_multi_gpu_test:-"True"} \
--use_fp16 ${use_fp16:-"False"} \
--use_dynamic_loss_scaling ${use_fp16} \
--init_loss_scaling ${loss_scaling:-128} \
--use_fast_executor ${e_executor:-"True"} \
--use_fuse ${use_fuse:-"False"} \
--nccl_comm_num ${nccl_comm_num:-1} \
--use_hierarchical_allreduce ${use_hierarchical_allreduce:-"False"} \
--do_train ${do_train:-"true"} \
--do_val ${do_val:-"false"} \
--do_test ${do_test:-"true"} \
--do_pred ${do_pred:-"false"} \
--do_decode ${do_decode:-"True"} \
--train_set ${data_path}/${train_set:-""} \
--dev_set ${data_path}/${dev_set:-""} \
--test_set ${data_path}/${test_set:-""} \
--pred_set ${data_path}/${pred_set:-""} \
--epoch ${epoch} \
--tokenized_input ${tokenized_input:-"True"} \
--task_type ${task_type:-"dialog"} \
--role_type_size ${role_type_size:-3} \
--turn_type_size ${turn_type_size:-16} \
--max_seq_len ${max_seq_len} \
--max_src_len ${max_src_len} \
--max_tgt_len ${max_tgt_len} \
--max_out_len ${max_out_len} \
--min_out_len ${min_out_len} \
--block_trigram ${block_trigram:-"True"} \
--beam_size ${beam_size:-5} \
--length_penalty ${length_penalty:-0.6} \
--hidden_dropout_prob ${hidden_dropout_prob:-0.1} \
--attention_probs_dropout_prob ${attention_probs_dropout_prob:-0.1} \
--beta1 ${beta1:-0.9} \
--beta2 ${beta2:-0.98} \
--epsilon ${epsilon:-1e-06} \
--continuous_position ${continuous_position:-"false"} \
--tgt_type_id ${tgt_type_id:-1}\
--batch_size ${batch_size} \
--pred_batch_size ${pred_batch_size} \
--in_tokens ${in_tokens:-"True"} \
--learning_rate ${learning_rate} \
--lr_scheduler ${lr_scheduler:-"linear_warmup_decay"} \
--warmup_proportion ${warmup_proportion:-0.02} \
--weight_decay ${weight_decay:-0.01} \
--weight_sharing ${weight_sharing:-"True"} \
--label_smooth ${label_smooth:-0.1} \
--init_pretraining_params ${init_model:-""} \
--unimo_vocab_file ${vocab_file} \
--encoder_json_file ${bpe_json} \
--vocab_bpe_file ${bpe_file} \
--unimo_config_path ${config_path} \
--checkpoints $output_dir \
--save_steps ${save_steps:-10000} \
--validation_steps ${validation_steps:-10000} \
--skip_steps ${skip_steps:-10} \
--save_and_valid_by_epoch ${save_and_valid_by_epoch:-"False"} \
--eval_script ${eval_script:-""} \
--eval_mertrics ${eval_mertrics:-"bleu_1"} \
--random_seed ${random_seed:-"666"} >> $log_dir/lanch.log 2>&1
done
done
done
done
python ./src/utils/extract_eval_res.py --log_dir=$log_dir
exit 0
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""args for classification task"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
from utils.args import ArgumentGroup
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
model_g.add_arg("save_checkpoints", bool, True, "Whether to save checkpoints")
model_g.add_arg("weight_sharing", bool, True, "If set, share weights between word embedding and masked lm.")
model_g.add_arg("unimo_vocab_file", str, './model_files/dict/unimo_en.vocab.txt', "unimo vocab")
model_g.add_arg("encoder_json_file", str, './model_files/dict/unimo_en.encoder.json', 'bpt map')
model_g.add_arg("vocab_bpe_file", str, './model_files/dict/unimo_en.vocab.bpe', "vocab bpe")
model_g.add_arg("unimo_config_path", str, "./model_files/config/unimo_base_en.json",
"The file to save unimo configuration.")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("warmup_proportion", float, 0.1,
"Proportion of training steps to perform linear learning rate warmup for.")
train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
train_g.add_arg("nccl_comm_num", int, 1, "NCCL comm num.")
train_g.add_arg("hierarchical_allreduce_inter_nranks", int, 8, "Hierarchical allreduce inter ranks.")
train_g.add_arg("use_hierarchical_allreduce", bool, False, "Use hierarchical allreduce or not.")
train_g.add_arg("use_fp16", bool, False, "Whether to use fp16 mixed precision training.")
train_g.add_arg("use_dynamic_loss_scaling", bool, False, "Whether to use dynamic loss scaling.")
train_g.add_arg("init_loss_scaling", float, 1.0,
"Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
train_g.add_arg("incr_every_n_steps", int, 100, "Increases loss scaling every n consecutive.")
train_g.add_arg("decr_every_n_nan_or_inf", int, 2,
"Decreases loss scaling every n accumulated steps with nan or inf gradients.")
train_g.add_arg("incr_ratio", float, 2.0,
"The multiplier to use when increasing the loss scaling.")
train_g.add_arg("decr_ratio", float, 0.8,
"The less-than-one-multiplier to use when decreasing.")
train_g.add_arg("beta1", float, 0.9, "beta1 for adam")
train_g.add_arg("beta2", float, 0.98, "beta2 for adam.")
train_g.add_arg("epsilon", float, 1e-06, "epsilon for adam.")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("train_set", str, None, "Path to training data.")
data_g.add_arg("test_set", str, None, "Path to test data.")
data_g.add_arg("test_hard_set", str, None, "Path to test_hard data.")
data_g.add_arg("dev_set", str, None, "Path to validation data.")
data_g.add_arg("dev_hard_set", str, None, "Path to validation_hard data.")
data_g.add_arg("diagnostic_set", str, None, "Path to diagnostic data.")
data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("batch_size", int, 32, "Total examples' number in batch for training. see also --in_tokens.")
data_g.add_arg("in_tokens", bool, False,
"If set, the batch size will be the maximum number of tokens in one batch. "
"Otherwise, it will be the maximum number of examples in one batch.")
data_g.add_arg("do_lower_case", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("random_seed", int, 0, "Random seed.")
data_g.add_arg("num_labels", int, 2, "label number")
data_g.add_arg("max_query_length", int, 64, "Max query length.")
data_g.add_arg("max_answer_length", int, 100, "Max answer length.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.")
run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("num_iteration_per_drop_scope", int, 10, "Iteration intervals to drop scope.")
run_type_g.add_arg("do_train", bool, False, "Whether to perform training.")
run_type_g.add_arg("do_val", bool, False, "Whether to perform evaluation on dev data set.")
run_type_g.add_arg("do_val_hard", bool, False, "Whether to perform evaluation on dev hard data set.")
run_type_g.add_arg("do_test", bool, False, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("do_test_hard", bool, False, "Whether to perform evaluation on test hard data set.")
run_type_g.add_arg("do_pred", bool, False, "Whether to predict on test data set.")
run_type_g.add_arg("do_pred_hard", bool, False, "Whether to predict on test hard data set.")
run_type_g.add_arg("do_diagnostic", bool, False, "Whether to predict on diagnostic data set.")
run_type_g.add_arg("pred_save", str, "./output/predict/test", "Whether to predict on test data set.")
run_type_g.add_arg("use_multi_gpu_test", bool, False, "Whether to perform evaluation using multiple gpu cards")
run_type_g.add_arg("eval_mertrics", str, "simple_accuracy", "eval_mertrics")
# yapf: enable
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""args for image-to-text generation"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import argparse
from utils.args import ArgumentGroup
class CustomAction(argparse.Action):
    """custom action"""
    def __call__(self, parser, namespace, values, option_string=None):
        setattr(namespace, self.dest, " ".join(values))
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
model_g.add_arg("weight_sharing", bool, True, "If set, share weights between word embedding and masked lm.")
model_g.add_arg("unimo_vocab_file", str, './model_files/dict/unimo_en.vocab.txt', "unimo vocab")
model_g.add_arg("encoder_json_file", str, './model_files/dict/unimo_en.encoder.json', 'bpt map')
model_g.add_arg("vocab_bpe_file", str, './model_files/dict/unimo_en.vocab.bpe', "vocab bpe")
model_g.add_arg("unimo_config_path", str, "./model_files/config/unimo_base_en.json",
"The file to save unimo configuration.")
model_g.add_arg("object_file", str, "./data/coco_object_0.35_tot.ids", "The object file for image bounding boxes.")
model_g.add_arg("adv_type", str, "villa", "The adversial learning type: freelb_image, freelb_text, villa")
model_g.add_arg("adv_step", int, 4, "adv_step")
model_g.add_arg("adv_lr", float, 0.05, "adv_lr")
model_g.add_arg("norm_type", str, 'l2', "norm_type")
model_g.add_arg("adv_max_norm", float, 0.4, "adv_max_norm")
model_g.add_arg("adv_init_mag", float, 0.4, "adv_init_mag")
model_g.add_arg("adv_kl_weight", float, 1.5, "adv_kl_weight")
model_g.add_arg("with_pure_model", bool, True, "whether include the pure model during adv learning")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 50, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 4e-5, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("warmup_proportion", float, 0.02,
"Proportion of training steps to perform linear learning rate warmup for.")
train_g.add_arg("save_steps", int, 100000, "The steps interval to save checkpoints.")
train_g.add_arg("validation_steps", int, 100000, "The steps interval to evaluate model performance.")
train_g.add_arg("use_fuse", bool, False, "Whether to use fuse_allreduce_ops.")
train_g.add_arg("nccl_comm_num", int, 1, "NCCL comm num.")
train_g.add_arg("hierarchical_allreduce_inter_nranks", int, 8, "Hierarchical allreduce inter ranks.")
train_g.add_arg("use_hierarchical_allreduce", bool, False, "Use hierarchical allreduce or not.")
train_g.add_arg("use_fp16", bool, False, "Whether to use fp16 mixed precision training.")
train_g.add_arg("use_dynamic_loss_scaling", bool, False, "Whether to use dynamic loss scaling.")
train_g.add_arg("init_loss_scaling", float, 128.0,
"Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
train_g.add_arg("incr_every_n_steps", int, 100, "Increases loss scaling every n consecutive.")
train_g.add_arg("decr_every_n_nan_or_inf", int, 2,
"Decreases loss scaling every n accumulated steps with nan or inf gradients.")
train_g.add_arg("incr_ratio", float, 2.0,
"The multiplier to use when increasing the loss scaling.")
train_g.add_arg("decr_ratio", float, 0.8,
"The less-than-one-multiplier to use when decreasing.")
train_g.add_arg("beta1", float, 0.9, "beta1 for adam")
train_g.add_arg("beta2", float, 0.98, "beta2 for adam.")
train_g.add_arg("epsilon", float, 1e-06, "epsilon for adam.")
train_g.add_arg("tgt_type_id", int, 1, "for seq2seq task.")
train_g.add_arg("do_decode", bool, False, "for seq2seq task.")
train_g.add_arg("label_smooth", float, 0.1, "label smooth")
train_g.add_arg("hidden_dropout_prob", float, 0.1, "hidden_dropout_prob")
train_g.add_arg("attention_probs_dropout_prob", float, 0.1, "attention_probs_dropout_prob")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 100, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, True, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("task_type", str, "normal", "is task type")
data_g.add_arg("train_filelist", str, None, "Path to training data.")
data_g.add_arg("test_filelist", str, None, "Path to test data.")
data_g.add_arg("valid_filelist", str, None, "Path to validation data.")
data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("max_tgt_len", int, 512, "for seq2seq task.")
data_g.add_arg("max_out_len", int, 512, "for seq2seq task.")
data_g.add_arg("min_out_len", int, 20, "for seq2seq task.")
data_g.add_arg("block_trigram", bool, True, "utilize trigram blocking during beam search")
data_g.add_arg("beam_size", int, 5, "for seq2seq task.")
data_g.add_arg("batch_size", int, 32, "Total examples' number in batch for training.")
data_g.add_arg("pred_batch_size", int, 0, "Total examples' number in batch for training.")
data_g.add_arg("do_lower_case", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("length_penalty", float, 0.6, "length_penalty")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("visualdl_log", bool, False, "If set, use visualdl_log on paddlecloud.")
run_type_g.add_arg("is_distributed", bool, True, "If set, then start distributed training.")
run_type_g.add_arg("use_fast_executor", bool, True, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("num_iteration_per_drop_scope", int, 1, "Iteration intervals to drop scope.")
run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
run_type_g.add_arg("do_val", bool, True, "Whether to perform evaluation on dev data set.")
run_type_g.add_arg("do_test", bool, True, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("do_pred", bool, True, "Whether to perform evaluation on pred data set.")
run_type_g.add_arg("use_multi_gpu_test", bool, True, "Whether to perform evaluation using multiple gpu cards")
run_type_g.add_arg("save_and_valid_by_epoch", bool, False, "save_and_valid_by_epoch")
run_type_g.add_arg("eval_script", action=CustomAction, type=str, nargs='+', help="eval_script", default=None)
run_type_g.add_arg("eval_mertrics", str, "", "eval_mertrics")
run_type_g.add_arg("random_seed", int, 0, "Random seed.")
image_g = ArgumentGroup(parser, "image", "image configuration options")
image_g.add_arg("image_embedding_size", int, 2048, "Image feature size==2048.")
image_g.add_arg("max_img_len", int, 37, "Image feature size==2048.")
image_g.add_arg("max_obj_len", int, 50, "max num of object size.")
# yapf: enable
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""args for regression task"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
from utils.args import ArgumentGroup
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
model_g.add_arg("save_checkpoints", bool, True, "Whether to save checkpoints")
model_g.add_arg("weight_sharing", bool, True, "If set, share weights between word embedding and masked lm.")
model_g.add_arg("unimo_vocab_file", str, './model_files/dict/unimo_en.vocab.txt', "unimo vocab")
model_g.add_arg("encoder_json_file", str, './model_files/dict/unimo_en.encoder.json', 'bpt map')
model_g.add_arg("vocab_bpe_file", str, './model_files/dict/unimo_en.vocab.bpe', "vocab bpe")
model_g.add_arg("unimo_config_path", str, "./model_files/config/unimo_base_en.json",
"The file to save unimo configuration.")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("warmup_proportion", float, 0.1,
"Proportion of training steps to perform linear learning rate warmup for.")
train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
train_g.add_arg("nccl_comm_num", int, 1, "NCCL comm num.")
train_g.add_arg("hierarchical_allreduce_inter_nranks", int, 8, "Hierarchical allreduce inter ranks.")
train_g.add_arg("use_hierarchical_allreduce", bool, False, "Use hierarchical allreduce or not.")
train_g.add_arg("use_fp16", bool, False, "Whether to use fp16 mixed precision training.")
train_g.add_arg("use_dynamic_loss_scaling", bool, False, "Whether to use dynamic loss scaling.")
train_g.add_arg("init_loss_scaling", float, 1.0,
"Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
train_g.add_arg("incr_every_n_steps", int, 100, "Increases loss scaling every n consecutive.")
train_g.add_arg("decr_every_n_nan_or_inf", int, 2,
"Decreases loss scaling every n accumulated steps with nan or inf gradients.")
train_g.add_arg("incr_ratio", float, 2.0,
"The multiplier to use when increasing the loss scaling.")
train_g.add_arg("decr_ratio", float, 0.8,
"The less-than-one-multiplier to use when decreasing.")
train_g.add_arg("beta1", float, 0.9, "beta1 for adam")
train_g.add_arg("beta2", float, 0.98, "beta2 for adam.")
train_g.add_arg("epsilon", float, 1e-06, "epsilon for adam.")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("train_set", str, None, "Path to training data.")
data_g.add_arg("test_set", str, None, "Path to test data.")
data_g.add_arg("dev_set", str, None, "Path to validation data.")
data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("batch_size", int, 32, "Total examples' number in batch for training. see also --in_tokens.")
data_g.add_arg("in_tokens", bool, False,
"If set, the batch size will be the maximum number of tokens in one batch. "
"Otherwise, it will be the maximum number of examples in one batch.")
data_g.add_arg("do_lower_case", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("random_seed", int, 0, "Random seed.")
data_g.add_arg("num_labels", int, 2, "label number")
data_g.add_arg("max_query_length", int, 64, "Max query length.")
data_g.add_arg("max_answer_length", int, 100, "Max answer length.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("visualdl_log", bool, False, "If set, use visualdl_log on paddlecloud.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.")
run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("num_iteration_per_drop_scope", int, 10, "Iteration intervals to drop scope.")
run_type_g.add_arg("do_train", bool, False, "Whether to perform training.")
run_type_g.add_arg("do_val", bool, False, "Whether to perform evaluation on dev data set.")
run_type_g.add_arg("do_test", bool, False, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("do_pred", bool, False, "Whether to predict on test data set.")
run_type_g.add_arg("pred_save", str, "./output/predict/test", "Whether to predict on test data set.")
run_type_g.add_arg("use_multi_gpu_test", bool, False, "Whether to perform evaluation using multiple gpu cards")
run_type_g.add_arg("eval_mertrics", str, "simple_accuracy", "eval_mertrics")
# yapf: enable
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""args for retrieval task"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
from utils.args import ArgumentGroup
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("run_random", bool, False, "run model with random params")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
model_g.add_arg("save_checkpoints", bool, True, "Whether to save checkpoints")
model_g.add_arg("weight_sharing", bool, True, "If set, share weights between word embedding and masked lm.")
model_g.add_arg("unimo_vocab_file", str, './model_files/dict/unimo_en.vocab.txt', "unimo vocab")
model_g.add_arg("encoder_json_file", str, './model_files/dict/unimo_en.encoder.json', 'bpt map')
model_g.add_arg("vocab_bpe_file", str, './model_files/dict/unimo_en.vocab.bpe', "vocab bpe")
model_g.add_arg("unimo_config_path", str, "./model_files/config/unimo_base_en.json",
"The file to save unimo configuration.")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train.")
train_g.add_arg("learning_rate_scale", float, 0.1, "Learning rate decay scale.")
train_g.add_arg("lr_scheduler", str, "scale_by_epoch_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay', 'scale_by_epoch_decay'])
train_g.add_arg("learning_rate_decay_epoch1", int, 24, "Learning rate decay epoch1.")
train_g.add_arg("learning_rate_decay_epoch2", int, 32, "Learning rate decay epoch2.")
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("warmup_step", int, 1, "warmup_step, 1 for scale_by_epoch_decay, 0 for others")
train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
train_g.add_arg("nccl_comm_num", int, 1, "NCCL comm num.")
train_g.add_arg("hierarchical_allreduce_inter_nranks", int, 8, "Hierarchical allreduce inter ranks.")
train_g.add_arg("use_hierarchical_allreduce", bool, False, "Use hierarchical allreduce or not.")
train_g.add_arg("use_fp16", bool, False, "Whether to use fp16 mixed precision training.")
train_g.add_arg("use_dynamic_loss_scaling", bool, False, "Whether to use dynamic loss scaling.")
train_g.add_arg("use_sigmoid", bool, True, "Whether to use sigmoid before loss")
train_g.add_arg("init_loss_scaling", float, 1.0,
"Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
train_g.add_arg("incr_every_n_steps", int, 100, "Increases loss scaling every n consecutive.")
train_g.add_arg("decr_every_n_nan_or_inf", int, 2,
"Decreases loss scaling every n accumulated steps with nan or inf gradients.")
train_g.add_arg("incr_ratio", float, 2.0,
"The multiplier to use when increasing the loss scaling.")
train_g.add_arg("decr_ratio", float, 0.8,
"The less-than-one-multiplier to use when decreasing.")
train_g.add_arg("beta1", float, 0.9, "beta1 for adam")
train_g.add_arg("beta2", float, 0.98, "beta2 for adam.")
train_g.add_arg("epsilon", float, 1e-06, "epsilon for adam.")
train_g.add_arg("use_fuse", bool, False, "Whether to use fuse_allreduce_ops.")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
log_g.add_arg("eval_dir", str, "", "eval_dir to save tmp data")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("samples_num", int, 20, "neg sample num.")
data_g.add_arg("train_image_caption", str, None, "Path to training data.")
data_g.add_arg("train_image_feature_dir", str, None, "data dir to training data.")
data_g.add_arg("test_image_caption", str, None, "Path to test data.")
data_g.add_arg("test_image_feature_dir", str, None, "data dir to test data.")
data_g.add_arg("dev_image_caption", str, None, "Path to validation data.")
data_g.add_arg("dev_image_feature_dir", str, None, "data dir to validation data.")
data_g.add_arg("img_id_path", str, None, "img_id_path.")
data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("batch_size", int, 32, "Total examples' number in batch for training. see also --in_tokens.")
data_g.add_arg("test_batch_size", int, 24, "Total examples' number in batch for testing.")
data_g.add_arg("do_lower_case", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("random_seed", int, 0, "Random seed.")
data_g.add_arg("max_img_len", int, 37, "Image feature size==2048.")
data_g.add_arg("scale_circle", float, "1.0", "The scale factor in circle loss function, only use in circle loss mode")
data_g.add_arg("margin", float, "0.2", "The margin value in loss function")
data_g.add_arg("max_neg_cap_num", int, 0, "max_neg_cap_num")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.")
run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("num_iteration_per_drop_scope", int, 10, "Iteration intervals to drop scope.")
run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
run_type_g.add_arg("do_val", bool, True, "Whether to perform evaluation on dev data set.")
run_type_g.add_arg("do_test", bool, True, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("use_multi_gpu_test", bool, False, "Whether to perform evaluation using multiple gpu cards")
run_type_g.add_arg("eval_mertrics", str, "recall@k", "eval_mertrics")
# yapf: enable
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""args for seq2seq generation"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import argparse
from utils.args import ArgumentGroup
class CustomAction(argparse.Action):
    """custom action"""
    def __call__(self, parser, namespace, values, option_string=None):
        setattr(namespace, self.dest, " ".join(values))
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
model_g.add_arg("weight_sharing", bool, True, "If set, share weights between word embedding and masked lm.")
model_g.add_arg("unimo_vocab_file", str, './model_files/dict/unimo_en.vocab.txt', "unimo vocab")
model_g.add_arg("encoder_json_file", str, './model_files/dict/unimo_en.encoder.json', 'bpt map')
model_g.add_arg("vocab_bpe_file", str, './model_files/dict/unimo_en.vocab.bpe', "vocab bpe")
model_g.add_arg("unimo_config_path", str, "./model_files/config/unimo_base_en.json",
"The file to save unimo configuration.")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 50, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 4e-5, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("warmup_proportion", float, 0.02,
"Proportion of training steps to perform linear learning rate warmup for.")
train_g.add_arg("save_steps", int, 100000, "The steps interval to save checkpoints.")
train_g.add_arg("validation_steps", int, 100000, "The steps interval to evaluate model performance.")
train_g.add_arg("use_fuse", bool, False, "Whether to use fuse_allreduce_ops.")
train_g.add_arg("nccl_comm_num", int, 1, "NCCL comm num.")
train_g.add_arg("hierarchical_allreduce_inter_nranks", int, 8, "Hierarchical allreduce inter ranks.")
train_g.add_arg("use_hierarchical_allreduce", bool, False, "Use hierarchical allreduce or not.")
train_g.add_arg("use_fp16", bool, False, "Whether to use fp16 mixed precision training.")
train_g.add_arg("use_dynamic_loss_scaling", bool, False, "Whether to use dynamic loss scaling.")
train_g.add_arg("init_loss_scaling", float, 128.0,
"Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.")
train_g.add_arg("incr_every_n_steps", int, 100, "Increases loss scaling every n consecutive.")
train_g.add_arg("decr_every_n_nan_or_inf", int, 2,
"Decreases loss scaling every n accumulated steps with nan or inf gradients.")
train_g.add_arg("incr_ratio", float, 2.0,
"The multiplier to use when increasing the loss scaling.")
train_g.add_arg("decr_ratio", float, 0.8,
"The less-than-one-multiplier to use when decreasing.")
train_g.add_arg("beta1", float, 0.9, "beta1 for adam")
train_g.add_arg("beta2", float, 0.98, "beta2 for adam.")
train_g.add_arg("epsilon", float, 1e-06, "epsilon for adam.")
train_g.add_arg("tgt_type_id", int, 1, "for seq2seq task.")
train_g.add_arg("do_decode", bool, False, "for seq2seq task.")
train_g.add_arg("label_smooth", float, 0.1, "label smooth")
train_g.add_arg("hidden_dropout_prob", float, 0.1, "hidden_dropout_prob")
train_g.add_arg("attention_probs_dropout_prob", float, 0.1, "attention_probs_dropout_prob")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 100, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, True, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("task_type", str, "normal", "is task type")
data_g.add_arg("train_set", str, None, "Path to training data.")
data_g.add_arg("test_set", str, None, "Path to test data.")
data_g.add_arg("dev_set", str, None, "Path to validation data.")
data_g.add_arg("pred_set", str, None, "Path to pred data.")
data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("max_tgt_len", int, 512, "for seq2seq task.")
data_g.add_arg("max_src_len", int, 512, "for seq2seq task.")
data_g.add_arg("max_out_len", int, 512, "for seq2seq task.")
data_g.add_arg("min_out_len", int, 20, "for seq2seq task.")
data_g.add_arg("block_trigram", bool, True, "utilize trigram blocking during beam search")
data_g.add_arg("beam_size", int, 5, "for seq2seq task.")
data_g.add_arg("batch_size", int, 32, "Total examples' number in batch for training. see also --in_tokens.")
data_g.add_arg("pred_batch_size", int, 0, "Total examples' number in batch for training. see also --in_tokens.")
data_g.add_arg("in_tokens", bool, False,
"If set, the batch size will be the maximum number of tokens in one batch. "
"Otherwise, it will be the maximum number of examples in one batch.")
data_g.add_arg("do_lower_case", bool, True,
"Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("tokenized_input", bool, True, "input is tokenized")
data_g.add_arg("length_penalty", float, 0.6, "length_penalty")
data_g.add_arg("continuous_position", bool, False, "position is continuous")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("visualdl_log", bool, False, "If set, use visualdl_log on paddlecloud.")
run_type_g.add_arg("is_distributed", bool, True, "If set, then start distributed training.")
run_type_g.add_arg("use_fast_executor", bool, True, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("num_iteration_per_drop_scope", int, 1, "Iteration intervals to drop scope.")
run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
run_type_g.add_arg("do_val", bool, True, "Whether to perform evaluation on dev data set.")
run_type_g.add_arg("do_test", bool, True, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("do_pred", bool, True, "Whether to perform evaluation on pred data set.")
run_type_g.add_arg("use_multi_gpu_test", bool, True, "Whether to perform evaluation using multiple gpu cards")
run_type_g.add_arg("save_and_valid_by_epoch", bool, False, "save_and_valid_by_epoch")
run_type_g.add_arg("eval_script", action=CustomAction, type=str, nargs='+', help="eval_script", default=None)
run_type_g.add_arg("eval_mertrics", str, "", "eval_mertrics")
run_type_g.add_arg("random_seed", int, 0, "Random seed.")
dialo_g = ArgumentGroup(parser, "dialogue", "for dialogue task.")
dialo_g.add_arg("role_type_size", int, 2, "role type size")
dialo_g.add_arg("turn_type_size", int, 16, "turn type size")
# yapf: enable
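# A hypothetical invocation wiring a few of the flags above together
# (the script name and data paths are placeholders, not part of this diff):
#
#     python ./run_classifier.py --use_cuda true --use_fp16 false \
#         --batch_size 32 --max_seq_len 512 --train_set ./data/train.tsv \
#         --eval_mertrics simple_accuracy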
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""utils help and eval functions for text generation."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import json
import math
import subprocess
from model.tokenization import GptBpeTokenizer, BasicTokenizer
class GenerationEval(object):
"""GenerationEval"""
def __init__(self, args):
self.eval_script = args.eval_script.split(" ") if args.eval_script else None
self.eval_mertrics = args.eval_mertrics.split(",") if args.eval_mertrics else []
self.basic_tokenizer = BasicTokenizer(do_lower_case=True)
self.roberta_tokenizer = GptBpeTokenizer(vocab_file=args.unimo_vocab_file,
encoder_json_file=args.encoder_json_file,
vocab_bpe_file=args.vocab_bpe_file,
do_lower_case=True)
def eval(self, output_file, phase="", features=None):
"""run eval"""
eval_res = {}
if self.eval_script:
eval_res = subprocess.check_output(self.eval_script + [output_file, phase])
eval_res = json.loads(eval_res)
else:
preds = []
for line in open(output_file):
preds.append(self.basic_tokenizer.tokenize(line.strip()))
refs = []
for id in sorted(features.keys()):
ref_str = self.roberta_tokenizer.gptbpe_tokenizer.decode(features[id].tgt.split(" "))
refs.append([self.basic_tokenizer.tokenize(ref_str)])
for mertric in self.eval_mertrics:
eval_func = getattr(self, mertric, None)
if eval_func:
eval_res[mertric] = eval_func(refs, preds)
ret = []
for mertric in self.eval_mertrics:
mertric_res = eval_res.get(mertric, None)
if mertric_res is None:
raise Exception("Eval mertric: %s is not supported" % mertric)
ret.append("%s: %f" % (mertric, mertric_res))
return ", ".join(ret)
def bleu(self, refs, preds):
"""bleu mertric"""
return _compute_bleu(refs, preds, max_order=4)[0]
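# A minimal usage sketch (illustrative only; the `args` object, `features`
# dict, and file path are assumptions, not defined in this module):
#
#     evaluator = GenerationEval(args)      # e.g. args.eval_mertrics == "bleu"
#     print(evaluator.eval("predictions.txt", phase="dev", features=features))
#     # -> "bleu: 0.xxxxxx" when no external eval_script is configured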
def _get_ngrams(segment, max_order):
"""Count all n-grams in `segment` up to length `max_order`."""
ngram_counts = collections.Counter()
for order in range(1, max_order + 1):
for i in range(0, len(segment) - order + 1):
ngram = tuple(segment[i: i + order])
ngram_counts[ngram] += 1
return ngram_counts
def _compute_bleu(reference_corpus, translation_corpus, max_order=4, smooth=False):
"""Corpus-level BLEU; returns [bleu, precisions, bp, ratio, translation_length, reference_length]."""
matches_by_order = [0] * max_order
possible_matches_by_order = [0] * max_order
reference_length = 0
translation_length = 0
for (references, translation) in zip(reference_corpus, translation_corpus):
reference_length += min(len(r) for r in references)
translation_length += len(translation)
merged_ref_ngram_counts = collections.Counter()
for reference in references:
merged_ref_ngram_counts |= _get_ngrams(reference, max_order)
translation_ngram_counts = _get_ngrams(translation, max_order)
overlap = translation_ngram_counts & merged_ref_ngram_counts
for ngram in overlap:
matches_by_order[len(ngram) - 1] += overlap[ngram]
for order in range(1, max_order + 1):
possible_matches = len(translation) - order + 1
if possible_matches > 0:
possible_matches_by_order[order - 1] += possible_matches
precisions = [0] * max_order
for i in range(0, max_order):
if smooth:
precisions[i] = ((matches_by_order[i] + 1.) /
(possible_matches_by_order[i] + 1.))
else:
if possible_matches_by_order[i] > 0:
precisions[i] = (float(matches_by_order[i]) /
possible_matches_by_order[i])
else:
precisions[i] = 0.0
if min(precisions) > 0:
p_log_sum = sum((1. / max_order) * math.log(p) for p in precisions)
geo_mean = math.exp(p_log_sum)
else:
geo_mean = 0
ratio = float(translation_length) / reference_length
if ratio > 1.0:
bp = 1.
else:
bp = math.exp(1 - 1. / (ratio + 1e-4))
bleu = geo_mean * bp
ret = [bleu, precisions, bp, ratio, translation_length, reference_length]
return ret
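# Illustrative check of the helpers above (toy corpus, nothing else assumed):
#
#     refs = [[["the", "cat", "sat", "on", "the", "mat"]]]
#     hyps = [["the", "cat", "sat", "on", "the", "mat"]]
#     bleu = _compute_bleu(refs, hyps)[0]
#     assert abs(bleu - 1.0) < 1e-3   # perfect match; the small slack comes
#                                     # from the 1e-4 guard in the brevity penalty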
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""ultis help and eval functions for glue ."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
from scipy.stats import pearsonr, spearmanr
from six.moves import xrange
from collections import OrderedDict
def matthews_corrcoef(preds, labels):
"""matthews_corrcoef"""
preds = np.array(preds)
labels = np.array(labels)
tp = np.sum((labels == 1) & (preds == 1))
tn = np.sum((labels == 0) & (preds == 0))
fp = np.sum((labels == 0) & (preds == 1))
fn = np.sum((labels == 1) & (preds == 0))
denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = ((tp * tn) - (fp * fn)) / denom if denom > 0 else 0.0
ret = OrderedDict()
ret['mat_cor'] = mcc
ret['key_eval'] = "mat_cor"
return ret
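# Illustrative check (numpy only):
#
#     preds, labels = [1, 1, 0, 0], [1, 0, 0, 1]
#     # tp=1, tn=1, fp=1, fn=1 -> mcc = (1*1 - 1*1) / sqrt(2*2*2*2) = 0.0
#     matthews_corrcoef(preds, labels)['mat_cor']   # -> 0.0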
def f1_score(preds, labels):
"""f1_score"""
preds = np.array(preds)
labels = np.array(labels)
tp = np.sum((labels == 1) & (preds == 1))
tn = np.sum((labels == 0) & (preds == 0))
fp = np.sum((labels == 0) & (preds == 1))
fn = np.sum((labels == 1) & (preds == 0))
p = tp / (tp + fp + 1e-8)
r = tp / (tp + fn + 1e-8)
f1 = (2 * p * r) / (p + r + 1e-8)
ret = OrderedDict()
ret['f1'] = f1
ret['key_eval'] = "f1"
return ret
def pearson_and_spearman(preds, labels):
"""pearson_and_spearman"""
preds = np.array(preds)
labels = np.array(labels)
pearson_corr = pearsonr(preds, labels)[0]
spearman_corr = spearmanr(preds, labels)[0]
ret = OrderedDict()
ret['pearson'] = pearson_corr
ret['spearmanr'] = spearman_corr
ret['p_and_sp'] = (pearson_corr + spearman_corr) / 2
ret['key_eval'] = "p_and_sp"
return ret
def acc_and_f1(preds, labels):
"""acc_and_f1"""
preds = np.array(preds)
labels = np.array(labels)
acc = simple_accuracy(preds, labels)['acc']
f1 = f1_score(preds, labels)['f1']
ret = OrderedDict()
ret['acc'] = acc
ret['f1'] = f1
ret['acc_and_f1'] = (acc + f1) / 2
ret['key_eval'] = "acc_and_f1"
return ret
def simple_accuracy(preds, labels):
"""simple_accuracy"""
preds = np.array(preds)
labels = np.array(labels)
acc = (preds == labels).mean()
ret = OrderedDict()
ret['acc'] = acc
ret['key_eval'] = "acc"
return ret
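# Every metric above returns an OrderedDict whose 'key_eval' entry names the
# headline number, so callers can stay metric-agnostic; for example:
#
#     ret = simple_accuracy([1, 0, 1], [1, 1, 1])
#     ret[ret['key_eval']]   # -> 0.666..., since 'key_eval' is 'acc'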
def evaluate_mrr(preds):
"""evaluate_mrr"""
last_qid = None
total_mrr = 0.0
qnum = 0.0
rank = 0.0
correct = False
for qid, score, label in preds:
if qid != last_qid:
rank = 0.0
qnum += 1
correct = False
last_qid = qid
rank += 1
if not correct and label != 0:
total_mrr += 1.0 / rank
correct = True
return total_mrr / qnum
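# evaluate_mrr expects (qid, score, label) triples grouped by qid and sorted
# by descending score within each query; a hypothetical sketch:
#
#     preds = [("q1", 0.9, 0), ("q1", 0.7, 1),   # first relevant hit at rank 2
#              ("q2", 0.8, 1)]                   # first relevant hit at rank 1
#     evaluate_mrr(preds)   # -> (1/2 + 1/1) / 2 = 0.75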
def evaluate_map(preds):
"""evaluate_map"""
def singe_map(st, en):
"""singe_map"""
total_p = 0.0
correct_num = 0.0
for index in xrange(st, en):
if int(preds[index][2]) != 0:
correct_num += 1
total_p += correct_num / (index - st + 1)
if int(correct_num) == 0:
return 0.0
return total_p / correct_num
last_qid = None
total_map = 0.0
qnum = 0.0
st = 0
for i in xrange(len(preds)):
qid = preds[i][0]
if qid != last_qid:
qnum += 1
if last_qid is not None:
total_map += singe_map(st, i)
st = i
last_qid = qid
total_map += singe_map(st, len(preds))
return total_map / qnum
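# evaluate_map consumes the same (qid, score, label) layout; with the toy
# `preds` sketched above, q1 contributes AP = 1/2 and q2 contributes AP = 1,
# so evaluate_map(preds) -> 0.75.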
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""ultis help and eval functions for image/text retrieval."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
from collections import OrderedDict
def recall_at_k(score_matrix, text2img, img2texts):
"""recall@k"""
assert score_matrix.shape[0] == len(text2img) * len(img2texts)
cur_img, cur_cap = score_matrix[:, 1], score_matrix[:, 2]
img_len, cap_len = len(np.unique(cur_img)), len(np.unique(cur_cap))
cur_img_sort = np.reshape(np.argsort(cur_img), [-1, cap_len])
cur_cap_sort = np.reshape(np.argsort(cur_cap), [-1, img_len])
i2c = np.take(score_matrix, cur_img_sort, axis=0) # img_len x cap_len x 3
c2i = np.take(score_matrix, cur_cap_sort, axis=0) # cap_len x img_len x 3
def get_recall_k(scores, idx, label_dict):
"""
scores: sample x len x 3
idx: 1 means text retrieval(i2c), 2 means image retrieval(c2i)
"""
cand_idx_dict = {1: 2, 2: 1}
cand_idx = cand_idx_dict[idx]
tot = scores.shape[0]
r1, r5, r10, rank_tot = 0, 0, 0, 0
for i in range(tot):
score_mat = scores[i]
cur_ids = score_mat[0][idx]
ans_ids = label_dict[cur_ids] # a list of caption ids when idx is 1; a single image id when idx is 2
score = score_mat[:, 0]
score_sort = np.argsort(score)[::-1]
cand_ans = np.take(score_mat[:, cand_idx], score_sort, axis=0)
cand_ans = cand_ans.astype(np.int64)
if isinstance(ans_ids, list):
rank = min([np.where(cand_ans == ans)[0] for ans in ans_ids])
elif isinstance(ans_ids, int):
rank = np.where(cand_ans == ans_ids)[0]
else:
raise ValueError('type error')
if rank < 1:
r1 += 1.0
if rank < 5:
r5 += 1.0
if rank < 10:
r10 += 1.0
rank_tot += (rank + 1)
ret = {
'recall@1': float(r1)/tot,
'recall@5': float(r5)/tot,
'recall@10': float(r10)/tot,
'avg_rank': float(rank_tot)/tot
}
return ret
cap_retrieval_recall = get_recall_k(i2c, 1, img2texts)
img_retrieval_recall = get_recall_k(c2i, 2, text2img)
ret = OrderedDict()
ret['img_avg_rank'] = img_retrieval_recall['avg_rank']
ret['cap_avg_rank'] = cap_retrieval_recall['avg_rank']
ret['img_recall@1'] = img_retrieval_recall['recall@1']
ret['img_recall@5'] = img_retrieval_recall['recall@5']
ret['img_recall@10'] = img_retrieval_recall['recall@10']
ret['cap_recall@1'] = cap_retrieval_recall['recall@1']
ret['cap_recall@5'] = cap_retrieval_recall['recall@5']
ret['cap_recall@10'] = cap_retrieval_recall['recall@10']
ret['avg_img_recall'] = (img_retrieval_recall['recall@1'] + \
img_retrieval_recall['recall@5'] + img_retrieval_recall['recall@10']) /3
ret['avg_cap_recall'] = (cap_retrieval_recall['recall@1'] + \
cap_retrieval_recall['recall@5'] + cap_retrieval_recall['recall@10']) /3
ret['avg_recall@1'] = (img_retrieval_recall['recall@1'] + cap_retrieval_recall['recall@1']) /2
ret['avg_recall@5'] = (img_retrieval_recall['recall@5'] + cap_retrieval_recall['recall@5']) /2
ret['avg_recall@10'] = (img_retrieval_recall['recall@10'] + cap_retrieval_recall['recall@10']) /2
ret['key_eval'] = "avg_recall@1"
return ret
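# A sketch of the expected input layout (all ids hypothetical):
#
#     # each row of score_matrix is [similarity, image_id, caption_id], one row
#     # per (image, caption) pair, hence the shape assertion above
#     score_matrix = np.array([[0.9, 0, 0], [0.1, 0, 1],
#                              [0.2, 1, 0], [0.8, 1, 1]])
#     text2img = {0: 0, 1: 1}        # caption_id -> image_id
#     img2texts = {0: [0], 1: [1]}   # image_id -> list of caption_ids
#     recall_at_k(score_matrix, text2img, img2texts)['avg_recall@1']   # -> 1.0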
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Model for classifier."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import time
import numpy as np
from six.moves import xrange
import paddle.fluid as fluid
from model.unimo_finetune import UNIMOModel
from eval import glue_eval
from collections import OrderedDict
from utils.utils import print_eval_log
def create_model(args, pyreader_name, config):
"""create_model"""
stype = 'int64'
pyreader = fluid.layers.py_reader(
capacity=50,
shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1], [-1, args.max_seq_len, args.max_seq_len], [-1, 1],
[-1, 1]],
dtypes=[stype, stype, stype, 'float32', stype, stype],
lod_levels=[0, 0, 0, 0, 0, 0],
name=pyreader_name,
use_double_buffer=True)
(src_ids, sent_ids, pos_ids, input_mask, labels,
qids) = fluid.layers.read_file(pyreader)
emb_ids = {"word_embedding": src_ids, "sent_embedding": sent_ids, "pos_embedding": pos_ids}
model = UNIMOModel(
emb_ids=emb_ids,
input_mask=input_mask,
config=config)
cls_feats = model.get_pooled_text_output()
cls_feats = fluid.layers.dropout(
x=cls_feats,
dropout_prob=0.1,
dropout_implementation="upscale_in_train")
cls_params_name = ["cls_out_%d_w" % args.num_labels, "cls_out_%d_b" % args.num_labels]
logits = fluid.layers.fc(
input=cls_feats,
size=args.num_labels,
param_attr=fluid.ParamAttr(
name=cls_params_name[0],
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name=cls_params_name[1], initializer=fluid.initializer.Constant(0.)))
ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=labels, return_softmax=True)
loss = fluid.layers.mean(x=ce_loss)
num_seqs = fluid.layers.create_tensor(dtype='int64')
accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs)
graph_vars = {
"loss": loss,
"probs": probs,
"accuracy": accuracy,
"labels": labels,
"num_seqs": num_seqs,
"qids": qids
}
return pyreader, graph_vars
def predict(exe, test_program, test_pyreader, graph_vars, dev_count=1):
"""predict"""
qids, scores, probs, preds = [], [], [], []
fetch_list = [graph_vars["probs"].name, graph_vars["qids"].name]
test_pyreader.start()
while True:
try:
if dev_count == 1:
np_probs, np_qids = exe.run(program=test_program, fetch_list=fetch_list)
else:
np_probs, np_qids = exe.run(fetch_list=fetch_list)
qids.extend(np_qids.reshape(-1).tolist())
np_preds = np.argmax(np_probs, axis=1).astype(np.float32)
preds.extend(np_preds)
probs.append(np_probs)
except fluid.core.EOFException:
test_pyreader.reset()
break
probs = np.concatenate(probs, axis=0).reshape([len(qids), -1])
return qids, preds, probs
def evaluate(args, exe, test_program, test_pyreader, graph_vars, eval_phase):
"""evaluate"""
total_cost, total_num_seqs = 0.0, 0.0
qids, labels, scores, preds = [], [], [], []
time_begin = time.time()
fetch_list = [
graph_vars["loss"].name,
graph_vars["probs"].name, graph_vars["labels"].name,
graph_vars["num_seqs"].name, graph_vars["qids"].name
]
test_pyreader.start()
while True:
try:
np_loss, np_probs, np_labels, np_num_seqs, np_qids = exe.run(
program=test_program, fetch_list=fetch_list) \
if not args.use_multi_gpu_test else exe.run(fetch_list=fetch_list)
total_cost += np.sum(np_loss * np_num_seqs)
total_num_seqs += np.sum(np_num_seqs)
labels.extend(np_labels.reshape((-1)).tolist())
if np_qids is not None:
qids.extend(np_qids.reshape(-1).tolist())
scores.extend(np_probs[:, 1].reshape(-1).tolist())
np_preds = list(np.argmax(np_probs, axis=1).astype(np.float32))
preds.extend([float(val) for val in np_preds])
except fluid.core.EOFException:
test_pyreader.reset()
break
time_end = time.time()
ret = OrderedDict()
ret['phase'] = eval_phase
ret['loss'] = round(total_cost / total_num_seqs, 4)
ret['data_num'] = total_num_seqs
ret['used_time'] = round(time_end - time_begin, 4)
metrics = OrderedDict()
metrics["acc_and_f1"] = glue_eval.acc_and_f1
metrics["simple_accuracy"] = glue_eval.simple_accuracy
metrics["matthews_corrcoef"] = glue_eval.matthews_corrcoef
if args.eval_mertrics in metrics:
ret_metric = metrics[args.eval_mertrics](preds, labels)
ret.update(ret_metric)
print_eval_log(ret)
else:
raise ValueError('unsupported metric {}'.format(args.eval_mertrics))
return ret
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""image-to-text generation"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import re
import time
import numpy as np
import glob
import json
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from model.unimo_finetune import UNIMOModel
from eval.gen_eval import GenerationEval
from finetune.trigram_blocking import TrigramBlocking
import codecs
class Img2Txt(object):
"""image-to-text"""
def __init__(self, args, vl_config, tokenizer):
self.vl_config = vl_config
self.weight_sharing = args.weight_sharing
self.max_seq_len = args.max_seq_len
self.max_img_len = args.max_img_len
self.max_obj_len = args.max_obj_len
self.label_smooth = args.label_smooth
self.tgt_type_id = args.tgt_type_id
self.tokenizer = tokenizer
self.vocab_size = vl_config["vocab_size"]
self.args = args
self.adv_kl_weight = args.adv_kl_weight
self.with_pure_model = args.with_pure_model
self._emb_dtype = "float32"
# for beam_search decoding
self.do_decode = args.do_decode
self.length_penalty = args.length_penalty
self.max_out_len = args.max_out_len
self.min_out_len = args.min_out_len
self.block_trigram = args.block_trigram
self.beam_size = args.beam_size
self.bos_id = tokenizer.cls_token_id
self.eos_id = tokenizer.sep_token_id
self.evaluator = GenerationEval(args)
self.emb_keys = ["word_embedding", "sent_embedding", "pos_embedding"]
self.img_keys = ["image_embedding", "loc_embedding"]
self.task_type = "img2txt"
def _kl_divergence_with_logits(self, q_logits, p_logits):
"""
symmetric KL-divergence (See SMART, Sec 3.1)
q_logits: logits
p_logits: delta_logits
"""
q = fluid.layers.softmax(input=q_logits)
p = fluid.layers.softmax(input=p_logits)
kl_qp = fluid.layers.reduce_sum(q * (fluid.layers.log(q) - fluid.layers.log(p)), -1)
kl_pq = fluid.layers.reduce_sum(p * (fluid.layers.log(p) - fluid.layers.log(q)), -1)
vat_loss = fluid.layers.mean(x=kl_qp + kl_pq)
return vat_loss
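# In math terms the loss above is the symmetrized KL of SMART (Sec 3.1):
#     L = mean( KL(q || p) + KL(p || q) ),  q = softmax(q_logits),
#                                           p = softmax(p_logits)
# so identical logits give q == p and a loss of exactly 0.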
def cal_logit(self, enc_out, tgt_pos):
"""calculate logit"""
enc_out = fluid.layers.reshape(x=enc_out,
shape=[-1, self.vl_config["hidden_size"]])
if tgt_pos:
tgt_pos = fluid.layers.cast(x=tgt_pos, dtype='int32')
tgt_feat = fluid.layers.gather(input=enc_out, index=tgt_pos)
else:
tgt_feat = enc_out
tgt_trans_feat = fluid.layers.fc(
input=tgt_feat,
size=self.vl_config["emb_size"] or self.vl_config["hidden_size"],
act=self.vl_config["hidden_act"],
param_attr=fluid.ParamAttr(
name="mask_lm_trans_fc.w_0",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name="mask_lm_trans_fc.b_0",
initializer=fluid.initializer.Constant(0.)))
tgt_trans_feat = fluid.layers.layer_norm(
tgt_trans_feat,
begin_norm_axis=len(tgt_trans_feat.shape) - 1,
param_attr=fluid.ParamAttr(
name='mask_lm_trans_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name='mask_lm_trans_layer_norm_bias',
initializer=fluid.initializer.Constant(1.)))
seq2seq_out_bias_attr = fluid.ParamAttr(
name="mask_lm_out_fc.b_0",
initializer=fluid.initializer.Constant(value=0.0))
if self.weight_sharing:
fc_out = fluid.layers.matmul(
x=tgt_trans_feat,
y=fluid.default_main_program().global_block().var(
"word_embedding"),
transpose_y=True)
fc_out += fluid.layers.create_parameter(
shape=[self.vl_config['vocab_size']],
dtype="float32",
attr=seq2seq_out_bias_attr,
is_bias=True)
else:
out_size = self.vl_config["tgt_vocab_size"] or self.vl_config['vocab_size']
fc_out = fluid.layers.fc(input=tgt_trans_feat,
size=out_size,
param_attr=fluid.ParamAttr(
name="mask_lm_out_fc.w_0",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=seq2seq_out_bias_attr)
return fc_out
def to_tensor(self, shapes, dtypes, lod_levels):
"""to tensor"""
return [fluid.layers.data(name="placeholder_" + str(i), shape=shapes[i], dtype=dtypes[i],
lod_level=lod_levels[i]) for i in range(len(shapes))]
def create_model_freelb_text(self):
"""create text freelb model"""
img_input_shapes = [[-1, self.max_img_len, self.vl_config["image_embedding_size"]], # image_embedding
[-1, self.max_img_len, 5], # image_loc
[-1, self.max_seq_len + self.max_obj_len + self.max_img_len,
self.max_seq_len + self.max_obj_len + self.max_img_len], # input_mask
[-1, self.max_img_len, 1], # v_mask
[-1, self.max_seq_len, 1], # t_mask
[-1, self.max_obj_len, 1], # padded_obj_token_id
[-1, self.max_obj_len, 1], # padded_obj_sent_ids
[-1, self.max_obj_len, 1]] # padded_obj_pos_ids
img_input_dtypes = ['float32', 'float32', 'float32', 'float32', 'float32', 'int64', 'int64', 'int64']
img_input_lod_levels = [0, 0, 0, 0, 0, 0, 0, 0]
emb_num = 3
text_input_shapes = [[-1, self.max_seq_len, 1]] * emb_num + [[-1, 1], [-1, 1]]
text_input_dtypes = ['int64'] * emb_num + ['int64', 'int64']
text_input_lod_levels = [0] * emb_num + [0, 0]
shapes = img_input_shapes + text_input_shapes
dtypes = img_input_dtypes + text_input_dtypes
lod_levels = img_input_lod_levels + text_input_lod_levels
inputs = self.to_tensor(shapes, dtypes, lod_levels)
pyreader = fluid.io.DataLoader.from_generator(feed_list=inputs, capacity=70, iterable=False)
img_embs = {}
emb_obj_ids = {}
emb_ids = {}
img_embs["image_embedding"], img_embs["loc_embedding"], input_mask, v_mask, t_mask, \
emb_obj_ids["word_embedding"], emb_obj_ids["sent_embedding"], emb_obj_ids["pos_embedding"], \
emb_ids["word_embedding"], emb_ids["sent_embedding"], emb_ids["pos_embedding"], \
tgt_labels, tgt_pos = inputs
tot_loss = None
adv_step = self.args.adv_step
adv_lr = self.args.adv_lr
norm_type = self.args.norm_type
adv_max_norm = self.args.adv_max_norm
adv_init_mag = self.args.adv_init_mag
# shape(embedded) = (batch_size, num_timesteps, embedding_dim)
_emb_shape = [-1, self.max_seq_len, self.vl_config['hidden_size']]
_fake = fluid.layers.data(name="_fake", shape=_emb_shape, dtype='float32')
t_mask_slice = fluid.layers.slice(t_mask, axes=[1], starts=[0], ends=fluid.layers.shape(t_mask)[1])
t_length = fluid.layers.reduce_sum(t_mask_slice, dim=1, keep_dim=True) * self.vl_config['hidden_size']
if adv_init_mag > 0:
if norm_type == 'l2':
delta = fluid.layers.uniform_random_batch_size_like(t_mask, shape=_fake.shape, min=-1.0, max=1.0)
delta = fluid.layers.slice(delta, axes=[1], starts=[0],
ends=fluid.layers.shape(emb_ids["word_embedding"])[1])
delta = delta * t_mask_slice
mag = adv_init_mag / fluid.layers.sqrt(t_length)
delta = delta * mag
elif norm_type == 'inf':
delta = fluid.layers.uniform_random_batch_size_like(t_mask, shape=_fake.shape, min=-adv_init_mag,
max=adv_init_mag)
delta = fluid.layers.slice(delta, axes=[1], starts=[0],
ends=fluid.layers.shape(emb_ids["word_embedding"])[1])
delta = delta * t_mask_slice
else:
raise ValueError("Unsupported norm type: %s" % norm_type)
else:
delta = fluid.layers.uniform_random_batch_size_like(t_mask, shape=_fake.shape, min=0.0, max=0.0)
delta = fluid.layers.slice(delta, axes=[1], starts=[0],
ends=fluid.layers.shape(emb_ids["word_embedding"])[1])
delta = delta * t_mask_slice
for _ in range(adv_step):
if self.with_pure_model:
gene_pure = UNIMOModel(
emb_ids=emb_ids,
emb_obj_ids=emb_obj_ids,
input_mask=input_mask,
config=self.vl_config,
image_input=img_embs,
weight_sharing=self.weight_sharing,
task_type=self.task_type
)
pure_enc_out = gene_pure.get_sequence_output()
pure_fc_out = self.cal_logit(pure_enc_out, tgt_pos)
if self.label_smooth:
out_size = self.vl_config['vocab_size']
labels = fluid.layers.label_smooth(
label=fluid.layers.one_hot(
input=tgt_labels, depth=out_size),
epsilon=self.label_smooth)
pure_ce_loss = layers.softmax_with_cross_entropy(
logits=pure_fc_out, label=labels, soft_label=True)
else:
pure_ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=pure_fc_out, label=tgt_labels, return_softmax=True)
pure_loss = fluid.layers.mean(x=pure_ce_loss)
pure_loss = pure_loss / adv_step
gene = UNIMOModel(
text_adv_delta=delta,
emb_ids=emb_ids,
emb_obj_ids=emb_obj_ids,
input_mask=input_mask,
config=self.vl_config,
image_input=img_embs,
weight_sharing=self.weight_sharing,
task_type=self.task_type
)
enc_out = gene.get_sequence_output()
fc_out = self.cal_logit(enc_out, tgt_pos)
if self.label_smooth:
out_size = self.vl_config['vocab_size']
labels = fluid.layers.label_smooth(
label=fluid.layers.one_hot(
input=tgt_labels, depth=out_size),
epsilon=self.label_smooth)
ce_loss = layers.softmax_with_cross_entropy(
logits=fc_out, label=labels, soft_label=True)
else:
ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=fc_out, label=tgt_labels, return_softmax=True)
loss = fluid.layers.mean(x=ce_loss)
loss = loss / adv_step
delta_grad = fluid.backward.gradients(loss, delta)
delta_grad = delta_grad[0]
if norm_type == 'l2':
# update according to grads
# whether to use scale_l2
delta_norm = fluid.layers.sqrt(
fluid.layers.reduce_sum(
fluid.layers.pow(fluid.layers.reshape(delta_grad, [fluid.layers.shape(delta_grad)[0], -1]),
factor=2), dim=1, keep_dim=True))
_min = float(1e-8)
delta_norm = fluid.layers.clamp(delta_norm, min=_min)
delta = delta + adv_lr * delta_grad / delta_norm
# projection
if adv_max_norm > 0:
exceed_mask = (delta_norm > adv_max_norm).astype('float32')
reweights = (adv_max_norm / delta_norm) * exceed_mask + (1 - exceed_mask)
delta = delta * reweights
elif norm_type == 'inf':
delta_norm = fluid.layers.reduce_max(
fluid.layers.abs(fluid.layers.reshape(delta_grad, [fluid.layers.shape(delta_grad)[0], -1])), \
dim=1, keep_dim=True)
_min = float(1e-8)
delta_norm = fluid.layers.clamp(delta_norm, min=_min)
delta = delta + adv_lr * delta_grad / delta_norm
# projection
if adv_max_norm > 0:
delta = fluid.layers.clamp(delta, min=-adv_max_norm, max=adv_max_norm)
else:
raise ValueError("Unsupported norm type: %s" % norm_type)
delta_grad.stop_gradient = True
if self.with_pure_model:
kl_adv_text_loss = self._kl_divergence_with_logits(pure_fc_out, fc_out)
cur_loss = pure_loss + loss + self.adv_kl_weight * kl_adv_text_loss
else:
cur_loss = loss
tot_loss = cur_loss if tot_loss is None else tot_loss + cur_loss
graph_vars = {"loss": tot_loss}
for k, v in graph_vars.items():
v.persistable = True
return pyreader, graph_vars
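# The loop above follows the FreeLB recipe on the text embeddings: initialize
# a perturbation `delta` inside the chosen norm ball, and for each of
# `adv_step` inner iterations (i) run the model on the perturbed input,
# (ii) accumulate the (1 / adv_step)-scaled loss, and (iii) ascend `delta`
# along its gradient with step size `adv_lr`, projecting back into the l2- or
# linf-ball of radius `adv_max_norm`. When `with_pure_model` is set, a clean
# forward pass is added and tied to the adversarial one via the symmetric KL
# term, weighted by `adv_kl_weight`.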
def create_model_freelb_image(self):
"""create image freelb model"""
img_input_shapes = [[-1, self.max_img_len, self.vl_config["image_embedding_size"]], # image_embedding
[-1, self.max_img_len, 5], # image_loc
[-1, self.max_seq_len + self.max_obj_len + self.max_img_len,
self.max_seq_len + self.max_obj_len + self.max_img_len], # input_mask
[-1, self.max_img_len, 1], # v_mask
[-1, self.max_seq_len, 1], # t_mask
[-1, self.max_obj_len, 1], # padded_obj_token_id
[-1, self.max_obj_len, 1], # padded_obj_sent_ids
[-1, self.max_obj_len, 1]] # padded_obj_pos_ids
img_input_dtypes = ['float32', 'float32', 'float32', 'float32', 'float32', 'int64', 'int64', 'int64']
img_input_lod_levels = [0, 0, 0, 0, 0, 0, 0, 0]
emb_num = 3
text_input_shapes = [[-1, self.max_seq_len, 1]] * emb_num + [[-1, 1], [-1, 1]]
text_input_dtypes = ['int64'] * emb_num + ['int64', 'int64']
text_input_lod_levels = [0] * emb_num + [0, 0]
shapes = img_input_shapes + text_input_shapes
dtypes = img_input_dtypes + text_input_dtypes
lod_levels = img_input_lod_levels + text_input_lod_levels
inputs = self.to_tensor(shapes, dtypes, lod_levels)
pyreader = fluid.io.DataLoader.from_generator(feed_list=inputs, capacity=70, iterable=False)
img_embs = {}
emb_obj_ids = {}
emb_ids = {}
img_embs["image_embedding"], img_embs["loc_embedding"], input_mask, v_mask, t_mask, \
emb_obj_ids["word_embedding"], emb_obj_ids["sent_embedding"], emb_obj_ids["pos_embedding"], \
emb_ids["word_embedding"], emb_ids["sent_embedding"], emb_ids["pos_embedding"], \
tgt_labels, tgt_pos = inputs
tot_loss = None
adv_step = self.args.adv_step
adv_lr = self.args.adv_lr
norm_type = self.args.norm_type
adv_max_norm = self.args.adv_max_norm
adv_init_mag = self.args.adv_init_mag
# shape(embedded) = (batch_size, num_timesteps, embedding_dim)
_emb_shape = [-1, self.max_img_len, self.vl_config["image_embedding_size"]]
_fake = fluid.layers.data(name="_fake", shape=_emb_shape, dtype='float32')
v_mask_slice = fluid.layers.slice(v_mask, axes=[1], starts=[0], ends=fluid.layers.shape(v_mask)[1])
v_length = fluid.layers.reduce_sum(v_mask_slice, dim=1, keep_dim=True) * self.vl_config["image_embedding_size"]
if adv_init_mag > 0:
if norm_type == 'l2':
delta = fluid.layers.uniform_random_batch_size_like(v_mask, shape=_fake.shape, min=-1.0, max=1.0)
delta = fluid.layers.slice(delta, axes=[1], starts=[0],
ends=fluid.layers.shape(img_embs["image_embedding"])[1])
delta = delta * v_mask_slice
mag = adv_init_mag / fluid.layers.sqrt(v_length)
delta = delta * mag
elif norm_type == 'inf':
delta = fluid.layers.uniform_random_batch_size_like(v_mask, shape=_fake.shape, min=-adv_init_mag,
max=adv_init_mag)
delta = fluid.layers.slice(delta, axes=[1], starts=[0],
ends=fluid.layers.shape(img_embs["image_embedding"])[1])
delta = delta * v_mask_slice
else:
raise ValueError("Unsupported norm type: %s" % norm_type)
else:
delta = fluid.layers.uniform_random_batch_size_like(v_mask, shape=_fake.shape, min=0.0, max=0.0)
delta = fluid.layers.slice(delta, axes=[1], starts=[0],
ends=fluid.layers.shape(img_embs["image_embedding"])[1])
delta = delta * v_mask_slice
for _ in range(adv_step):
if self.with_pure_model:
gene_pure = UNIMOModel(
emb_ids=emb_ids,
emb_obj_ids=emb_obj_ids,
input_mask=input_mask,
config=self.vl_config,
image_input=img_embs,
weight_sharing=self.weight_sharing,
task_type=self.task_type
)
pure_enc_out = gene_pure.get_sequence_output()
pure_fc_out = self.cal_logit(pure_enc_out, tgt_pos)
if self.label_smooth:
out_size = self.vl_config['vocab_size']
labels = fluid.layers.label_smooth(
label=fluid.layers.one_hot(
input=tgt_labels, depth=out_size),
epsilon=self.label_smooth)
pure_ce_loss = layers.softmax_with_cross_entropy(
logits=pure_fc_out, label=labels, soft_label=True)
else:
pure_ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=pure_fc_out, label=tgt_labels, return_softmax=True)
pure_loss = fluid.layers.mean(x=pure_ce_loss)
pure_loss = pure_loss / adv_step
gene = UNIMOModel(
image_adv_delta=delta,
emb_ids=emb_ids,
emb_obj_ids=emb_obj_ids,
input_mask=input_mask,
config=self.vl_config,
image_input=img_embs,
weight_sharing=self.weight_sharing,
task_type=self.task_type
)
enc_out = gene.get_sequence_output()
fc_out = self.cal_logit(enc_out, tgt_pos)
if self.label_smooth:
out_size = self.vl_config['vocab_size']
labels = fluid.layers.label_smooth(
label=fluid.layers.one_hot(
input=tgt_labels, depth=out_size),
epsilon=self.label_smooth)
ce_loss = layers.softmax_with_cross_entropy(
logits=fc_out, label=labels, soft_label=True)
else:
ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=fc_out, label=tgt_labels, return_softmax=True)
loss = fluid.layers.mean(x=ce_loss)
loss = loss / adv_step
delta_grad = fluid.backward.gradients(loss, delta)
delta_grad = delta_grad[0]
if norm_type == 'l2':
# update according to grads
# whether to use scale_l2
delta_norm = fluid.layers.sqrt(
fluid.layers.reduce_sum(
fluid.layers.pow(fluid.layers.reshape(delta_grad, [fluid.layers.shape(delta_grad)[0], -1]),
factor=2), dim=1, keep_dim=True))
_min = float(1e-8)
delta_norm = fluid.layers.clamp(delta_norm, min=_min)
delta = delta + adv_lr * delta_grad / delta_norm
# projection
if adv_max_norm > 0:
exceed_mask = (delta_norm > adv_max_norm).astype('float32')
reweights = (adv_max_norm / delta_norm) * exceed_mask + (1 - exceed_mask)
delta = delta * reweights
elif norm_type == 'inf':
delta_norm = fluid.layers.reduce_max(
fluid.layers.abs(fluid.layers.reshape(delta_grad, [fluid.layers.shape(delta_grad)[0], -1])), \
dim=1, keep_dim=True)
_min = float(1e-8)
delta_norm = fluid.layers.clamp(delta_norm, min=_min)
delta = delta + adv_lr * delta_grad / delta_norm
# projection
if adv_max_norm > 0:
delta = fluid.layers.clamp(delta, min=-adv_max_norm, max=adv_max_norm)
else:
raise ValueError("Unsupported norm type: %s" % norm_type)
delta_grad.stop_gradient = True
if self.with_pure_model:
kl_adv_text_loss = self._kl_divergence_with_logits(pure_fc_out, fc_out)
cur_loss = pure_loss + loss + self.adv_kl_weight * kl_adv_text_loss
else:
cur_loss = loss
tot_loss = cur_loss if tot_loss is None else tot_loss + cur_loss
graph_vars = {"loss": tot_loss}
for k, v in graph_vars.items():
v.persistable = True
return pyreader, graph_vars
def create_model_villa(self):
"""create villa model"""
img_input_shapes = [[-1, self.max_img_len, self.vl_config["image_embedding_size"]], # image_embedding
[-1, self.max_img_len, 5], # image_loc
[-1, self.max_seq_len + self.max_obj_len + self.max_img_len,
self.max_seq_len + self.max_obj_len + self.max_img_len], # input_mask
[-1, self.max_img_len, 1], # v_mask
[-1, self.max_seq_len, 1], # t_mask
[-1, self.max_obj_len, 1], # padded_obj_token_id
[-1, self.max_obj_len, 1], # padded_obj_sent_ids
[-1, self.max_obj_len, 1]] # padded_obj_pos_ids
img_input_dtypes = ['float32', 'float32', 'float32', 'float32', 'float32', 'int64', 'int64', 'int64']
img_input_lod_levels = [0, 0, 0, 0, 0, 0, 0, 0]
emb_num = 3
text_input_shapes = [[-1, self.max_seq_len, 1]] * emb_num + [[-1, 1], [-1, 1]]
text_input_dtypes = ['int64'] * emb_num + ['int64', 'int64']
text_input_lod_levels = [0] * emb_num + [0, 0]
shapes = img_input_shapes + text_input_shapes
dtypes = img_input_dtypes + text_input_dtypes
lod_levels = img_input_lod_levels + text_input_lod_levels
inputs = self.to_tensor(shapes, dtypes, lod_levels)
pyreader = fluid.io.DataLoader.from_generator(feed_list=inputs, capacity=70, iterable=False)
img_embs = {}
emb_obj_ids = {}
emb_ids = {}
img_embs["image_embedding"], img_embs["loc_embedding"], input_mask, v_mask, t_mask, \
emb_obj_ids["word_embedding"], emb_obj_ids["sent_embedding"], emb_obj_ids["pos_embedding"], \
emb_ids["word_embedding"], emb_ids["sent_embedding"], emb_ids["pos_embedding"], \
tgt_labels, tgt_pos = inputs
tot_loss = None
adv_step = self.args.adv_step
adv_lr = self.args.adv_lr
norm_type = self.args.norm_type
adv_max_norm = self.args.adv_max_norm
adv_init_mag = self.args.adv_init_mag
# shape(embedded) = (batch_size, num_timesteps, embedding_dim)
t_emb_shape = [-1, self.max_seq_len, self.vl_config['hidden_size']]
t_fake = fluid.layers.data(name="t_fake", shape=t_emb_shape, dtype='float32')
t_mask_slice = fluid.layers.slice(t_mask, axes=[1], starts=[0], ends=fluid.layers.shape(t_mask)[1])
t_length = fluid.layers.reduce_sum(t_mask_slice, dim=1, keep_dim=True) * self.vl_config['hidden_size']
# shape(embedded) = (batch_size, num_timesteps, embedding_dim)
v_emb_shape = [-1, self.max_img_len, self.vl_config["image_embedding_size"]]
v_fake = fluid.layers.data(name="v_fake", shape=v_emb_shape, dtype='float32')
v_mask_slice = fluid.layers.slice(v_mask, axes=[1], starts=[0], ends=fluid.layers.shape(v_mask)[1])
v_length = fluid.layers.reduce_sum(v_mask_slice, dim=1, keep_dim=True) * self.vl_config["image_embedding_size"]
if adv_init_mag > 0:
if norm_type == 'l2':
t_delta = fluid.layers.uniform_random_batch_size_like(t_mask, shape=t_fake.shape, min=-1.0, max=1.0)
t_delta = fluid.layers.slice(t_delta, axes=[1], starts=[0],
ends=fluid.layers.shape(emb_ids["word_embedding"])[1])
t_delta = t_delta * t_mask_slice
t_mag = adv_init_mag / fluid.layers.sqrt(t_length)
t_delta = t_delta * t_mag
v_delta = fluid.layers.uniform_random_batch_size_like(v_mask, shape=v_fake.shape, min=-1.0, max=1.0)
v_delta = fluid.layers.slice(v_delta, axes=[1], starts=[0],
ends=fluid.layers.shape(img_embs["image_embedding"])[1])
v_delta = v_delta * v_mask_slice
v_mag = adv_init_mag / fluid.layers.sqrt(v_length)
v_delta = v_delta * v_mag
elif norm_type == 'inf':
t_delta = fluid.layers.uniform_random_batch_size_like(t_mask, shape=t_fake.shape, min=-adv_init_mag,
max=adv_init_mag)
t_delta = fluid.layers.slice(t_delta, axes=[1], starts=[0],
ends=fluid.layers.shape(emb_ids["word_embedding"])[1])
t_delta = t_delta * t_mask_slice
v_delta = fluid.layers.uniform_random_batch_size_like(v_mask, shape=v_fake.shape, min=-adv_init_mag,
max=adv_init_mag)
v_delta = fluid.layers.slice(v_delta, axes=[1], starts=[0],
ends=fluid.layers.shape(img_embs["image_embedding"])[1])
v_delta = v_delta * v_mask_slice
else:
raise ValueError("Unsupported norm type: %s" % norm_type)
else:
t_delta = fluid.layers.uniform_random_batch_size_like(t_mask, shape=t_fake.shape, min=0.0, max=0.0)
t_delta = fluid.layers.slice(t_delta, axes=[1], starts=[0],
ends=fluid.layers.shape(emb_ids["word_embedding"])[1])
t_delta = t_delta * t_mask_slice
v_delta = fluid.layers.uniform_random_batch_size_like(v_mask, shape=v_fake.shape, min=0.0, max=0.0)
v_delta = fluid.layers.slice(v_delta, axes=[1], starts=[0],
ends=fluid.layers.shape(img_embs["image_embedding"])[1])
v_delta = v_delta * v_mask_slice
for _ in range(adv_step):
if self.with_pure_model:
gene_pure = UNIMOModel(
emb_ids=emb_ids,
emb_obj_ids=emb_obj_ids,
input_mask=input_mask,
config=self.vl_config,
image_input=img_embs,
weight_sharing=self.weight_sharing,
task_type=self.task_type
)
pure_enc_out = gene_pure.get_sequence_output()
pure_fc_out = self.cal_logit(pure_enc_out, tgt_pos)
if self.label_smooth:
out_size = self.vl_config['vocab_size']
labels = fluid.layers.label_smooth(
label=fluid.layers.one_hot(
input=tgt_labels, depth=out_size),
epsilon=self.label_smooth)
pure_ce_loss = layers.softmax_with_cross_entropy(
logits=pure_fc_out, label=labels, soft_label=True)
else:
pure_ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=pure_fc_out, label=tgt_labels, return_softmax=True)
pure_loss = fluid.layers.mean(x=pure_ce_loss)
pure_loss = pure_loss / adv_step
# text adversarial learning
gene_text = UNIMOModel(
text_adv_delta=t_delta,
emb_ids=emb_ids,
emb_obj_ids=emb_obj_ids,
input_mask=input_mask,
config=self.vl_config,
image_input=img_embs,
weight_sharing=self.weight_sharing,
task_type=self.task_type
)
text_enc_out = gene_text.get_sequence_output()
text_fc_out = self.cal_logit(text_enc_out, tgt_pos)
if self.label_smooth:
out_size = self.vl_config['vocab_size']
labels = fluid.layers.label_smooth(
label=fluid.layers.one_hot(
input=tgt_labels, depth=out_size),
epsilon=self.label_smooth)
text_ce_loss = layers.softmax_with_cross_entropy(
logits=text_fc_out, label=labels, soft_label=True)
else:
text_ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=text_fc_out, label=tgt_labels, return_softmax=True)
text_loss = fluid.layers.mean(x=text_ce_loss)
text_loss = text_loss / adv_step
# image adversarial learning
gene_img = UNIMOModel(
image_adv_delta=v_delta,
emb_ids=emb_ids,
emb_obj_ids=emb_obj_ids,
input_mask=input_mask,
config=self.vl_config,
image_input=img_embs,
weight_sharing=self.weight_sharing,
task_type=self.task_type
)
img_enc_out = gene_img.get_sequence_output()
img_fc_out = self.cal_logit(img_enc_out, tgt_pos)
if self.label_smooth:
out_size = self.vl_config['vocab_size']
labels = fluid.layers.label_smooth(
label=fluid.layers.one_hot(
input=tgt_labels, depth=out_size),
epsilon=self.label_smooth)
img_ce_loss = layers.softmax_with_cross_entropy(
logits=img_fc_out, label=labels, soft_label=True)
else:
img_ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=img_fc_out, label=tgt_labels, return_softmax=True)
img_loss = fluid.layers.mean(x=img_ce_loss)
img_loss = img_loss / adv_step
# update delta
delta_grad_text = fluid.backward.gradients(text_loss, t_delta)
delta_grad_img = fluid.backward.gradients(img_loss, v_delta)
delta_grad_text = delta_grad_text[0]
delta_grad_img = delta_grad_img[0]
if norm_type == 'l2':
# update according to grads
# whether to use scale_l2
# text part
delta_norm_text = fluid.layers.sqrt(
fluid.layers.reduce_sum(
fluid.layers.pow(fluid.layers.reshape(delta_grad_text,
[fluid.layers.shape(delta_grad_text)[0], -1]), factor=2),
dim=1, keep_dim=True))
_min = float(1e-8)
delta_norm_text = fluid.layers.clamp(delta_norm_text, min=_min)
t_delta = t_delta + adv_lr * delta_grad_text / delta_norm_text
# image part
delta_norm_img = fluid.layers.sqrt(
fluid.layers.reduce_sum(
fluid.layers.pow(fluid.layers.reshape(delta_grad_img,
[fluid.layers.shape(delta_grad_img)[0], -1]), factor=2),
dim=1, keep_dim=True))
_min = float(1e-8)
delta_norm_img = fluid.layers.clamp(delta_norm_img, min=_min)
v_delta = v_delta + adv_lr * delta_grad_img / delta_norm_img
# projection
if adv_max_norm > 0:
# text part
exceed_mask_text = (delta_norm_text > adv_max_norm).astype('float32')
reweights_text = (adv_max_norm / delta_norm_text) * exceed_mask_text + (1 - exceed_mask_text)
t_delta = t_delta * reweights_text
# image part
exceed_mask_img = (delta_norm_img > adv_max_norm).astype('float32')
reweights_img = (adv_max_norm / delta_norm_img) * exceed_mask_img + (1 - exceed_mask_img)
v_delta = v_delta * reweights_img
elif norm_type == 'inf':
# text part
delta_norm_text = fluid.layers.reduce_max(
fluid.layers.abs(fluid.layers.reshape(delta_grad_text,
[fluid.layers.shape(delta_grad_text)[0], -1])),
dim=1, keep_dim=True)
_min = float(1e-8)
delta_norm_text = fluid.layers.clamp(delta_norm_text, min=_min)
t_delta = t_delta + adv_lr * delta_grad_text / delta_norm_text
# image part
delta_norm_image = fluid.layers.reduce_max(
fluid.layers.abs(fluid.layers.reshape(delta_grad_img,
[fluid.layers.shape(delta_grad_img)[0], -1])),
dim=1, keep_dim=True)
_min = float(1e-8)
delta_norm_image = fluid.layers.clamp(delta_norm_image, min=_min)
v_delta = v_delta + adv_lr * delta_grad_img / delta_norm_image
# projection
if adv_max_norm > 0:
t_delta = fluid.layers.clamp(t_delta, min=-adv_max_norm, max=adv_max_norm)
v_delta = fluid.layers.clamp(v_delta, min=-adv_max_norm, max=adv_max_norm)
else:
raise ValueError("Unsupported norm type: %s" % norm_type)
# delta.stop_gradient=True
delta_grad_text.stop_gradient = True
delta_grad_img.stop_gradient = True
if self.with_pure_model:
kl_adv_text_loss = self._kl_divergence_with_logits(pure_fc_out, text_fc_out)
kl_adv_image_loss = self._kl_divergence_with_logits(pure_fc_out, img_fc_out)
cur_loss = pure_loss + text_loss + img_loss + self.adv_kl_weight * (kl_adv_text_loss + kl_adv_image_loss)
else:
cur_loss = text_loss + img_loss
tot_loss = cur_loss if tot_loss is None else tot_loss + cur_loss
graph_vars = {"loss": tot_loss}
for k, v in graph_vars.items():
v.persistable = True
return pyreader, graph_vars
def create_model(self, decoding=False):
"""create model for training"""
if decoding:
return self.fast_decode()
img_input_shapes = [[-1, self.max_img_len, self.vl_config["image_embedding_size"]], # image_embedding
[-1, self.max_img_len, 5], # image_loc
[-1, self.max_seq_len + self.max_obj_len + self.max_img_len,
self.max_seq_len + self.max_obj_len + self.max_img_len], # input_mask
[-1, self.max_img_len, 1], # v_mask
[-1, self.max_seq_len, 1], # t_mask
[-1, self.max_obj_len, 1], # padded_obj_token_id
[-1, self.max_obj_len, 1], # padded_obj_sent_ids
[-1, self.max_obj_len, 1]] # padded_obj_pos_ids
img_input_dtypes = ['float32', 'float32', 'float32', 'float32', 'float32', 'int64', 'int64', 'int64']
img_input_lod_levels = [0, 0, 0, 0, 0, 0, 0, 0]
emb_num = 3
text_input_shapes = [[-1, self.max_seq_len, 1]] * emb_num + [[-1, 1], [-1, 1]]
text_input_dtypes = ['int64'] * emb_num + ['int64', 'int64']
text_input_lod_levels = [0] * emb_num + [0, 0]
shapes = img_input_shapes + text_input_shapes
dtypes = img_input_dtypes + text_input_dtypes
lod_levels = img_input_lod_levels + text_input_lod_levels
inputs = self.to_tensor(shapes, dtypes, lod_levels)
pyreader = fluid.io.DataLoader.from_generator(feed_list=inputs, capacity=70, iterable=False)
img_embs = {}
emb_obj_ids = {}
emb_ids = {}
img_embs["image_embedding"], img_embs["loc_embedding"], input_mask, v_mask, t_mask, \
emb_obj_ids["word_embedding"], emb_obj_ids["sent_embedding"], emb_obj_ids["pos_embedding"], \
emb_ids["word_embedding"], emb_ids["sent_embedding"], emb_ids["pos_embedding"], \
tgt_labels, tgt_pos = inputs
gene = UNIMOModel(
emb_ids=emb_ids,
emb_obj_ids=emb_obj_ids,
input_mask=input_mask,
config=self.vl_config,
image_input=img_embs,
weight_sharing=self.weight_sharing,
task_type=self.task_type)
enc_out = gene.get_sequence_output()
fc_out = self.cal_logit(enc_out, tgt_pos)
if self.label_smooth:
out_size = self.vl_config['vocab_size']
labels = fluid.layers.label_smooth(
label=fluid.layers.one_hot(
input=tgt_labels, depth=out_size),
epsilon=self.label_smooth)
ce_loss = layers.softmax_with_cross_entropy(
logits=fc_out, label=labels, soft_label=True)
else:
ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=fc_out, label=tgt_labels, return_softmax=True)
loss = fluid.layers.mean(x=ce_loss)
graph_vars = {"loss": loss}
for k, v in graph_vars.items():
v.persistable = True
return pyreader, graph_vars
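# Note on the label-smoothing branch above: with epsilon = label_smooth,
# fluid.layers.label_smooth turns the one-hot target t into
#     (1 - epsilon) * t + epsilon / vocab_size
# and the cross entropy is then taken against that soft distribution.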
def fast_decode(self):
"""create model for inference"""
input_shapes = [[-1, self.max_img_len, self.vl_config["image_embedding_size"]], # image_embedding
[-1, self.max_img_len, 5], # image_loc
[-1, self.max_img_len + self.max_obj_len, self.max_img_len + self.max_obj_len],# img_input_mask
[-1, 1], # image_id
[-1, self.max_obj_len, 1], # padded_obj_token_id
[-1, self.max_obj_len, 1], # padded_obj_sent_ids
[-1, self.max_obj_len, 1]] # padded_obj_pos_ids
input_dtypes = ['float32', 'float32', 'float32', 'int32', 'int64', 'int64', 'int64']
input_lod_levels = [0, 0, 0, 0, 0, 0, 0]
shapes = input_shapes + [[-1, 1, 1], [-1, 1, 1],
[-1, 1], [-1], [-1, 1, self.max_img_len + self.max_obj_len]]
dtypes = input_dtypes + ['int64', 'int64', 'float32', 'int32', 'float32']
lod_levels = input_lod_levels + [2, 2, 2, 0, 0]
inputs = self.to_tensor(shapes, dtypes, lod_levels)
pyreader = fluid.io.DataLoader.from_generator(feed_list=inputs, capacity=70, iterable=False)
img_embs = {}
emb_obj_ids = {}
img_embs["image_embedding"], img_embs["loc_embedding"], input_mask, image_ids, \
emb_obj_ids["word_embedding"], emb_obj_ids["sent_embedding"], emb_obj_ids["pos_embedding"], \
tgt_ids, tgt_pos, init_scores, parent_idx, tgt_input_mask = inputs
gene = UNIMOModel(
emb_obj_ids=emb_obj_ids,
input_mask=input_mask,
image_input=img_embs,
config=self.vl_config,
weight_sharing=self.weight_sharing,
task_type=self.task_type,
decoding=True,
gather_idx=parent_idx)
max_len = layers.fill_constant(
shape=[1], dtype=tgt_ids.dtype, value=self.max_out_len, force_cpu=True)
min_len = layers.fill_constant(
shape=[1], dtype=tgt_ids.dtype, value=self.min_out_len, force_cpu=True)
neg_inf = layers.fill_constant(
shape=[1], dtype='float32', value=-1e18)
step_idx = layers.fill_constant(
shape=[1], dtype=tgt_ids.dtype, value=0, force_cpu=True)
step_next_idx = layers.fill_constant(
shape=[1], dtype=tgt_ids.dtype, value=1, force_cpu=True)
cond = layers.less_than(x=step_idx, y=max_len)
while_op = layers.While(cond)
ids = layers.array_write(layers.reshape(tgt_ids, (-1, 1)), step_idx)
pos_biases = layers.array_write(tgt_pos, step_idx)
scores = layers.array_write(init_scores, step_idx)
tgt_masks = layers.array_write(tgt_input_mask, step_idx)
trigram_blocking = TrigramBlocking(tgt_ids, self.tokenizer, beam_size=self.beam_size)
with while_op.block():
pre_ids = layers.array_read(array=ids, i=step_idx)
pre_ids = layers.reshape(pre_ids, (-1, 1, 1), inplace=True)
pre_scores = layers.array_read(array=scores, i=step_idx)
pos_bias = layers.array_read(array=pos_biases, i=step_idx)
pos_bias = layers.gather(input=pos_bias, index=parent_idx)
def gen_batch_like(value, dtype="int64", shape=[-1, 1, 1], is_scalar=True):
"""generate batch"""
if is_scalar:
return layers.fill_constant_batch_size_like(
input=parent_idx, value=value, shape=shape, dtype=dtype)
else:
return layers.elementwise_mul(
x=layers.fill_constant_batch_size_like(
input=parent_idx, value=1, shape=shape, dtype=dtype),
y=value, axis=0)
tmp_mask = layers.array_read(tgt_masks, i=step_idx)
tmp_mask = layers.gather(input=tmp_mask, index=parent_idx)
append_1_mask = gen_batch_like(1.0, dtype=tmp_mask.dtype)
pre_mask = layers.concat([tmp_mask, append_1_mask], axis=2)
pre_pos = gen_batch_like(step_idx, is_scalar=False)
pre_pos = pre_pos + pos_bias  # position ids start from 2
pre_sent = gen_batch_like(self.tgt_type_id, dtype=pre_ids.dtype)
dec_emb_ids = {"word_embedding": pre_ids, "sent_embedding": pre_sent, "pos_embedding": pre_pos}
dec_out = gene.encode(emb_ids=dec_emb_ids,
input_mask=pre_mask,
gather_idx=parent_idx)
fc_out = self.cal_logit(dec_out, None)
# prevent generating end token if length less than min_out_len
eos_index = layers.fill_constant(shape=[layers.shape(fc_out)[0]],
dtype='int64',
value=self.eos_id)
eos_index = fluid.one_hot(eos_index, depth=self.vocab_size)
less_cond = layers.cast(layers.less_than(x=step_idx, y=min_len), dtype='float32')
less_val = layers.elementwise_mul(less_cond, neg_inf)
eos_val = layers.elementwise_mul(eos_index, less_val, axis=0)
revised_logits = layers.elementwise_add(fc_out, eos_val, axis=0)
# topK reduction across beams, also contain special handle of
# end beams and end sentences(batch reduction)
topk_scores, topk_indices = layers.topk(
input=layers.softmax(revised_logits), k=self.beam_size)
# Roll back the length penalty on the previous scores:
# `scores` stores length-penalized values, but accumulation must use the
# un-penalized score, so multiply the previous timestep's penalty back in
# here and re-apply the current timestep's penalty below.
# Safe at step_idx == 0 (initialization), because the previous score is 0.
pre_timestep_length_penalty = fluid.layers.pow(
((5.0 + fluid.layers.cast(step_idx, pre_scores.dtype)) / 6.0), self.length_penalty)
pre_scores_wo_len_penalty = fluid.layers.elementwise_mul(pre_scores, pre_timestep_length_penalty)
# calc trigram-blocking delta scores for current alive sequence
if self.block_trigram:
trigram_blocking.update_seq(pre_ids, parent_idx)
trigram_blocking.expand_cand_seq(topk_indices)
fluid.layers.py_func(func=trigram_blocking.blocking_forward,
x=[trigram_blocking.cand_seq,
trigram_blocking.id2is_full_token],
out=trigram_blocking.delta_score_out,
backward_func=None)
pre_scores_wo_len_penalty = fluid.layers.elementwise_add(x=trigram_blocking.delta_score_out,
y=pre_scores_wo_len_penalty,
axis=0)
# => [N, topk]
accu_scores = layers.elementwise_add(
x=layers.log(topk_scores), y=pre_scores_wo_len_penalty, axis=0)
cur_timestep_length_penalty = layers.pow(((5.0 + layers.cast(step_next_idx, accu_scores.dtype)) / 6.0),
self.length_penalty)
curr_scores = layers.elementwise_div(accu_scores, cur_timestep_length_penalty)
# beam_search op uses lod to differentiate branches.
curr_scores = layers.lod_reset(curr_scores, pre_ids)
topk_indices = layers.lod_reset(topk_indices, pre_ids)
selected_ids, selected_scores, gather_idx = layers.beam_search(
pre_ids=pre_ids,
pre_scores=pre_scores,
ids=topk_indices,
scores=curr_scores,
beam_size=self.beam_size,
end_id=self.eos_id,
return_parent_idx=True)
layers.increment(x=step_idx, value=1.0, in_place=True)
layers.increment(x=step_next_idx, value=1.0, in_place=True)
# cell states(caches) have been updated in wrap_decoder,
# only need to update beam search states here.
layers.array_write(selected_ids, i=step_idx, array=ids)
layers.array_write(selected_scores, i=step_idx, array=scores)
layers.array_write(pre_mask, i=step_idx, array=tgt_masks)
layers.array_write(pos_bias, i=step_idx, array=pos_biases)
layers.assign(gather_idx, parent_idx)
length_cond = layers.less_than(x=step_idx, y=max_len)
finish_cond = layers.logical_not(layers.is_empty(x=selected_ids))
layers.logical_and(x=length_cond, y=finish_cond, out=cond)
finished_ids, finished_scores = layers.beam_search_decode(
ids, scores, beam_size=self.beam_size, end_id=self.eos_id)
graph_vars = {
"finished_ids": finished_ids,
"finished_scores": finished_scores,
"image_ids": image_ids
}
for k, v in graph_vars.items():
v.persistable = True
return pyreader, graph_vars
def post_process_seq(self, seq):
"""
Post-process a beam-search decoded sequence: truncate at the first
<eos> and drop the leading <bos> token.
"""
eos_pos = len(seq)
for i, idx in enumerate(seq):
if idx == self.eos_id:
eos_pos = i
break
seq = seq[1:eos_pos]
return seq
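# Illustrative behavior (token ids hypothetical, assuming self.eos_id == 3):
#
#     seq = [0, 7, 8, 3, 9]            # [bos, tok, tok, eos, padding]
#     self.post_process_seq(seq)       # -> [7, 8]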
def remove_special_tokens(self, seq, special_tokens):
"""Remove special tokens from output sequence"""
seq = [idx for idx in seq if idx not in special_tokens]
return seq
def evaluate(self, resource, eval_phase, graph_vars, features=None,
output_path=None, dev_count=1, gpu_id=0):
"""evaluate model"""
exe, program, pyreader = resource["exe"], resource["program"], resource["pyreader"]
if eval_phase == "train":
fetch_list = [graph_vars["loss"].name]
if "learning_rate" in graph_vars:
fetch_list.append(graph_vars["learning_rate"].name)
outputs = exe.run(fetch_list=fetch_list)
np_loss = outputs[0]
ret = {"loss": np.mean(np_loss), "ppl": np.exp(np.mean(np_loss))}
if "learning_rate" in graph_vars:
ret["learning_rate"] = float(outputs[1][0])
return ret
if self.do_decode:
return_numpy = False
outfile = output_path + "/" + eval_phase
outfile_part = outfile + ".part" + str(gpu_id)
writer = codecs.open(outfile_part, 'w', encoding='utf-8')
fetch_keys = ["finished_ids", "finished_scores", "image_ids"]
special_tokens = [self.tokenizer.cls_token_id,
self.tokenizer.sep_token_id,
self.tokenizer.mask_token_id,
self.tokenizer.pad_token_id,
self.tokenizer.unk_token_id]
else:
steps = 0
cost = 0.0
return_numpy = True
fetch_keys = ["loss"]
fetch_list = [graph_vars[key].name for key in fetch_keys]
time_begin = time.time()
pyreader.start()
while True:
try:
outputs = exe.run(program=program,
fetch_list=fetch_list,
return_numpy=return_numpy)
if not self.do_decode:
np_loss = outputs[0]
cost += np.mean(np_loss)
steps += 1
else:
seq_ids, seq_scores, image_ids = outputs
seq_ids_list, seq_scores_list = ([seq_ids], [seq_scores]) \
if isinstance(seq_ids, paddle.fluid.core.LoDTensor) else (seq_ids, seq_scores)
image_ids = np.array(image_ids).reshape(-1).tolist()
data_idx = 0
for seq_ids, seq_scores in zip(seq_ids_list, seq_scores_list):
# How to parse the results:
# Suppose the lod of seq_ids is:
# [[0, 3, 6], [0, 12, 24, 40, 54, 67, 82]]
# then from lod[0]:
# there are 2 source sentences, beam width is 3.
# from lod[1]:
# the first source sentence has 3 hyps; the lengths are 12, 12, 16
# the second source sentence has 3 hyps; the lengths are 14, 13, 15
# hyps = [[] for i in range(len(seq_ids.lod()[0]) - 1)]
# scores = [[] for i in range(len(seq_scores.lod()[0]) - 1)]
for i in range(len(seq_ids.lod()[0]) - 1): # for each source sentence
start = seq_ids.lod()[0][i]
end = seq_ids.lod()[0][i + 1]
max_cand = None
for j in range(end - start): # for each candidate
sub_start = seq_ids.lod()[1][start + j]
sub_end = seq_ids.lod()[1][start + j + 1]
token_ids = [int(idx) for idx in self.post_process_seq(
np.array(seq_ids)[sub_start:sub_end])]
hyp_ids = self.remove_special_tokens(token_ids, special_tokens)
hyp_tokens = self.tokenizer.convert_ids_to_tokens(hyp_ids)
hyp_str = self.tokenizer.gptbpe_tokenizer.decode(hyp_tokens)
hyp_str = re.sub('\\s+', ' ', hyp_str)
score = np.array(seq_scores)[sub_end - 1]
if (not max_cand) or score > max_cand[1]:
max_cand = (hyp_str, score)
image_id = image_ids[data_idx]
data_idx += 1
pred = max_cand[0]
writer.write("%d\t%s\n" % (image_id, pred))
except fluid.core.EOFException:
pyreader.reset()
break
time_end = time.time()
if not self.do_decode:
eval_result = "loss: %f, ppl: %f" % (cost / steps, np.exp(cost / steps))
print("[%s evaluation] %s, elapsed time: %f s"
% (eval_phase, eval_result, time_end - time_begin))
else:
writer.close()
tmp_writer = codecs.open("%s/%s_dec_finish.%d" % (output_path, eval_phase, gpu_id),
'w', encoding='utf-8')
tmp_writer.close()
if gpu_id != 0:
return
while True:
ret = os.popen('find %s -maxdepth 1 -name "%s_dec_finish.*"' %
(output_path, eval_phase)).readlines()
if len(ret) != dev_count:
time.sleep(1)
continue
else:
break
all_outfiles = glob.glob("%s.part*" % outfile)
img_caption_res = []
unique_image_ids = set()
for cur_file in all_outfiles:
for line in codecs.open(cur_file, 'r', encoding='utf-8'):
image_id, caption = line.strip().split('\t')
if image_id in unique_image_ids:
print("Warning: Repeated image_id %s" % str(image_id))
continue
unique_image_ids.add(image_id)
img_caption_res.append({"image_id": int(image_id), "caption": caption})
fout = codecs.open(outfile, 'w', encoding='utf-8')
fout.write(json.dumps(img_caption_res))
fout.close()
os.system("rm %s.part*" % outfile)
os.system("rm %s/%s_dec_finish.*" % (output_path, eval_phase))
eval_result = self.evaluator.eval(outfile,
phase=eval_phase.split("_")[0], features=features)
print("[%s evaluation] %s, elapsed time: %f s"
% (eval_phase, eval_result, time_end - time_begin))
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Model for classifier."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import time
import numpy as np
import paddle.fluid as fluid
from model.unimo_finetune import UNIMOModel
from eval import glue_eval
from collections import OrderedDict
from utils.utils import print_eval_log
def create_model(args, pyreader_name, config):
"""create_model"""
stype = 'int64'
pyreader = fluid.layers.py_reader(
capacity=50,
shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1], [-1, args.max_seq_len, args.max_seq_len], [-1, 1],
[-1, 1]],
dtypes=[stype, stype, stype, 'float32', 'float32', stype],
lod_levels=[0, 0, 0, 0, 0, 0],
name=pyreader_name,
use_double_buffer=True)
(src_ids, sent_ids, pos_ids, input_mask, labels,
qids) = fluid.layers.read_file(pyreader)
emb_ids = {"word_embedding": src_ids, "sent_embedding": sent_ids, "pos_embedding": pos_ids}
model = UNIMOModel(
emb_ids=emb_ids,
input_mask=input_mask,
config=config)
cls_feats = model.get_pooled_text_output()
cls_feats = fluid.layers.dropout(
x=cls_feats,
dropout_prob=0.1,
dropout_implementation="upscale_in_train")
cls_params_name = ["cls_out_%d_w" % args.num_labels, "cls_out_%d_b" % args.num_labels]
logits = fluid.layers.fc(
input=cls_feats,
size=args.num_labels,
param_attr=fluid.ParamAttr(
name=cls_params_name[0],
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name=cls_params_name[1], initializer=fluid.initializer.Constant(0.)))
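# square error (MSE) loss: this head targets regression-style GLUE tasks
# such as STS-B, which is why evaluate() below reports pearson_and_spearman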
cost = fluid.layers.square_error_cost(input=logits, label=labels)
loss = fluid.layers.mean(x=cost)
num_seqs = fluid.layers.create_tensor(dtype='int64')
graph_vars = {
"loss": loss,
"probs": logits,
"labels": labels,
"num_seqs": num_seqs,
"qids": qids
}
return pyreader, graph_vars
def predict(exe, test_program, test_pyreader, graph_vars, dev_count=1):
"""predict"""
qids, scores, probs, preds = [], [], [], []
fetch_list = [graph_vars["probs"].name, graph_vars["qids"].name]
test_pyreader.start()
while True:
try:
if dev_count == 1:
np_probs, np_qids = exe.run(program=test_program, fetch_list=fetch_list)
else:
np_probs, np_qids = exe.run(fetch_list=fetch_list)
qids.extend(np_qids.reshape(-1).tolist())
np_preds = np.argmax(np_probs, axis=1).astype(np.float32)
preds.extend(np_preds)
probs.append(np_probs)
except fluid.core.EOFException:
test_pyreader.reset()
break
probs = np.concatenate(probs, axis=0).reshape([len(qids), -1])
return qids, preds, probs
def evaluate(args, exe, test_program, test_pyreader, graph_vars, eval_phase):
"""evaluate"""
qids, labels, scores = [], [], []
time_begin = time.time()
fetch_list = [
graph_vars["loss"].name, graph_vars["probs"].name,
graph_vars["labels"].name, graph_vars["qids"].name
]
test_pyreader.start()
while True:
try:
np_loss, np_probs, np_labels, np_qids = exe.run(
program=test_program, fetch_list=fetch_list) \
if not args.use_multi_gpu_test else exe.run(fetch_list=fetch_list)
labels.extend(np_labels.reshape((-1)).tolist())
if np_qids is not None:
qids.extend(np_qids.reshape(-1).tolist())
scores.extend(np_probs.reshape((-1)).tolist())
except fluid.core.EOFException:
test_pyreader.reset()
break
time_end = time.time()
ret = OrderedDict()
ret['phase'] = eval_phase
ret['loss'] = -1 # placeholder
ret['data_num'] = -1 # placeholder
ret['used_time'] = round(time_end - time_begin, 4)
metrics = OrderedDict()
metrics["pearson_and_spearman"] = glue_eval.pearson_and_spearman
if args.eval_mertrics in metrics:
ret_metric = metrics[args.eval_mertrics](scores, labels)
ret.update(ret_metric)
print_eval_log(ret)
else:
raise ValueError('unsupported metric {}'.format(args.eval_mertrics))
return ret
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Model for retrieval."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import glob
import time
import codecs
import numpy as np
import paddle.fluid as fluid
from eval import img_eval
from collections import OrderedDict
from utils.utils import print_eval_log
from model.unimo_finetune import UNIMOModel
def circle_loss(sp, sn, m, scale):
"""
sp: score list of positive samples, shape [B * L]
sn: score list of negative samples, shape [B * K]
m: relaxation factor in circle loss function
scale: scale factor in circle loss function
return: circle loss value, shape [1]
"""
op = 1. + m
on = 0. - m
delta_p = 1 - m
delta_n = m
ap = fluid.layers.relu(op - sp)
ap.stop_gradient = True
an = fluid.layers.relu(sn - on)
an.stop_gradient = True
logit_p = ap * (sp - delta_p)
logit_p = -1. * scale * logit_p
logit_p = fluid.layers.cast(x=logit_p, dtype=np.float64)
loss_p = fluid.layers.reduce_sum(fluid.layers.exp(logit_p), dim=1, keep_dim=False)
logit_n = an * (sn - delta_n)
logit_n = scale * logit_n
logit_n = fluid.layers.cast(x=logit_n, dtype=np.float64)
loss_n = fluid.layers.reduce_sum(fluid.layers.exp(logit_n), dim=1, keep_dim=False)
circle_loss = fluid.layers.log(1 + loss_n * loss_p)
circle_loss = fluid.layers.cast(x=circle_loss, dtype=np.float32)
return fluid.layers.mean(circle_loss)
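# A minimal NumPy sketch of circle_loss above, for reference only (assumed
# shapes: sp -> [B, 1], sn -> [B, K]); it mirrors the fluid graph and is not
# called anywhere in training.
def circle_loss_numpy_sketch(sp, sn, m, scale):
    """NumPy reference of circle_loss; returns a Python float."""
    ap = np.maximum(0., (1. + m) - sp)       # op = 1 + m; detached weight for positives
    an = np.maximum(0., sn - (0. - m))       # on = -m; detached weight for negatives
    logit_p = -scale * ap * (sp - (1. - m))  # delta_p = 1 - m
    logit_n = scale * an * (sn - m)          # delta_n = m
    loss_p = np.exp(logit_p).sum(axis=1)
    loss_n = np.exp(logit_n).sum(axis=1)
    return float(np.mean(np.log1p(loss_n * loss_p)))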
def create_model(args, phase, config, samples_num):
""""create_model"""
input_mask_shape = [-1, args.max_img_len + args.max_seq_len, args.max_img_len + args.max_seq_len]
src_ids = fluid.layers.data(name='src_ids', shape=[-1, args.max_seq_len, 1], dtype='int64')
pos_ids = fluid.layers.data(name='pos_ids', shape=[-1, args.max_seq_len, 1], dtype='int64')
sent_ids = fluid.layers.data(name='sent_ids', shape=[-1, args.max_seq_len, 1], dtype='int64')
input_mask = fluid.layers.data(name='input_mask', shape=input_mask_shape, dtype='float32')
image_embedding = fluid.layers.data(
name='image_embedding',
shape=[-1, args.max_img_len, config["image_embedding_size"]],
dtype='float32')
image_loc = fluid.layers.data(name='image_loc', shape=[-1, args.max_img_len, 5], dtype='float32')
labels = fluid.layers.data(name='labels', shape=[-1, 1], dtype='int64')
ids = fluid.layers.data(name='ids', shape=[-1, 2], dtype='int64')
drop_last = True if phase == 'train' else False
feed_list = [src_ids, pos_ids, sent_ids, input_mask, image_embedding, image_loc, labels, ids]
pyreader = fluid.io.DataLoader.from_generator(
feed_list=feed_list,
capacity=70,
use_double_buffer=True,
iterable=False,
drop_last=drop_last)
emb_ids = {"word_embedding": src_ids, "sent_embedding": sent_ids, "pos_embedding": pos_ids}
image_input = {"image_embedding": image_embedding, "loc_embedding": image_loc}
model = UNIMOModel(
emb_ids=emb_ids,
input_mask=input_mask,
config=config,
image_input=image_input,
weight_sharing=args.weight_sharing
)
text, image = model.get_pooled_output()
score = model.get_match_output(text, image, mode="mul")
score = fluid.layers.fc(
input=score,
size=1,
act=None,
param_attr=fluid.ParamAttr(
name='match_fc.w_0',
initializer=fluid.initializer.Xavier()),
bias_attr=fluid.ParamAttr(name='match_fc.b_0',
initializer=fluid.initializer.UniformInitializer()))
score = fluid.layers.reshape(score, [-1, samples_num])
if phase == 'train':
if args.use_sigmoid:
score = fluid.layers.sigmoid(score)
positive_score = score[:, 0]
image_neg_score = score[:, 1:int((samples_num + 1) / 2)]
caption_neg_score = score[:, int((samples_num + 1) / 2):]
acc = fluid.layers.accuracy(score, labels, k=1)
positive_score = fluid.layers.reshape(x=positive_score, shape=[-1, 1])
loss_c = circle_loss(positive_score, caption_neg_score, args.margin, args.scale_circle)
loss_i = circle_loss(positive_score, image_neg_score, args.margin, args.scale_circle)
total_loss = (loss_c + loss_i) / 2
else:
assert samples_num == 1
total_loss = fluid.layers.cross_entropy(input=score, label=labels)
total_loss = fluid.layers.mean(x=total_loss)
acc = fluid.layers.zeros_like(total_loss)
graph_vars = {"loss": total_loss, "acc": acc, "score": score, "label": labels, "ids": ids}
return pyreader, graph_vars
def evaluate(args, exe, test_pyreader, graph_vars, eval_phase, dev_count=1, gpu_id=0, data_reader=None):
"""evaluate"""
test_pyreader.start()
time_begin = time.time()
all_mat = None
fetch_list = [graph_vars["score"].name, graph_vars["ids"].name]
while True:
try:
score, ids = exe.run(fetch_list=fetch_list)
mat = np.concatenate([score, ids], axis=1)
if all_mat is None:
all_mat = mat
else:
all_mat = np.concatenate([all_mat, mat], axis=0)
except fluid.core.EOFException:
test_pyreader.reset()
break
time_end = time.time()
save_file = "%s/%s.trainers_%d.part_%d.npy" % (args.eval_dir, eval_phase, dev_count, gpu_id)
np.save(save_file, all_mat)
tmp_file = "%s/%s.trainers_%d.part_%d.finish" % (args.eval_dir, eval_phase, dev_count, gpu_id)
tmp_writer = codecs.open(tmp_file, "w", 'utf-8')
tmp_writer.close()
if gpu_id == 0:
while True:
ret = os.popen('find %s -maxdepth 1 -name "%s.trainers_%d.part_*.finish"' %
(args.eval_dir, eval_phase, dev_count)).readlines()
if len(ret) != dev_count:
time.sleep(1)
continue
else:
break
all_mat = None
save_files = glob.glob("%s/%s.trainers_%d.part_*.npy" % (args.eval_dir, eval_phase, dev_count))
for cur_save_file in save_files:
mat = np.load(cur_save_file)
if all_mat is None:
all_mat = mat
else:
all_mat = np.concatenate([all_mat, mat], axis=0)
cur_time = str(int(time.time()))
os.system("mkdir %s/%s" % (args.eval_dir, cur_time))
os.system("mv %s/%s.trainers_%d.* %s/%s" % (args.eval_dir, eval_phase, dev_count, args.eval_dir, cur_time))
assert data_reader is not None
text2img = {text_id: item[-1] for text_id, item in data_reader._caption_ids_dict.items()}
img2texts = data_reader._image_sent_map
ret = OrderedDict()
ret['phase'] = eval_phase
ret['loss'] = -1
ret['data_num'] = all_mat.shape[0]
ret['used_time'] = round(time_end - time_begin, 4)
metrics = OrderedDict()
metrics["recall@k"] = img_eval.recall_at_k
if args.eval_mertrics in metrics:
ret_metric = metrics[args.eval_mertrics](all_mat, text2img, img2texts)
ret.update(ret_metric)
print_eval_log(ret)
else:
raise ValueError('unsupported metric {}'.format(args.eval_mertrics))
return ret
else:
return None
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""seq2seq generation"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import re
import time
import numpy as np
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from model.unimo_finetune import UNIMOModel
from eval.gen_eval import GenerationEval
from finetune.trigram_blocking import TrigramBlocking
import codecs
class Seq2Seq(object):
"""finetuning for seq2seq generation"""
def __init__(self, args, gene_config, tokenizer):
self.gene_config = gene_config
self.weight_sharing = args.weight_sharing
self.task_type = args.task_type
self.max_seq_len = args.max_seq_len
self.label_smooth = args.label_smooth
self.tgt_type_id = args.tgt_type_id
self.continuous_position = args.continuous_position
self.tokenizer = tokenizer
self.vocab_size = gene_config["vocab_size"]
self._emb_dtype = "float32"
# for beam_search decoding
self.do_decode = args.do_decode
self.length_penalty = args.length_penalty
self.max_out_len = args.max_out_len
self.min_out_len = args.min_out_len
self.block_trigram = args.block_trigram
self.beam_size = args.beam_size
self.bos_id = tokenizer.cls_token_id
self.eos_id = tokenizer.mask_token_id
self.evaluator = GenerationEval(args)
if self.task_type == "dialog":
self.emb_keys = ["word_embedding", "role_embedding", "turn_embedding", "pos_embedding"]
else:
self.emb_keys = ["word_embedding", "sent_embedding", "pos_embedding"]
def cal_logit(self, enc_out, tgt_pos):
"""calculate logit"""
enc_out = fluid.layers.reshape(x=enc_out,
shape=[-1, self.gene_config["hidden_size"]])
if tgt_pos is not None:
tgt_pos = fluid.layers.cast(x=tgt_pos, dtype='int32')
tgt_feat = fluid.layers.gather(input=enc_out, index=tgt_pos)
else:
tgt_feat = enc_out
tgt_trans_feat = fluid.layers.fc(
input=tgt_feat,
size=self.gene_config["emb_size"] or self.gene_config["hidden_size"],
act=self.gene_config["hidden_act"],
param_attr=fluid.ParamAttr(
name="mask_lm_trans_fc.w_0",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=fluid.ParamAttr(
name="mask_lm_trans_fc.b_0",
initializer=fluid.initializer.Constant(0.)))
tgt_trans_feat = fluid.layers.layer_norm(
tgt_trans_feat,
begin_norm_axis=len(tgt_trans_feat.shape) - 1,
param_attr=fluid.ParamAttr(
name='mask_lm_trans_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name='mask_lm_trans_layer_norm_bias',
initializer=fluid.initializer.Constant(1.)))
seq2seq_out_bias_attr = fluid.ParamAttr(
name="mask_lm_out_fc.b_0",
initializer=fluid.initializer.Constant(value=0.0))
if self.weight_sharing:
fc_out = fluid.layers.matmul(
x=tgt_trans_feat,
y=fluid.default_main_program().global_block().var(
"word_embedding"),
transpose_y=True)
fc_out += fluid.layers.create_parameter(
shape=[self.gene_config['vocab_size']],
dtype="float32",
attr=seq2seq_out_bias_attr,
is_bias=True)
else:
out_size = self.gene_config["tgt_vocab_size"] or self.gene_config['vocab_size']
fc_out = fluid.layers.fc(input=tgt_trans_feat,
size=out_size,
param_attr=fluid.ParamAttr(
name="mask_lm_out_fc.w_0",
initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
bias_attr=seq2seq_out_bias_attr)
return fc_out
def to_tensor(self, shapes, dtypes, lod_levels):
"""convert to tensor"""
return [fluid.layers.data(name="placeholder_" + str(i), shape=shapes[i], dtype=dtypes[i],
lod_level=lod_levels[i]) for i in range(len(shapes))]
def create_model(self, decoding=False):
"""create model for training"""
if decoding:
return self.fast_decode()
if self.task_type == "dialog":
emb_num = 4
else:
emb_num = 3
input_shapes = [[-1, self.max_seq_len, 1]] * emb_num + \
[[-1, self.max_seq_len, self.max_seq_len]]
input_dtypes = ['int64'] * emb_num + ['float32']
input_lod_levels = [0] * emb_num + [0]
shapes = input_shapes + [[-1, 1], [-1, 1]]
dtypes = input_dtypes + ['int64', 'int64']
lod_levels = input_lod_levels + [0, 0]
inputs = self.to_tensor(shapes, dtypes, lod_levels)
pyreader = fluid.io.DataLoader.from_generator(feed_list=inputs, capacity=70, iterable=False)
emb_ids = {}
for key, value in zip(self.emb_keys, inputs[:emb_num]):
emb_ids[key] = value # for embeddings
input_mask = inputs[emb_num]
tgt_labels, tgt_pos = inputs[-2:]
unimo = UNIMOModel(
emb_ids=emb_ids,
input_mask=input_mask,
config=self.gene_config,
task_type=self.task_type)
enc_out = unimo.get_sequence_output()
fc_out = self.cal_logit(enc_out, tgt_pos)
if self.label_smooth:
out_size = self.gene_config['vocab_size']
labels = fluid.layers.label_smooth(
label=fluid.layers.one_hot(
input=tgt_labels, depth=out_size),
epsilon=self.label_smooth)
ce_loss = layers.softmax_with_cross_entropy(
logits=fc_out, label=labels, soft_label=True)
else:
ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=fc_out, label=tgt_labels, return_softmax=True)
loss = fluid.layers.mean(x=ce_loss)
graph_vars = {"loss": loss}
for k, v in graph_vars.items():
v.persistable = True
return pyreader, graph_vars
def fast_decode(self):
"""create model for inference"""
if self.task_type == "dialog":
emb_num = 4
else:
emb_num = 3
input_shapes = [[-1, self.max_seq_len, 1]] * emb_num + \
[[-1, self.max_seq_len, self.max_seq_len]]
input_dtypes = ['int64'] * emb_num + ['float32']
input_lod_levels = [0] * emb_num + [0]
shapes = input_shapes + [[-1, 1, 1], [-1, 1, 1],
[-1, 1], [-1], [-1, 1, self.max_seq_len], [-1, 1]]
dtypes = input_dtypes + ['int64', 'int64', 'float32', 'int32', 'float32', 'int64']
lod_levels = input_lod_levels + [2, 2, 2, 0, 0, 0]
inputs = self.to_tensor(shapes, dtypes, lod_levels)
pyreader = fluid.io.DataLoader.from_generator(feed_list=inputs, capacity=70, iterable=False)
emb_ids = {}
for key, value in zip(self.emb_keys, inputs[:emb_num]):
emb_ids[key] = value
input_mask = inputs[emb_num]
tgt_ids, tgt_pos, init_scores, parent_idx, tgt_input_mask, data_ids = inputs[-6:]
unimo = UNIMOModel(
emb_ids=emb_ids,
input_mask=input_mask,
config=self.gene_config,
task_type=self.task_type,
decoding=True,
gather_idx=parent_idx)
max_len = layers.fill_constant(
shape=[1], dtype=tgt_ids.dtype, value=self.max_out_len, force_cpu=True)
min_len = layers.fill_constant(
shape=[1], dtype=tgt_ids.dtype, value=self.min_out_len, force_cpu=True)
neg_inf = layers.fill_constant(
shape=[1], dtype='float32', value=-1e18)
step_idx = layers.fill_constant(
shape=[1], dtype=tgt_ids.dtype, value=0, force_cpu=True)
step_next_idx = layers.fill_constant(
shape=[1], dtype=tgt_ids.dtype, value=1, force_cpu=True)
cond = layers.less_than(x=step_idx, y=max_len)
while_op = layers.While(cond)
ids = layers.array_write(layers.reshape(tgt_ids, (-1, 1)), step_idx)
pos_biases = layers.array_write(tgt_pos, step_idx)
scores = layers.array_write(init_scores, step_idx)
tgt_masks = layers.array_write(tgt_input_mask, step_idx)
trigram_blocking = TrigramBlocking(tgt_ids, self.tokenizer, beam_size=self.beam_size)
with while_op.block():
pre_ids = layers.array_read(array=ids, i=step_idx)
pre_ids = layers.reshape(pre_ids, (-1, 1, 1), inplace=True)
pre_scores = layers.array_read(array=scores, i=step_idx)
pos_bias = layers.array_read(array=pos_biases, i=step_idx)
pos_bias = layers.gather(input=pos_bias, index=parent_idx)
def gen_batch_like(value, dtype="int64", shape=[-1, 1, 1], is_scalar=True):
"""generate batch"""
if is_scalar:
return layers.fill_constant_batch_size_like(
input=parent_idx, value=value, shape=shape, dtype=dtype)
else:
return layers.elementwise_mul(
x=layers.fill_constant_batch_size_like(
input=parent_idx, value=1, shape=shape, dtype=dtype),
y=value, axis=0)
tmp_mask = layers.array_read(tgt_masks, i=step_idx)
tmp_mask = layers.gather(input=tmp_mask, index=parent_idx)
append_1_mask = gen_batch_like(1.0, dtype=tmp_mask.dtype)
pre_mask = layers.concat([tmp_mask, append_1_mask], axis=2)
pre_pos = gen_batch_like(step_idx, is_scalar=False)
pre_pos = pre_pos + pos_bias  # position ids start from 2
pre_sent = gen_batch_like(self.tgt_type_id, dtype=pre_ids.dtype)
dec_emb_ids = {"word_embedding": pre_ids, "pos_embedding": pre_pos}
if self.task_type == "dialog":
role_ids = gen_batch_like(0)
turn_ids = gen_batch_like(0)
dec_emb_ids["role_embedding"] = role_ids
dec_emb_ids["turn_embedding"] = turn_ids
else:
dec_emb_ids["sent_embedding"] = pre_sent
dec_out = unimo.encode(emb_ids=dec_emb_ids,
input_mask=pre_mask,
gather_idx=parent_idx)
fc_out = self.cal_logit(dec_out, None)
# prevent generating end token if length less than min_out_len
eos_index = layers.fill_constant(shape=[layers.shape(fc_out)[0]],
dtype='int64',
value=self.eos_id)
eos_index = fluid.one_hot(eos_index, depth=self.vocab_size)
less_cond = layers.cast(layers.less_than(x=step_idx, y=min_len), dtype='float32')
less_val = layers.elementwise_mul(less_cond, neg_inf)
eos_val = layers.elementwise_mul(eos_index, less_val, axis=0)
revised_logits = layers.elementwise_add(fc_out, eos_val, axis=0)
# topK reduction across beams; also contains special handling of
# finished beams and finished sentences (batch reduction)
topk_scores, topk_indices = layers.topk(
input=layers.softmax(revised_logits), k=self.beam_size)
# Roll back previous scores for the length penalty:
# previous scores were already length-penalized, so before applying this
# timestep's penalty the old one must be undone. `scores` therefore stores
# the penalized score, while the accumulation below uses the un-penalized one.
# This is safe for step_idx == 0 (initialization), since the previous score is 0.
pre_timestep_length_penalty = fluid.layers.pow(
((5.0 + fluid.layers.cast(step_idx, pre_scores.dtype)) / 6.0), self.length_penalty)
pre_scores_wo_len_penalty = fluid.layers.elementwise_mul(pre_scores, pre_timestep_length_penalty)
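# Illustrative numbers (assumed, not from a real run): with length_penalty=0.6
# at step_idx=4, lp_prev = ((5 + 4) / 6) ** 0.6 ~= 1.275, so a stored score of
# -0.8 rolls back to -0.8 * 1.275 ~= -1.020 before the new token log-prob is
# added; the sum is then re-divided by lp_cur = ((5 + 5) / 6) ** 0.6 ~= 1.359.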
# calc trigram-blocking delta scores for current alive sequence
if self.block_trigram:
trigram_blocking.update_seq(pre_ids, parent_idx)
trigram_blocking.expand_cand_seq(topk_indices)
fluid.layers.py_func(func=trigram_blocking.blocking_forward,
x=[trigram_blocking.cand_seq,
trigram_blocking.id2is_full_token],
out=trigram_blocking.delta_score_out,
backward_func=None)
pre_scores_wo_len_penalty = fluid.layers.elementwise_add(x=trigram_blocking.delta_score_out,
y=pre_scores_wo_len_penalty,
axis=0)
# => [N, topk]
accu_scores = layers.elementwise_add(
x=layers.log(topk_scores), y=pre_scores_wo_len_penalty, axis=0)
cur_timestep_length_penalty = layers.pow(((5.0 + layers.cast(step_next_idx, accu_scores.dtype)) / 6.0),
self.length_penalty)
curr_scores = layers.elementwise_div(accu_scores, cur_timestep_length_penalty)
# beam_search op uses lod to differentiate branches.
curr_scores = layers.lod_reset(curr_scores, pre_ids)
topk_indices = layers.lod_reset(topk_indices, pre_ids)
selected_ids, selected_scores, gather_idx = layers.beam_search(
pre_ids=pre_ids,
pre_scores=pre_scores,
ids=topk_indices,
scores=curr_scores,
beam_size=self.beam_size,
end_id=self.eos_id,
return_parent_idx=True)
layers.increment(x=step_idx, value=1.0, in_place=True)
layers.increment(x=step_next_idx, value=1.0, in_place=True)
# cell states(caches) have been updated in wrap_decoder,
# only need to update beam search states here.
layers.array_write(selected_ids, i=step_idx, array=ids)
layers.array_write(selected_scores, i=step_idx, array=scores)
layers.array_write(pre_mask, i=step_idx, array=tgt_masks)
layers.array_write(pos_bias, i=step_idx, array=pos_biases)
layers.assign(gather_idx, parent_idx)
length_cond = layers.less_than(x=step_idx, y=max_len)
finish_cond = layers.logical_not(layers.is_empty(x=selected_ids))
layers.logical_and(x=length_cond, y=finish_cond, out=cond)
finished_ids, finished_scores = layers.beam_search_decode(
ids, scores, beam_size=self.beam_size, end_id=self.eos_id)
graph_vars = {
"finished_ids": finished_ids,
"finished_scores": finished_scores,
"data_ids": data_ids
}
for k, v in graph_vars.items():
v.persistable = True
return pyreader, graph_vars
def post_process_seq(self, seq):
"""
Post-process the beam-search decoded sequence: truncate at the first
<eos> and drop the leading <bos> token.
"""
eos_pos = len(seq)
for i, idx in enumerate(seq):
if idx == self.eos_id:
eos_pos = i
break
seq = seq[1:eos_pos]
return seq
def remove_special_tokens(self, seq, special_tokens):
"""Remove special tokens from output sequence"""
seq = [idx for idx in seq if idx not in special_tokens]
return seq
def evaluate(self, resource, eval_phase, graph_vars, features=None,
output_path=None, dev_count=1, gpu_id=0):
"""evaluate model"""
exe, program, pyreader = resource["exe"], resource["program"], resource["pyreader"]
if eval_phase == "train":
fetch_list = [graph_vars["loss"].name]
if "learning_rate" in graph_vars:
fetch_list.append(graph_vars["learning_rate"].name)
outputs = exe.run(fetch_list=fetch_list)
np_loss = outputs[0]
ret = {"loss": np.mean(np_loss), "ppl": np.exp(np.mean(np_loss))}
if "learning_rate" in graph_vars:
ret["learning_rate"] = float(outputs[1][0])
return ret
if self.do_decode:
return_numpy = False
outfile = output_path + "/" + eval_phase
outfile_part = outfile + ".part" + str(gpu_id)
writer = codecs.open(outfile_part, 'w', encoding='utf-8')
fetch_keys = ["finished_ids", "finished_scores", "data_ids"]
special_tokens = [self.tokenizer.cls_token_id,
self.tokenizer.mask_token_id,
self.tokenizer.pad_token_id,
self.tokenizer.unk_token_id]
else:
steps = 0
cost = 0.0
return_numpy = True
fetch_keys = ["loss"]
fetch_list = [graph_vars[key].name for key in fetch_keys]
time_begin = time.time()
pyreader.start()
while True:
try:
outputs = exe.run(program=program,
fetch_list=fetch_list,
return_numpy=return_numpy)
if not self.do_decode:
np_loss = outputs[0]
cost += np.mean(np_loss)
steps += 1
else:
seq_ids, seq_scores, data_ids = outputs
seq_ids_list, seq_scores_list = ([seq_ids], [seq_scores]) \
    if isinstance(seq_ids, paddle.fluid.core.LoDTensor) else (seq_ids, seq_scores)
data_ids = np.array(data_ids).reshape(-1).tolist()
data_idx = 0
for seq_ids, seq_scores in zip(seq_ids_list, seq_scores_list):
# How to parse the results:
# Suppose the lod of seq_ids is:
# [[0, 3, 6], [0, 12, 24, 40, 54, 67, 82]]
# then from lod[0]:
# there are 2 source sentences, beam width is 3.
# from lod[1]:
# the first source sentence has 3 hyps; the lengths are 12, 12, 16
# the second source sentence has 3 hyps; the lengths are 14, 13, 15
for i in range(len(seq_ids.lod()[0]) - 1): # for each source sentence
start = seq_ids.lod()[0][i]
end = seq_ids.lod()[0][i + 1]
max_cand = None
for j in range(end - start): # for each candidate
sub_start = seq_ids.lod()[1][start + j]
sub_end = seq_ids.lod()[1][start + j + 1]
token_ids = [int(idx) for idx in self.post_process_seq(
np.array(seq_ids)[sub_start:sub_end])]
hyp_ids = self.remove_special_tokens(token_ids, special_tokens)
hyp_tokens = self.tokenizer.convert_ids_to_tokens(hyp_ids)
hyp_str = self.tokenizer.gptbpe_tokenizer.decode(hyp_tokens)
hyp_str = re.sub('\\s+', ' ', hyp_str)
score = np.array(seq_scores)[sub_end - 1]
if (not max_cand) or score > max_cand[1]:
max_cand = (hyp_str, score)
data_id = data_ids[data_idx]
data_idx += 1
pred = max_cand[0]
writer.write("%d\t%s\n" % (data_id, pred))
except fluid.core.EOFException:
pyreader.reset()
break
time_end = time.time()
if not self.do_decode:
eval_result = "loss: %f, ppl: %f" % (cost / steps, np.exp(cost / steps))
print("[%s evaluation] %s, elapsed time: %f s"
% (eval_phase, eval_result, time_end - time_begin))
else:
writer.close()
tmp_writer = codecs.open("%s/%s_dec_finish.%d" % (output_path, eval_phase, gpu_id),
'w', encoding='utf-8')
tmp_writer.close()
if gpu_id != 0:
return
while True:
ret = os.popen('find %s -maxdepth 1 -name "%s_dec_finish.*"' %
(output_path, eval_phase)).readlines()
if len(ret) != dev_count:
time.sleep(1)
continue
else:
break
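# Merge the per-GPU part files: sort numerically on the data id in field 1,
# then keep only the prediction text in field 2 (illustratively, a line
# "3\ta red bus." sorts before "12\ttwo dogs play.").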
os.system("sort -t '\t' -k 1 -n %s.part* | awk -F '\t' '{print $2}' > %s" % (outfile, outfile))
os.system("rm %s.part*" % outfile)
os.system("rm %s/%s_dec_finish.*" % (output_path, eval_phase))
eval_result = self.evaluator.eval(outfile,
phase=eval_phase.split("_")[0], features=features)
print("[%s evaluation] %s, elapsed time: %f s"
% (eval_phase, eval_result, time_end - time_begin))
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""trigram_blocking for sequence generation"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import paddle.fluid as fluid
class TrigramBlocking(object):
"""trigram blocking check data holder
"""
def __init__(self, init_token, roberta_tokenizer, beam_size, use_fp16=False):
"""use tokenizer to generate the real-tokens from sub-token ids.
but we can't pass the tokenizer to network, so we need make a trick
"""
# => [N, T==0, 1]
self._alive_seq = fluid.layers.fill_constant_batch_size_like(
input=init_token,
shape=[-1, 0, 1],
dtype=init_token.dtype,
value=0)
self._cand_seq = fluid.layers.fill_constant_batch_size_like(
input=init_token,
shape=[-1, 0, beam_size],
dtype=init_token.dtype,
value=0)
self.beam_size = beam_size
self._dtype = "float32" if not use_fp16 else "float16"
_SHAPE_PLACEHOLDER = [10, beam_size]
self._delta_score_out = fluid.layers.create_parameter(shape=_SHAPE_PLACEHOLDER, dtype=self._dtype,
name="duplicated_trigram_blocking_delta_score_out")
self.tokenizer = roberta_tokenizer
id2is_full_token = self._build_id2is_full_token(self.tokenizer, self._dtype)
self._id2is_full_token = fluid.layers.create_parameter(
shape=id2is_full_token.shape,
dtype=self._dtype,
name="duplicated_trigram_blocking_id2is_full_token",
default_initializer=fluid.initializer.NumpyArrayInitializer(id2is_full_token))
def update_seq(self, new_step_id, gather_idx):
"""update alive sequence. need pre-gather the inner seq then concat the new step id"""
alive_seq = fluid.layers.gather(self._alive_seq, gather_idx)
# => [N, T+1, 1]
alive_seq = fluid.layers.concat([alive_seq, new_step_id], axis=1)
fluid.layers.assign(alive_seq, self._alive_seq)
return self._alive_seq
def expand_cand_seq(self, new_topk_indx):
"""expand the alive seq by concatenating the topk candidates"""
new_topk_indx = fluid.layers.unsqueeze(new_topk_indx, axes=[1]) # (batch_size, 1, beam_size)
cand_seq = fluid.layers.expand(self._alive_seq, expand_times=[1, 1, self.beam_size])
# => [N, T+1, beam_size]
expand_cand_seq = fluid.layers.concat([cand_seq, new_topk_indx], axis=1)
fluid.layers.assign(expand_cand_seq, self._cand_seq)
return self._cand_seq
@property
def alive_seq(self):
"""alive seq"""
return self._alive_seq
@property
def cand_seq(self):
"""candidate seq"""
return self._cand_seq
@property
def delta_score_out(self):
"""delta score out"""
return self._delta_score_out
@property
def id2is_full_token(self):
"""id->isfulltoken"""
return self._id2is_full_token
@staticmethod
def blocking_forward(cand_seq, id2is_full_token):
"""py_func can't be member function
run the trigram-blocking logic. return `delta-score` for every sequence.
for seq which has duplicated trigram, set delta-score = -inf,
else set delta-score = 0
in the outer, should do the `seq-score + delta-score` logic
alive_seq: shape = [N, T, 1]
Returns
---------
np.array, shape = [N, 1]
"""
_BLOCKING_DELTA = -65000.0  # safely inside float16 range (min finite float16 is -65504.0)
_KEEP_DELTA = 0.0
cand_seq = np.array(cand_seq) # (batch_size, dec_len, beam_size)
cand_seq = np.transpose(cand_seq, axes=(0, 2, 1)) # (batch_size, beam_size, dec_len)
id2is_full_token = np.array(id2is_full_token)
def _sub_token_id2full_tokens(sub_token_ids):
full_tokens = []
for sub_token_id in sub_token_ids:
is_full_token = bool(id2is_full_token[sub_token_id])
if is_full_token or not full_tokens:
full_tokens.append([sub_token_id])
else:
pre_full_token = full_tokens[-1]
pre_full_token.append(sub_token_id)
full_tokens = ["-".join(map(str, full_token)) for full_token in full_tokens]
return full_tokens
_make_trigram_str = lambda trigram_tokens: "_".join(trigram_tokens)
delta_list = []
for beam_cand_ids in cand_seq:
delta_score = []
for one_seq_ids in beam_cand_ids:
sub_token_ids = one_seq_ids.reshape(-1)
tokens = _sub_token_id2full_tokens(sub_token_ids)
if len(tokens) <= 3:
delta_score.append(_KEEP_DELTA)
continue
# don't include the last trigram(checking self)!
trigrams = [_make_trigram_str(tokens[end - 3: end]) for end in range(3, len(tokens))]
trigrams_set = set(trigrams)
last_trigram = _make_trigram_str(tokens[-3:])
if last_trigram in trigrams_set:
# duplicated
delta_score.append(_BLOCKING_DELTA)
else:
delta_score.append(_KEEP_DELTA)
delta_list.append(delta_score)
return np.array(delta_list, dtype=id2is_full_token.dtype).reshape(cand_seq.shape[0], cand_seq.shape[1])
@staticmethod
def blocking_backward(*args):
"""blocking backward"""
raise ValueError("Impossible call backward.")
def _build_id2is_full_token(self, tokenizer, dtype):
vocab_sz = tokenizer.vocab_size()
is_full_token = [0.0] * vocab_sz
for token_id in range(vocab_sz):
token = tokenizer.convert_id_to_token(token_id)
token_str = tokenizer.gptbpe_tokenizer.decode_token(token)
if token_str.startswith(' '):
is_full_token[token_id] = 1.0
return np.array(is_full_token, dtype=dtype)
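# A tiny illustrative check of blocking_forward (toy inputs assumed here; not
# part of training). Every id is marked as a full token; batch of 1, 2 beams.
def _trigram_blocking_demo():
    """Beam 0 repeats the trigram (1, 2, 3) and gets blocked; beam 1 is kept."""
    id2is_full_token = np.ones([10], dtype="float32")
    # cand_seq layout is [N, T, beam_size]; beam column 0 decodes 1 2 3 1 2 3,
    # beam column 1 decodes 1 2 3 4 5 6.
    cand_seq = np.array([[[1, 1], [2, 2], [3, 3], [1, 4], [2, 5], [3, 6]]])
    delta = TrigramBlocking.blocking_forward(cand_seq, id2is_full_token)
    return delta  # -> [[-65000.0, 0.0]]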
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
launch for multi process training
"""
import sys
import subprocess
import os
import copy
import argparse
from utils.args import ArgumentGroup, print_arguments
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
multip_g = ArgumentGroup(parser, "multiprocessing",
"start paddle training using multi-processing mode.")
multip_g.add_arg("node_ips", str, None,
"paddle trainer ips")
multip_g.add_arg("node_id", int, None,
"the trainer id of the node for multi-node distributed training.")
multip_g.add_arg("print_config", bool, True,
"print the config of multi-processing mode.")
multip_g.add_arg("current_node_ip", str, None,
"the ip of current node.")
multip_g.add_arg("split_log_path", str, "log",
"log path for each trainer.")
multip_g.add_arg("log_prefix", str, "",
"the prefix name of job log.")
multip_g.add_arg("nproc_per_node", int, 8,
"the number of process to use on each node.")
multip_g.add_arg("selected_gpus", str, "0,1,2,3,4,5,6,7",
"the gpus selected to use.")
multip_g.add_arg("training_script", str, None, "the program/script to be lauched "
"in parallel followed by all the arguments", positional_arg=True)
multip_g.add_arg("training_script_args", str, None,
"training script args", positional_arg=True, nargs=argparse.REMAINDER)
# yapf: enable
def start_procs(args):
""" start_procs """
default_env = os.environ.copy()
node_id = args.node_id
node_ips = [x.strip() for x in args.node_ips.split(',')]
current_ip = args.current_node_ip
num_nodes = len(node_ips)
selected_gpus = [x.strip() for x in args.selected_gpus.split(',')]
selected_gpu_num = len(selected_gpus)
start_port = int(default_env['PADDLE_PORT'])
all_trainer_endpoints = ""
for ip in node_ips:
cur_port = start_port + 1
for i in range(args.nproc_per_node):
cur_port += 1
if all_trainer_endpoints != "":
all_trainer_endpoints += ","
all_trainer_endpoints += "%s:%d" % (ip, cur_port)
nranks = num_nodes * args.nproc_per_node
gpus_per_proc = selected_gpu_num % args.nproc_per_node
if gpus_per_proc == 0:
    gpus_per_proc = selected_gpu_num // args.nproc_per_node
else:
    gpus_per_proc = selected_gpu_num // args.nproc_per_node + 1
selected_gpus_per_proc = [selected_gpus[i:i + gpus_per_proc]
for i in range(0, len(selected_gpus), gpus_per_proc)]
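# Illustrative split (assuming 8 selected GPUs and nproc_per_node=4):
# gpus_per_proc == 2 and selected_gpus_per_proc == [['0', '1'], ['2', '3'],
# ['4', '5'], ['6', '7']], i.e. two GPUs per trainer process.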
if args.print_config:
print("all_trainer_endpoints: ", all_trainer_endpoints,
", node_id: ", node_id,
", current_ip: ", current_ip,
", num_nodes: ", num_nodes,
", node_ips: ", node_ips,
", gpus_per_proc: ", gpus_per_proc,
", selected_gpus_per_proc: ", selected_gpus_per_proc,
", nranks: ", nranks)
current_env = copy.copy(default_env)
procs = []
cmds = []
log_fns = []
cur_port = start_port + 1
for i in range(0, args.nproc_per_node):
trainer_id = node_id * args.nproc_per_node + i
cur_port += 1
current_env.update({
"FLAGS_selected_gpus": "%s" % ",".join([str(s) for s in selected_gpus_per_proc[i]]),
"PADDLE_TRAINER_ID": "%d" % trainer_id,
"PADDLE_CURRENT_ENDPOINT": "%s:%d" % (current_ip, cur_port),
"PADDLE_TRAINERS_NUM": "%d" % nranks,
"PADDLE_TRAINER_ENDPOINTS": all_trainer_endpoints,
"PADDLE_NODES_NUM": "%d" % num_nodes
})
cmd = [sys.executable, "-u",
args.training_script] + args.training_script_args
cmds.append(cmd)
if args.split_log_path:
fn = open("%s/%sjob.log.%d" % (args.split_log_path, args.log_prefix, trainer_id), "a")
log_fns.append(fn)
process = subprocess.Popen(cmd, env=current_env, stdout=fn, stderr=fn)
else:
process = subprocess.Popen(cmd, env=current_env)
procs.append(process)
for i in range(len(procs)):
proc = procs[i]
proc.wait()
if len(log_fns) > 0:
log_fns[i].close()
if proc.returncode != 0:
raise subprocess.CalledProcessError(returncode=procs[i].returncode,
cmd=cmds[i])
else:
print("proc %d finsh" % i)
print("run success")
def main(args):
""" main_func """
if args.print_config:
print_arguments(args)
start_procs(args)
if __name__ == "__main__":
lanch_args = parser.parse_args()
main(lanch_args)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""tokenization"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import unicodedata
import six
import os
import json
import codecs
import regex as re
from functools import lru_cache
import numpy as np
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
try:
return text.decode("utf-8", "ignore")
except UnicodeDecodeError:
return text.decode("gb18030", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
try:
return text.decode("utf-8", "ignore")
except UnicodeDecodeError:
return text.decode("gb18030", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def printable_text(text):
"""Returns text encoded in a way suitable for print or `tf.logging`."""
# These functions want `str` for both Python2 and Python3, but in one case
# it's a Unicode string and in the other it's a byte string.
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text
elif isinstance(text, unicode):
return text.encode("utf-8")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
fin = codecs.open(vocab_file, 'r', encoding='utf-8')
for num, line in enumerate(fin):
items = convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
index = items[1] if len(items) == 2 else num
token = token.strip()
vocab[token] = int(index)
return vocab
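# Expected vocab file format (illustrative): one entry per line, either
# "token" alone (the line number becomes the id) or "token\tid"; a line with
# more than two tab-separated fields terminates loading.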
def convert_by_vocab(vocab, items):
"""Converts a sequence of [tokens|ids] using the vocab."""
output = []
for item in items:
output.append(vocab[item])
return output
def convert_tokens_to_ids(vocab, tokens):
"""convert tokens to vocab ids"""
return convert_by_vocab(vocab, tokens)
def convert_ids_to_tokens(inv_vocab, ids):
"""convert vocab ids to tokens"""
return convert_by_vocab(inv_vocab, ids)
def whitespace_tokenize(text):
"""Runs basic whitespace cleaning and splitting on a peice of text."""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens
@lru_cache()
def bytes_to_unicode():
"""
Returns list of utf-8 byte and a corresponding list of unicode strings.
The reversible bpe codes work on unicode strings.
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
This is a significant percentage of your normal, say, 32K bpe vocab.
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
This also avoids mapping to whitespace/control characters that the bpe code barfs on.
"""
bs = list(range(33, 126 + 1)) + list(range(161, 172 + 1)) + list(range(174, 255 + 1))
cs = bs[:]
n = 0
for b in range(2 ** 8):
if b not in bs:
bs.append(b)
cs.append(2 ** 8 + n)
n += 1
cs = [chr(n) for n in cs]
return dict(zip(bs, cs))
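# Example: bytes outside the kept printable ranges are remapped to unused
# code points, e.g. the space byte 0x20 becomes chr(256 + 32) == 'Ġ' and the
# newline byte 0x0A becomes chr(256 + 10) == 'Ċ', giving every byte a
# printable, reversible stand-in.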
def get_pairs(word):
"""Return set of symbol pairs in a word.
Word is represented as tuple of symbols (symbols being variable-length strings).
"""
pairs = set()
prev_char = word[0]
for char in word[1:]:
pairs.add((prev_char, char))
prev_char = char
return pairs
class Tokenizer(object):
"""RoBERTa Tokenizer"""
def __init__(self, encoder, bpe_merges, errors='replace'):
self.encoder = encoder
self.decoder = {v: k for k, v in self.encoder.items()}
self.errors = errors # how to handle errors in decoding
self.byte_encoder = bytes_to_unicode()
self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
self.cache = {}
# Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
self.int_token = re.compile(r"^[0-9]+$")
def bpe(self, token):
"""bpe tokenizing"""
if token in self.cache:
return self.cache[token]
word = tuple(token)
pairs = get_pairs(word)
if not pairs:
return token
while True:
bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float('inf')))
if bigram not in self.bpe_ranks:
break
first, second = bigram
new_word = []
i = 0
while i < len(word):
try:
j = word.index(first, i)
new_word.extend(word[i:j])
i = j
except ValueError:
new_word.extend(word[i:])
break
if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
new_word.append(first + second)
i += 2
else:
new_word.append(word[i])
i += 1
new_word = tuple(new_word)
word = new_word
if len(word) == 1:
break
else:
pairs = get_pairs(word)
word = ' '.join(word)
self.cache[token] = word
return word
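# Illustrative merge walk, assuming bpe_ranks orders (l,o) < (e,r) < (lo,w)
# < (low,er): bpe("lower") proceeds (l,o,w,e,r) -> (lo,w,e,r) -> (lo,w,er)
# -> (low,er) -> (lower,), stopping once no remaining pair is in bpe_ranks
# or only one symbol remains.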
def encode(self, text):
"""bpe encoding"""
bpe_tokens = []
for token in re.findall(self.pat, text):
token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
return bpe_tokens
def decode(self, tokens):
"""bpe decoding"""
decoded_tokens = []
for token in tokens:
if self.int_token.match(str(token)):
decoded_tokens.append(self.decoder[int(token)])
else:
decoded_tokens.append(str(token))
text = ''.join(decoded_tokens)
text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
return text
def decode_token(self, token):
"""decode one token"""
if self.int_token.match(str(token)):
token = self.decoder[int(token)]
else:
token = str(token)
text = bytearray([self.byte_decoder[c] for c in token]).decode('utf-8', errors=self.errors)
return text
class GptBpeTokenizer(object):
"""GptBpeTokenizer"""
def __init__(self, vocab_file=None, encoder_json_file=None, vocab_bpe_file=None, do_lower_case=True):
if vocab_file is None:
vocab_file = "./model_files/dict/unimo_en.vocab.txt"
if encoder_json_file is None:
encoder_json_file = "./model_files/dict/unimo_en.encoder.json"
if vocab_bpe_file is None:
vocab_bpe_file = "./model_files/dict/unimo_en.vocab.bpe"
with codecs.open(encoder_json_file, 'r', encoding='utf-8') as f:
encoder = json.load(f)
with codecs.open(vocab_bpe_file, 'r', encoding="utf-8") as f:
bpe_data = f.read()
bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
self.gptbpe_tokenizer = Tokenizer(encoder=encoder, bpe_merges=bpe_merges)
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.cls_token = '[CLS]'
self.pad_token = '[PAD]'
self.sep_token = '[SEP]'
self.unk_token = '[UNK]'
self.mask_token = '[MASK]'
self.single_modal = 'madeupword0000'
self.multi_modal = 'madeupword0001'
self.cls_token_id = self.vocab['[CLS]']
self.pad_token_id = self.vocab['[PAD]']
self.sep_token_id = self.vocab['[SEP]']
self.unk_token_id = self.vocab['[UNK]']
self.mask_token_id = self.vocab['[MASK]']
self.single_modal_id = self.vocab['madeupword0000']
self.multi_modal_id = self.vocab['madeupword0001']
def tokenize(self, text):
"""tokenize text to a list of tokens"""
return [str(token) for token in self.gptbpe_tokenizer.encode(text)]
def convert_tokens_to_ids(self, tokens):
"""convert tokens to vocab ids"""
return convert_by_vocab(self.vocab, tokens)
def convert_token_to_id(self, token):
"""convert token to vocab id"""
return self.vocab[token]
def convert_ids_to_tokens(self, ids):
"""convert vocab ids to tokens"""
return convert_by_vocab(self.inv_vocab, ids)
def convert_id_to_token(self, id):
"""convert vocab id to token"""
return self.inv_vocab[id]
def vocab_size(self):
"""get the vocab size"""
return len(self.vocab)
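# Illustrative usage (the default dict paths above are assumed to exist):
#   tokenizer = GptBpeTokenizer()
#   tokens = tokenizer.tokenize("hello world")     # BPE ids as strings
#   ids = tokenizer.convert_tokens_to_ids(tokens)  # vocab ids
#   text = tokenizer.gptbpe_tokenizer.decode(tokenizer.convert_ids_to_tokens(ids))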
class FullTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
"""tokenize the text"""
split_tokens = []
for token in self.basic_tokenizer.tokenize(text):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
"""convert tokens to vocab ids"""
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
"""convert vocab ids to tokens"""
return convert_by_vocab(self.inv_vocab, ids)
def merge_subword(self, tokens):
"""merge subwords"""
ret = []
for token in tokens:
if token.startswith("##"):
real_token = token[2:]
if len(ret):
ret[-1] += real_token
else:
ret.append(real_token)
else:
ret.append(token)
return ret
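# e.g. merge_subword(["un", "##aff", "##able"]) -> ["unaffable"]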
class CharTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
"""tokenize the text"""
split_tokens = []
for token in text.lower().split(" "):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
"""convert tokens to vocab ids"""
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
"""convert vocab ids to tokens"""
return convert_by_vocab(self.inv_vocab, ids)
class BasicTokenizer(object):
"""Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
def __init__(self, do_lower_case=True):
"""Constructs a BasicTokenizer.
Args:
do_lower_case: Whether to lower case the input.
"""
self.do_lower_case = do_lower_case
def tokenize(self, text):
"""Tokenizes a piece of text."""
text = convert_to_unicode(text)
text = self._clean_text(text)
# This was added on November 1st, 2018 for the multilingual and Chinese
# models. This is also applied to the English models now, but it doesn't
# matter since the English models were not trained on any Chinese data
# and generally don't have any Chinese data in them (there are Chinese
# characters in the vocabulary because Wikipedia does have some Chinese
# words in the English Wikipedia.).
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if self.do_lower_case:
token = token.lower()
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token))
output_tokens = whitespace_tokenize(" ".join(split_tokens))
return output_tokens
def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _run_split_on_punc(self, text):
"""Splits punctuation on a piece of text."""
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
return ["".join(x) for x in output]
def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(" ")
output.append(char)
output.append(" ")
else:
output.append(char)
return "".join(output)
def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xfffd or _is_control(char):
continue
if _is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
class WordpieceTokenizer(object):
"""Runs WordPiece tokenziation."""
def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization
using the given vocabulary.
For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
Args:
text: A single token or whitespace separated tokens. This should have
already been passed through `BasicTokenizer`.
Returns:
A list of wordpiece tokens.
"""
text = convert_to_unicode(text)
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue
is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
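# Greedy longest-match trace for "unaffable" (assuming "un", "##aff" and
# "##able" are in the vocab but no longer match is): the loop first tries
# "unaffable", backs off to "un", then from position 2 tries "##affable",
# backs off to "##aff", and finally matches "##able".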
def _is_whitespace(char):
"""Checks whether `chars` is a whitespace character."""
# \t, \n, and \r are technically control characters but we treat them
# as whitespace since they are generally considered as such.
if char == " " or char == "\t" or char == "\n" or char == "\r":
return True
cat = unicodedata.category(char)
if cat == "Zs":
return True
return False
def _is_control(char):
"""Checks whether `chars` is a control character."""
# These are technically control characters but we count them as whitespace
# characters.
if char == "\t" or char == "\n" or char == "\r":
return False
cat = unicodedata.category(char)
if cat.startswith("C"):
return True
return False
def _is_punctuation(char):
"""Checks whether `chars` is a punctuation character."""
cp = ord(char)
# We treat all non-letter/number ASCII as punctuation.
# Characters such as "^", "$", and "`" are not in the Unicode
# Punctuation class but we treat them as punctuation anyways, for
# consistency.
if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
(cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
return True
cat = unicodedata.category(char)
if cat.startswith("P"):
return True
return False
def build_id2is_full_token(tokenizer, dtype="float32"):
    """build the id -> is-full-token table and return it as a numpy array"""
    vocab_sz = tokenizer.vocab_size()
    is_full_token = [0.0] * vocab_sz
    for token_id in range(vocab_sz):
        token = tokenizer.convert_id_to_token(token_id)
        token_str = tokenizer.gptbpe_tokenizer.decode_token(token)
        if token_str.startswith(' '):
            is_full_token[token_id] = 1.0
    return np.array(is_full_token, dtype=dtype)
if __name__ == "__main__":
pass
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
gather_idx=None,
param_initializer=None,
name='multi_head_att'):
"""
    Multi-Head Attention. Note that attn_bias is added to the logits before
    computing the softmax activation, masking out selected positions so that
    they are not considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError(
"Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_query_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_key_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_value_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
        Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(
x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
        # permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
        Transpose and then reshape the last two dimensions of input tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(
x=trans_x,
shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key ** -0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
cache_k, cache_v = cache["k"], cache["v"]
select_k = layers.gather(cache_k, index=gather_idx)
select_v = layers.gather(cache_v, index=gather_idx)
select_k = layers.reshape(select_k, shape=[0, 0, d_model])
select_v = layers.reshape(select_v, shape=[0, 0, d_model])
k = layers.concat([select_k, k], axis=1)
v = layers.concat([select_v, v], axis=1)
layers.assign(k, cache["k"])
layers.assign(v, cache["v"])
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_output_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
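# A hedged numpy sketch (toy sizes, not part of the original file) of the
# head split / scaled dot-product / head merge performed above, to make the
# tensor shapes concrete.
def _attention_shape_demo():
    """Mirror __split_heads, scaled_dot_product_attention and __combine_heads."""
    import numpy as np
    bs, seq, n_head, d_key = 2, 4, 3, 5
    rng = np.random.RandomState(0)
    q, k, v = (rng.rand(bs, seq, n_head * d_key).astype("float32")
               for _ in range(3))

    def split_heads(x):
        # [bs, seq, n_head * d_key] -> [bs, n_head, seq, d_key]
        return x.reshape(bs, seq, n_head, d_key).transpose(0, 2, 1, 3)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    logits = (q * d_key ** -0.5) @ k.transpose(0, 1, 3, 2)  # [bs, n_head, seq, seq]
    weights = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax
    ctx = weights @ v  # [bs, n_head, seq, d_key]
    # combine heads: [bs, n_head, seq, d_key] -> [bs, seq, n_head * d_key]
    out = ctx.transpose(0, 2, 1, 3).reshape(bs, seq, n_head * d_key)
    assert out.shape == (bs, seq, n_head * d_key)
    return out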
def positionwise_feed_forward(x,
d_inner_hid,
d_hid,
dropout_rate,
hidden_act,
param_initializer=None,
name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(
name=name + '_fc_0.w_0',
initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
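# In equation form the block above computes, at every position independently:
#   ffn(x) = act(x W1 + b1) W2 + b2
# with an optional dropout between the two projections; the UNIMO model calls
# it with d_inner_hid = 4 * hidden_size (see UNIMOModel.encode below).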
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., name=''):
"""
    Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
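# Note on the cmd strings: with preprocess_cmd="" and postprocess_cmd="dan"
# (the setting used by the UNIMO encoder), each sub-layer computes
#   out = layer_norm(x + dropout(sublayer(x)))
# i.e. the post-LN Transformer ordering; preprocess_cmd="n" with
# postprocess_cmd="da" would instead give the pre-LN variant
#   out = x + dropout(sublayer(layer_norm(x)))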
def encoder_layer(query_input,
key_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name='',
cache=None,
gather_idx=None):
"""The encoder layers that can be stacked to form a deep encoder.
    This module consists of a multi-head (self) attention sub-layer followed by
    a position-wise feed-forward network, both wrapped with post_process_layer
    to add residual connection, layer normalization and dropout.
"""
key_input = pre_process_layer(
key_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att') if key_input else None
value_input = key_input if key_input else None
attn_output = multi_head_attention(
pre_process_layer(
query_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
key_input,
value_input,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att',
cache=cache,
gather_idx=gather_idx)
attn_output = post_process_layer(
query_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(
attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(
attn_output,
ffd_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_ffn')
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name='',
caches=None,
gather_idx=None):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(
enc_input,
None,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(i),
cache=caches[i] if caches is not None else None,
gather_idx=gather_idx)
enc_input = enc_output
enc_output = pre_process_layer(
enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
return enc_output
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Unified Visual Language model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import json
import six
import codecs
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from model.transformer_encoder import encoder, pre_process_layer
class UNIMOConfig(object):
"""configuration"""
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with codecs.open(config_path, 'r', encoding='utf-8') as json_file:
config_dict = json.load(json_file)
except Exception:
raise IOError("Error in parsing unimo model config file '%s'" %
config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict.get(key, None)
def __setitem__(self, key, value):
self._config_dict[key] = value
def print_config(self):
"""print config"""
for arg, value in sorted(six.iteritems(self._config_dict)):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
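# A hedged usage sketch for UNIMOConfig; the config file path below is an
# assumption (it should be whatever JSON ships with the downloaded model):
#
#   config = UNIMOConfig("/path/to/unimo_base_en/unimo_config.json")
#   config.print_config()
#   hidden = config["hidden_size"]  # __getitem__ returns None for absent keys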
class UNIMOModel(object):
"""UNIMO model for finetuning"""
def __init__(self,
emb_ids=None,
emb_obj_ids=None,
input_mask=None,
config=None,
image_input=None,
text_adv_delta=None,
image_adv_delta=None,
weight_sharing=True,
task_type="normal",
decoding=False,
gather_idx=None):
self.text_adv_delta = text_adv_delta
self.image_adv_delta = image_adv_delta
self._emb_size = config['hidden_size']
self._n_layer = config['num_hidden_layers']
self._n_head = config['num_attention_heads']
self._voc_size = config['vocab_size']
self._max_position_seq_len = config['max_position_embeddings']
self._hidden_act = config['hidden_act']
self._prepostprocess_dropout = config['hidden_dropout_prob']
self._attention_dropout = config['attention_probs_dropout_prob']
self._weight_sharing = weight_sharing
self._task_type = task_type
self._emb_vocab_size = {"word_embedding": self._voc_size,
"pos_embedding": self._max_position_seq_len}
        assert emb_ids is not None or image_input is not None, "emb_ids and image_input cannot both be None"
self._is_dialogue_task = (task_type == "dialog")
self._is_img2txt_task = (task_type == "img2txt")
self._is_multimodal_task = (image_input is not None)
if emb_ids is not None and image_input is not None and emb_obj_ids is not None:
self._input_type = 'vol'
elif emb_ids is not None and image_input is not None:
self._input_type = 'vl'
elif emb_ids is not None:
self._input_type = 'l'
elif image_input is not None and emb_obj_ids is not None:
self._input_type = 'vo'
else:
raise ValueError('input feature error')
if self._is_dialogue_task:
self._role_type_size = config["role_type_size"]
self._turn_type_size = config["turn_type_size"]
self._emb_vocab_size["role_embedding"] = self._role_type_size
self._emb_vocab_size["turn_embedding"] = self._turn_type_size
else:
self._sent_types = config['type_vocab_size']
self._emb_vocab_size["sent_embedding"] = self._sent_types
if self._is_multimodal_task or self._is_img2txt_task:
self._image_class_size = config['image_class_size']
self._class_attr_size = config['class_attr_size']
self._image_embedding_size = config['image_embedding_size']
self._image_predict_feature = config['image_predict_feature']
self._image_predict_class = config['image_predict_class']
self._image_use_attr = config['image_use_attr']
self._image_use_soft_label = config['image_use_soft_label']
self._image_emb_name = "image_embedding"
self._loc_emb_name = "loc_embedding"
self._emb_dtype = "float32"
if decoding:
self.caches = [{
"k":
fluid.layers.fill_constant_batch_size_like(
input=emb_ids["word_embedding"] if emb_ids is not None else image_input["image_embedding"],
shape=[-1, 0, self._emb_size],
dtype=self._emb_dtype, # float32,
value=0),
"v":
fluid.layers.fill_constant_batch_size_like(
input=emb_ids["word_embedding"] if emb_ids is not None else image_input["image_embedding"],
shape=[-1, 0, self._emb_size],
dtype=self._emb_dtype, # float32,
value=0),
} for i in range(self._n_layer)]
else:
self.caches = None
        # Initialize all weights with a truncated normal initializer; all
        # biases are initialized to zero by default.
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=config['initializer_range'])
self._build_model(emb_ids=emb_ids,
input_mask=input_mask,
image_input=image_input,
emb_obj_ids=emb_obj_ids,
gather_idx=gather_idx)
def _build_model(self, emb_ids=None, input_mask=None, image_input=None, emb_obj_ids=None, gather_idx=None):
"""build unimo model"""
self._enc_vol_out = None
self._enc_vl_out = None
self._enc_v_out = None
self._enc_l_out = None
if self._input_type == 'vol':
self._enc_vol_out, self._enc_v_out, self._enc_l_out = self.encode(emb_ids=emb_ids,
input_mask=input_mask,
image_input=image_input,
emb_obj_ids=emb_obj_ids,
gather_idx=gather_idx)
elif self._input_type == 'vl':
self._enc_vl_out, self._enc_v_out, self._enc_l_out = self.encode(emb_ids=emb_ids,
input_mask=input_mask,
image_input=image_input,
gather_idx=gather_idx)
elif self._input_type == 'vo':
self._enc_v_out = self.encode(input_mask=input_mask,
image_input=image_input,
emb_obj_ids=emb_obj_ids,
gather_idx=gather_idx)
elif self._input_type == 'l':
self._enc_l_out = self.encode(emb_ids=emb_ids,
input_mask=input_mask,
gather_idx=gather_idx)
else:
raise ValueError("The input type is invalid")
def encode(self, emb_ids=None, input_mask=None, image_input=None, emb_obj_ids=None, gather_idx=None):
"""unimo encoder"""
emb_feature, n_head_self_attn_mask, _v_seq_len, _o_seq_len = self._gen_input(emb_ids=emb_ids,
input_mask=input_mask,
image_input=image_input,
emb_obj_ids=emb_obj_ids)
enc_out = encoder(
enc_input=emb_feature,
attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer,
n_head=self._n_head,
d_key=self._emb_size // self._n_head,
d_value=self._emb_size // self._n_head,
d_model=self._emb_size,
d_inner_hid=self._emb_size * 4,
prepostprocess_dropout=self._prepostprocess_dropout,
attention_dropout=self._attention_dropout,
relu_dropout=0,
hidden_act=self._hidden_act,
preprocess_cmd="",
postprocess_cmd="dan",
param_initializer=self._param_initializer,
name='encoder',
caches=self.caches,
gather_idx=gather_idx)
if self._input_type == 'vol':
assert _v_seq_len is not None and _o_seq_len is not None, "the input is invalid"
_vol_seq_len = layers.shape(enc_out)[1]
enc_v_out = fluid.layers.slice(
input=enc_out, axes=[1], starts=[0], ends=[_v_seq_len])
enc_o_out = fluid.layers.slice(
input=enc_out, axes=[1], starts=[_v_seq_len], ends=[_v_seq_len + _o_seq_len])
enc_l_out = fluid.layers.slice(
input=enc_out, axes=[1], starts=[_v_seq_len + _o_seq_len], ends=[_vol_seq_len])
enc_vol_out = enc_out
return enc_vol_out, enc_v_out, enc_l_out
elif self._input_type == 'vl':
assert _v_seq_len is not None and _o_seq_len is None, "the input is invalid"
_vl_seq_len = layers.shape(enc_out)[1]
enc_v_out = fluid.layers.slice(
input=enc_out, axes=[1], starts=[0], ends=[_v_seq_len])
enc_l_out = fluid.layers.slice(
input=enc_out, axes=[1], starts=[_v_seq_len], ends=[_vl_seq_len])
enc_vl_out = enc_out
return enc_vl_out, enc_v_out, enc_l_out
elif self._input_type == 'vo':
assert _v_seq_len is not None and _o_seq_len is not None, "the input is invalid"
enc_v_out = fluid.layers.slice(
input=enc_out, axes=[1], starts=[0], ends=[_v_seq_len])
return enc_v_out
elif self._input_type == 'l':
assert _v_seq_len is None and _o_seq_len is None, "the input is invalid"
enc_l_out = enc_out
return enc_l_out
else:
raise ValueError("The input type is invalid")
def _gen_input(self, emb_ids=None, input_mask=None, image_input=None, emb_obj_ids=None):
assert input_mask is not None, "input_mask should not be none"
self_attn_mask = input_mask
self_attn_mask = fluid.layers.scale(
x=self_attn_mask, scale=1e4, bias=-1.0, bias_after_scale=False)
n_head_self_attn_mask = fluid.layers.stack(
x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True
emb_feature, _v_seq_len, _o_seq_len = None, None, None
if emb_ids is not None:
emb_out = None
# text part
for emb_name, emb_id in emb_ids.items():
if emb_name == "sent_embedding":
continue # don't use sentence embedding
emb = fluid.layers.embedding(
input=emb_id,
size=[self._emb_vocab_size[emb_name], self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=emb_name, initializer=self._param_initializer))
emb_out = emb_out + emb if emb_out else emb
if self.text_adv_delta is not None:
emb_out = emb_out + self.text_adv_delta
emb_out = pre_process_layer(
emb_out, 'nd', self._prepostprocess_dropout, name="pre_encoder")
if image_input is not None:
# visual part
if self.image_adv_delta is not None:
emb_v_in = image_input[self._image_emb_name]
emb_v_in = emb_v_in + self.image_adv_delta
else:
emb_v_in = image_input[self._image_emb_name]
image_embeddings = fluid.layers.fc(emb_v_in, # [batch_size, 37, 2048]
self._emb_size,
param_attr=fluid.ParamAttr(
name="image_emb.w_0",
initializer=self._param_initializer),
bias_attr="image_emb.b_0",
num_flatten_dims=2)
loc_emb_out = fluid.layers.fc(image_input[self._loc_emb_name], # [batch_size, 37, 5]
self._emb_size,
param_attr=fluid.ParamAttr(
name="image_loc.w_0",
initializer=self._param_initializer),
bias_attr="image_loc.b_0",
num_flatten_dims=2)
emb_v_out = image_embeddings + loc_emb_out
emb_v_out = pre_process_layer(
emb_v_out, 'nd', self._prepostprocess_dropout, name='v_pre_encoder')
_v_seq_len = layers.shape(emb_v_out)[1]
if emb_obj_ids is not None:
emb_obj_out = None
# text part
for emb_obj_name, emb_obj_id in emb_obj_ids.items():
if emb_obj_name == "sent_embedding":
continue # don't use sentence embedding in roberta
emb_obj = fluid.layers.embedding(
input=emb_obj_id,
size=[self._emb_vocab_size[emb_obj_name], self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=emb_obj_name, initializer=self._param_initializer))
emb_obj_out = emb_obj_out + emb_obj if emb_obj_out else emb_obj
emb_obj_out = pre_process_layer(
emb_obj_out, 'nd', self._prepostprocess_dropout, name="pre_encoder")
_o_seq_len = layers.shape(emb_obj_out)[1]
if self._input_type == 'vol':
assert emb_ids is not None and image_input is not None and emb_obj_ids is not None, "the input is invalid"
emb_feature = fluid.layers.concat([emb_v_out, emb_obj_out, emb_out], axis=1)
elif self._input_type == 'vl':
assert emb_ids is not None and image_input is not None and emb_obj_ids is None, "the input is invalid"
emb_feature = fluid.layers.concat([emb_v_out, emb_out], axis=1)
elif self._input_type == 'l':
assert emb_ids is not None and image_input is None and emb_obj_ids is None, "the input is invalid"
emb_feature = emb_out
elif self._input_type == 'vo':
assert emb_ids is None and image_input is not None and emb_obj_ids is not None, "the input is invalid"
emb_feature = fluid.layers.concat([emb_v_out, emb_obj_out], axis=1)
else:
raise ValueError("The input type is invalid")
return [emb_feature, n_head_self_attn_mask, _v_seq_len, _o_seq_len]
def get_sequence_output(self):
"""get sequence output"""
return self._enc_l_out
def get_pooled_output(self):
"""Get the first feature of each sequence for classification"""
text_feat = self.get_pooled_text_output()
visual_feat = self.get_pooled_visual_output()
return text_feat, visual_feat
def get_pooled_visual_output(self):
"""Get the first feature of each sequence for classification"""
if self._enc_v_out is None:
return None
visual_feat = fluid.layers.slice(
input=self._enc_v_out, axes=[1], starts=[0], ends=[1])
visual_feat = fluid.layers.reshape(
x=visual_feat, shape=[-1, self._emb_size])
visual_feat = fluid.layers.fc(
input=visual_feat,
size=self._emb_size,
act="relu",
param_attr=fluid.ParamAttr(
name="pooled_fc_image.w_0",
initializer=self._param_initializer),
bias_attr="pooled_fc_image.b_0")
return visual_feat
def get_pooled_text_output(self):
"""Get the first feature of each sequence for classification"""
if self._enc_l_out is None:
return None
text_feat = fluid.layers.slice(
input=self._enc_l_out, axes=[1], starts=[0], ends=[1])
text_feat = fluid.layers.reshape(
x=text_feat, shape=[-1, self._emb_size])
text_feat = fluid.layers.fc(
input=text_feat,
size=self._emb_size,
act="relu",
param_attr=fluid.ParamAttr(
name="pooled_fc_text.w_0",
initializer=self._param_initializer),
bias_attr="pooled_fc_text.b_0"
)
return text_feat
def get_match_output(self, text, image, mode="mul"):
"""get_match_output"""
if mode == "sum":
emb_fuse = text + image
elif mode == "mul":
emb_fuse = text * image
        else:
            raise ValueError("current mode %s is not supported" % mode)
emb_fuse = fluid.layers.dropout(emb_fuse,
self._attention_dropout,
dropout_implementation="upscale_in_train")
return emb_fuse
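# A hedged usage sketch (variable names assumed) of the pooled outputs above
# for image-text matching: the two [CLS] features are fused element-wise and
# fed to a small classifier head, e.g.
#
#   text_feat, visual_feat = model.get_pooled_output()
#   fuse = model.get_match_output(text_feat, visual_feat, mode="mul")
#   logits = fluid.layers.fc(input=fuse, size=2)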
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""padding and batching."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
def pad_batch_data(insts,
pretraining_task='seq2seq',
pad_idx=1,
sent_b_starts=None,
return_pos=False,
return_input_mask=False,
return_max_len=False,
return_num_token=False,
return_seq_lens=False):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and attention bias.
"""
return_list = []
max_len = max(len(inst) for inst in insts)
    # Any token included in the dict can be used for padding, since the loss on
    # paddings will be masked out by weights and has no effect on gradients.
inst_data = np.array(
[inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
return_list += [inst_data.astype('int64').reshape([-1, max_len, 1])]
# position data
if return_pos:
inst_pos = np.array([
list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
for inst in insts
])
return_list += [inst_pos.astype('int64').reshape([-1, max_len, 1])]
if return_input_mask:
        if pretraining_task == 'seq2seq':
            assert sent_b_starts is not None, \
                "[FATAL] For seq2seq language model loss," \
                " sent_b_starts should not be None"
# This is used to avoid attention on paddings and subsequent words.
input_mask_data = np.zeros((inst_data.shape[0], max_len, max_len))
for index, mask_data in enumerate(input_mask_data):
start = sent_b_starts[index]
end = len(insts[index])
mask_data[:end, :start] = 1.0
                # Fill the target-to-target block with a lower triangular (causal) matrix
b = np.tril(np.ones([end - start, end - start]), 0)
mask_data[start:end, start:end] = b
input_mask_data = np.array(input_mask_data).reshape([-1, max_len, max_len])
else:
# This is used to avoid attention on paddings.
input_mask_data = np.array([[1] * len(inst) + [0] *
(max_len - len(inst)) for inst in insts])
input_mask_data = np.expand_dims(input_mask_data, axis=-1)
# input_mask_data = np.matmul(input_mask_data, np.transpose(input_mask_data, (0, 2, 1)))
return_list += [input_mask_data.astype("float32")]
if return_max_len:
return_list += [max_len]
if return_num_token:
num_token = 0
for inst in insts:
num_token += len(inst)
return_list += [num_token]
if return_seq_lens:
seq_lens = np.array([len(inst) for inst in insts])
return_list += [seq_lens.astype('int64').reshape([-1, 1])]
return return_list if len(return_list) > 1 else return_list[0]
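# A tiny, hedged illustration of pad_batch_data on two toy instances; the
# ids below are arbitrary and pad_idx=1 matches this repo's default.
def _pad_batch_demo():
    ids, mask = pad_batch_data([[5, 6, 7], [8, 9]],
                               pretraining_task='nlu',
                               pad_idx=1,
                               return_input_mask=True)
    assert ids.shape == (2, 3, 1) and ids[1, 2, 0] == 1    # padded with pad_idx
    assert mask.shape == (2, 3, 1) and mask[1, 2, 0] == 0  # padding masked out
    return ids, mask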
def pad_feature_data(data, pad_value=0.0, dtype="float32", return_mask=False, batch_image_size=None):
"""for image feature sequence padding"""
    # number of boxes + 1; the extra one is the global image feature
    max_length = max([len(item) for item in data])
    data_width = len(data[0][0])
    out_data = np.ones((len(data), max_length, data_width), dtype=dtype) * pad_value
    out_mask = np.zeros((len(data), max_length, 1), dtype=dtype)
for i in range(len(data)):
out_data[i, 0:len(data[i]), :] = data[i]
if return_mask and batch_image_size[i] > 1:
out_mask[i, 0:len(data[i]), :] = 1.0
if return_mask:
return out_data, out_mask
else:
return out_data
def gen_seq2seq_mask(insts, sent_b_starts=None):
"""
generate input mask for seq2seq
"""
max_len = max(len(inst) for inst in insts)
input_mask_data = np.zeros((len(insts), max_len, max_len))
for index, mask_data in enumerate(input_mask_data):
start = sent_b_starts[index]
end = len(insts[index])
mask_data[:end, :start] = 1.0
        # Fill the target-to-target block with a lower triangular (causal) matrix
b = np.tril(np.ones([end - start, end - start]), 0)
mask_data[start:end, start:end] = b
input_mask_data = np.array(input_mask_data, dtype='float32').reshape([-1, max_len, max_len])
return input_mask_data
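# A hedged numpy check of gen_seq2seq_mask on one toy instance of length 4
# whose target starts at position 2: every token sees the full source prefix,
# while target positions only see themselves and earlier target positions.
def _seq2seq_mask_demo():
    mask = gen_seq2seq_mask([[7, 8, 9, 10]], sent_b_starts=[2])
    expected = np.array([[[1, 1, 0, 0],
                          [1, 1, 0, 0],
                          [1, 1, 1, 0],
                          [1, 1, 1, 1]]], dtype="float32")
    assert (mask == expected).all()
    return mask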
if __name__ == "__main__":
pass
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""data reader for text classification tasks"""
import os
import csv
import numpy as np
import copy
from collections import namedtuple
from model import tokenization
from reader.batching import pad_batch_data
class ClassifyReader(object):
"""ClassifyReader"""
def __init__(self, tokenizer, args):
self.tokenizer = tokenizer
self.pad_id = tokenizer.pad_token_id
self.cls_id = tokenizer.cls_token_id
self.sep_id = tokenizer.sep_token_id
self.mask_id = tokenizer.mask_token_id
self.max_seq_len = args.max_seq_len
self.in_tokens = args.in_tokens
self.random_seed = 0
self.global_rng = np.random.RandomState(self.random_seed)
self.trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
self.trainer_nums = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
self.current_example = 0
self.current_epoch = 0
self.num_examples = 0
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_example, self.current_epoch
def _read_tsv(self, input_file, quotechar=None):
"""Reads a tab separated value file."""
with open(input_file, "r") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
headers = next(reader)
text_indices = [
index for index, h in enumerate(headers) if h != "label"
]
Example = namedtuple('Example', headers)
examples = []
for line in reader:
example = Example(*line)
examples.append(example)
return examples
def _pad_batch_records(self, batch_records):
batch_token_ids = [record.token_ids for record in batch_records]
batch_text_type_ids = [record.text_type_ids for record in batch_records]
batch_position_ids = [record.position_ids for record in batch_records]
batch_labels = [record.label_id for record in batch_records]
batch_labels = np.array(batch_labels).astype('int64').reshape([-1, 1])
if batch_records[0].qid:
batch_qids = [record.qid for record in batch_records]
batch_qids = np.array(batch_qids).astype('int64').reshape([-1, 1])
else:
batch_qids = np.array([]).astype('int64').reshape([-1, 1])
# padding
padded_token_ids, input_mask = pad_batch_data(
batch_token_ids, pretraining_task='nlu', pad_idx=self.pad_id, return_input_mask=True)
padded_text_type_ids = pad_batch_data(
batch_text_type_ids, pretraining_task='nlu', pad_idx=self.pad_id)
padded_position_ids = pad_batch_data(
batch_position_ids, pretraining_task='nlu', pad_idx=self.pad_id)
input_mask = np.matmul(input_mask, np.transpose(input_mask, (0, 2, 1)))
return_list = [
padded_token_ids, padded_text_type_ids, padded_position_ids,
input_mask, batch_labels, batch_qids
]
return return_list
def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
def _convert_example_to_record(self, example, max_seq_length, tokenizer):
"""Converts a single `Example` into a single `Record`."""
text_a = tokenization.convert_to_unicode(example.text_a)
tokens_a = tokenizer.tokenize(text_a)
tokens_b = None
if "text_b" in example._fields:
text_b = tokenization.convert_to_unicode(example.text_b)
tokens_b = tokenizer.tokenize(text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[0:(max_seq_length - 2)]
tokens = []
text_type_ids = []
tokens.append("[CLS]")
text_type_ids.append(0)
for token in tokens_a:
tokens.append(token)
text_type_ids.append(0)
tokens.append("[SEP]")
text_type_ids.append(0)
if tokens_b:
for token in tokens_b:
tokens.append(token)
text_type_ids.append(1)
tokens.append("[SEP]")
text_type_ids.append(1)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
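        # Note: position ids deliberately start from 2, matching the
        # "pos start from 2" convention used elsewhere in this repo
        # (e.g. the decoding start position in the img2txt reader).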
position_ids = list(range(2, len(token_ids) + 2))
label_id = example.label
Record = namedtuple(
'Record',
['token_ids', 'text_type_ids', 'position_ids', 'label_id', 'qid'])
qid = None
if "qid" in example._fields:
qid = example.qid
record = Record(
token_ids=token_ids,
text_type_ids=text_type_ids,
position_ids=position_ids,
label_id=label_id,
qid=qid)
return record
def _prepare_batch_data(self, examples, batch_size, phase=None):
"""generate batch records"""
batch_records, max_len = [], 0
for index, example in enumerate(examples):
if phase == "train":
self.current_example = index
record = self._convert_example_to_record(example, self.max_seq_len,
self.tokenizer)
max_len = max(max_len, len(record.token_ids))
if self.in_tokens:
to_append = (len(batch_records) + 1) * max_len <= batch_size
else:
to_append = len(batch_records) < batch_size
if to_append:
batch_records.append(record)
else:
yield self._pad_batch_records(batch_records)
batch_records, max_len = [record], len(record.token_ids)
if batch_records:
yield self._pad_batch_records(batch_records)
def get_num_examples(self, input_file):
"""get_num_examples"""
examples = self._read_tsv(input_file)
return len(examples)
def data_generator(self,
input_file,
batch_size,
epoch,
dev_count=1,
shuffle=True,
phase=None):
"""data_generator"""
examples = self._read_tsv(input_file)
def wrapper():
"""wrapper"""
all_dev_batches = []
trainer_id = 0
for epoch_index in range(epoch):
if phase == "train":
self.current_example = 0
self.current_epoch = epoch_index
self.random_seed = epoch_index
self.global_rng = np.random.RandomState(self.random_seed)
trainer_id = self.trainer_id
else:
trainer_id = 0
                    assert dev_count == 1, "only supports 1 GPU during prediction"
current_examples = copy.deepcopy(examples)
if shuffle:
self.global_rng.shuffle(current_examples)
for batch_data in self._prepare_batch_data(
current_examples, batch_size, phase=phase):
if len(all_dev_batches) < dev_count:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == dev_count:
yield all_dev_batches[trainer_id]
all_dev_batches = []
if phase != "train" and self.trainer_id < len(all_dev_batches):
yield all_dev_batches[self.trainer_id]
return wrapper
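# A hedged usage sketch (argument names assumed to match the finetune
# scripts); the generator yields the six arrays built by _pad_batch_records:
#
#   reader = ClassifyReader(tokenizer, args)
#   for batch in reader.data_generator(args.train_set, args.batch_size,
#                                      args.epoch, phase="train")():
#       token_ids, type_ids, pos_ids, input_mask, labels, qids = batch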
if __name__ == '__main__':
pass
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""data reader for image-to-text generation tasks"""
import csv
csv.field_size_limit(1024 * 1024)
import numpy as np
from collections import namedtuple
import base64
import os
import gzip
import paddle.fluid as fluid
from reader.batching import pad_batch_data, pad_feature_data
class Img2TxtReader(object):
"""image-to-text reader"""
def __init__(self, tokenizer, args):
self.tokenizer = tokenizer
self.pad_id = tokenizer.pad_token_id
self.cls_id = tokenizer.cls_token_id
self.sep_id = tokenizer.sep_token_id
self.mask_id = tokenizer.mask_token_id
self.tgt_type_id = args.tgt_type_id
self.max_img_len = args.max_img_len # 37
self.max_obj_len = args.max_obj_len
self.image_embedding_size = args.image_embedding_size # 2048
self.max_tgt_len = args.max_tgt_len
self.max_out_len = args.max_out_len
self.obj_dict = self.load_obj_file(args.object_file)
# random_seed must be set for data slicing when using multi-gpu
if args.random_seed:
np.random.seed(args.random_seed)
else:
np.random.seed(0)
self.trainer_id = 0
self.trainer_nums = 1
if os.getenv("PADDLE_TRAINER_ID"):
self.trainer_id = int(os.getenv("PADDLE_TRAINER_ID"))
if os.getenv("PADDLE_TRAINERS_NUM"):
self.trainer_nums = int(os.getenv("PADDLE_TRAINERS_NUM"))
self.current_example = 0
self.current_epoch = 0
self.num_examples = 0
self.features = {}
def get_progress(self):
"""return current progress of traning data
"""
return [self.current_epoch, self.current_file_index, self.total_file, self.current_file]
def get_num_examples(self, filelist):
"""get total number of examples"""
num_exp = 0
files = open(filelist).readlines()
for index, file_ in enumerate(files):
file_ = file_.strip()
if file_.endswith('.gz'):
with gzip.open(file_, "rt") as f:
for line in f:
if line is None:
continue
num_exp += 1
else:
with open(file_, "r") as f:
for line in f:
if line is None:
continue
num_exp += 1
return num_exp
def parse_line(self, line):
""" parse one line to token_ids, sentence_ids, pos_ids, label
"""
line = line.strip('\r\n').split(";")
if len(line) == 16:
(image_id, caption_id, token_ids, sent_ids, pos_ids, seg_labels,
node_attr_b, node_attr_e, image_h, image_w, number_box, boxes,
probs, attr_probs, image_embeddings, label) = line
image_id = image_id[9:]
else:
raise ValueError("One sample have %d fields!" % len(line))
def decode_feature(base64_str, size):
"""decode image feature"""
fea_base64 = base64.b64decode(base64_str)
fea_decode = np.frombuffer(fea_base64, dtype=np.float32)
shape = size, int(fea_decode.shape[0] / size)
features = np.resize(fea_decode, shape)
return features
number_box = int(number_box)
boxes = decode_feature(boxes, number_box)
# probs = decode_feature(probs, number_box)
image_embeddings = decode_feature(image_embeddings, number_box)
image_embeddings_cls = np.mean(image_embeddings, axis=0, keepdims=True)
image_embeddings = np.concatenate([image_embeddings_cls, image_embeddings], 0)
image_location = np.zeros((boxes.shape[0], 5), dtype=np.float32)
image_location[:, :4] = boxes
image_location[:, 4] = (image_location[:, 3] - image_location[:, 1]) * (
image_location[:, 2] - image_location[:, 0]) / (float(image_w) * float(image_h))
image_location[:, 0] = image_location[:, 0] / float(image_w)
image_location[:, 1] = image_location[:, 1] / float(image_h)
image_location[:, 2] = image_location[:, 2] / float(image_w)
image_location[:, 3] = image_location[:, 3] / float(image_h)
g_location = np.array([0, 0, 1, 1, 1])
image_location = np.concatenate([np.expand_dims(g_location, axis=0), image_location], axis=0)
image_loc = image_location
obj_token_ids, obj_sent_ids, obj_pos_ids = self.obj_dict[image_id]
obj_token_ids = [int(token) for token in obj_token_ids.split(" ")]
obj_sent_ids = [int(token) for token in obj_sent_ids.split(" ")]
obj_pos_ids = [int(token) for token in obj_pos_ids.split(" ")]
assert len(obj_token_ids) == len(obj_sent_ids) == len(obj_pos_ids), \
"[Must be true]len(obj_token_ids) == len(obj_sent_ids) == len(obj_pos_ids)"
if len(obj_token_ids) > self.max_obj_len:
obj_token_ids = obj_token_ids[:self.max_obj_len]
obj_sent_ids = obj_sent_ids[:self.max_obj_len]
obj_pos_ids = obj_pos_ids[:self.max_obj_len]
if image_loc.shape[0] > self.max_img_len:
image_loc = image_loc[:self.max_img_len]
image_embeddings = image_embeddings[:self.max_img_len]
if token_ids != '':
token_ids = [int(token) for token in token_ids.split(" ")]
sent_ids = [int(token) for token in sent_ids.split(" ")]
pos_ids = [int(token) for token in pos_ids.split(" ")]
seg_labels = [int(seg_label) for seg_label in seg_labels.split(" ")]
assert len(token_ids) == len(sent_ids) == len(pos_ids) == len(seg_labels), \
"[Must be true]len(token_ids) == len(sent_ids) == len(pos_ids) == len(seg_labels)"
if len(token_ids) > self.max_tgt_len:
token_ids = token_ids[:self.max_tgt_len - 1] + [self.sep_id]
sent_ids = sent_ids[:self.max_tgt_len]
pos_ids = pos_ids[:self.max_tgt_len]
seg_labels = seg_labels[:self.max_tgt_len - 1] + [-1]
Record = namedtuple(
'Record',
['image_loc', 'image_embeddings', 'number_box', 'image_id',
'obj_token_ids', 'obj_sent_ids', 'obj_pos_ids',
'token_ids', 'sent_ids', 'pos_ids', 'seg_labels'])
record = Record(
image_loc=image_loc,
image_embeddings=image_embeddings,
number_box=number_box + 1,
image_id=int(image_id),
obj_token_ids=obj_token_ids,
obj_sent_ids=obj_sent_ids,
obj_pos_ids=obj_pos_ids,
token_ids=token_ids,
sent_ids=sent_ids,
pos_ids=pos_ids,
seg_labels=seg_labels)
else:
Record = namedtuple(
'Record',
['image_loc', 'image_embeddings', 'number_box', 'image_id',
'obj_token_ids', 'obj_sent_ids', 'obj_pos_ids'])
record = Record(
image_loc=image_loc,
image_embeddings=image_embeddings,
number_box=number_box + 1,
image_id=int(image_id),
obj_token_ids=obj_token_ids,
obj_sent_ids=obj_sent_ids,
obj_pos_ids=obj_pos_ids)
return record
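    # A hedged numpy sketch (toy numbers) of the 5-d location feature built in
    # parse_line above: normalized [x1, y1, x2, y2] plus the box's relative area.
    #
    #   box, w, h = np.array([10., 20., 110., 220.]), 200., 400.
    #   area = (box[3] - box[1]) * (box[2] - box[0]) / (w * h)        # 0.25
    #   loc = np.array([box[0] / w, box[1] / h, box[2] / w, box[3] / h, area])
    #   # -> [0.05, 0.05, 0.55, 0.55, 0.25]; a global row [0, 0, 1, 1, 1] is
    #   # prepended for the whole-image feature.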
def load_obj_file(self, obj_file):
"""load image objects file"""
if not obj_file:
print("obj_file is None")
return None
_dict = {}
for line in open(obj_file):
line = line.strip('\r\n').split(';')
assert len(line) == 4, "the object file should only contain 4 fields!!!"
image_id, obj_token_ids, obj_sent_ids, obj_pos_ids = line
_dict[image_id] = [obj_token_ids, obj_sent_ids, obj_pos_ids]
print('obj_dict size is ', len(_dict))
return _dict
def read_file(self, file):
"""read file"""
if file.endswith('.gz'):
with gzip.open(file, "rt") as f:
for line in f:
parsed_line = self.parse_line(line)
if parsed_line is None:
continue
yield parsed_line
else:
with open(file, "r") as f:
for line in f:
parsed_line = self.parse_line(line)
if parsed_line is None:
continue
yield parsed_line
def shuffle_samples(self, sample_generator, buffer=1000):
"""shuffle samples"""
samples = []
try:
while True:
while len(samples) < buffer:
sample = next(sample_generator)
samples.append(sample)
np.random.shuffle(samples)
for sample in samples:
yield sample
samples = []
except StopIteration:
print("stopiteration: reach end of file")
if len(samples) == 0:
yield None
else:
np.random.shuffle(samples)
for sample in samples:
yield sample
def _prepare_batch_data(self, before_batch_records, sample_generator, batch_size, phase=None, do_decode=False,
place=None):
"""generate batch records"""
batch, index = before_batch_records[:], 0
for sample in sample_generator:
if sample is None:
continue
self.current_example = index
index += 1
to_append = len(batch) < batch_size
if to_append:
batch.append(sample)
else:
yield (True, self._pad_batch_records(batch, do_decode, place))
batch = [sample]
if batch:
if len(batch) == batch_size:
yield (True, self._pad_batch_records(batch, do_decode, place))
else:
                # fewer records than batch_size; hand them back to be carried over
yield (False, batch)
def data_generator(self,
filelist,
batch_size,
epoch,
dev_count=1,
shuffle=True,
phase=None,
do_decode=False,
place=None):
"""data generator"""
files = open(filelist).readlines()
self.total_file = len(files)
def wrapper():
"""wrapper"""
all_dev_batches = []
trainer_id = self.trainer_id
before_batch_records = []
for epoch_index in range(epoch):
self.current_file_index = 0
self.current_epoch = epoch_index
if phase == "train": # shuffle file list
np.random.shuffle(files)
for index, file_ in enumerate(files):
file_ = file_.strip()
self.current_file_index = index + 1
self.current_file = file_
sample_generator = self.read_file(file_)
if phase == "train": # shuffle buffered sample
sample_generator = self.shuffle_samples(sample_generator)
for enough_batch_flag, batch_data in self._prepare_batch_data(before_batch_records,
sample_generator, batch_size,
phase=phase, do_decode=do_decode,
place=place):
if enough_batch_flag:
if len(all_dev_batches) < dev_count:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == dev_count:
yield all_dev_batches[trainer_id]
all_dev_batches = []
else:
print("%d lines remains for file %s" % (len(batch_data), file_))
before_batch_records = batch_data[:]
if len(before_batch_records) != 0:
if phase == 'train':
print("remaining %d records not training, bug ignore it", len(before_batch_records))
else:
print("remaining %d records not val/test, yield", len(before_batch_records))
last_batch_data = self._pad_batch_records(before_batch_records, do_decode, place)
if len(all_dev_batches) < dev_count:
all_dev_batches.append(last_batch_data)
if len(all_dev_batches) == dev_count:
yield all_dev_batches[trainer_id]
all_dev_batches = []
if phase != "train":
if trainer_id < len(all_dev_batches):
yield all_dev_batches[trainer_id]
return wrapper
def _to_lodtensor(self, data, place, lod=None):
data_tensor = fluid.LoDTensor()
data_tensor.set(data, place)
if lod is not None:
data_tensor.set_lod(lod)
return data_tensor
def _pad_batch_records(self, batch_records, do_decode, place):
# visual image part
batch_image_loc = [record.image_loc for record in batch_records]
batch_image_embedding = [record.image_embeddings for record in batch_records]
batch_image_size = [record.number_box for record in batch_records]
image_embedding, image_mask = pad_feature_data(batch_image_embedding,
return_mask=True,
batch_image_size=batch_image_size)
image_loc = pad_feature_data(batch_image_loc)
batch_obj_token_ids = [record.obj_token_ids for record in batch_records]
batch_obj_sent_ids = [record.obj_sent_ids for record in batch_records]
batch_obj_pos_ids = [record.obj_pos_ids for record in batch_records]
padded_obj_token_id, obj_token_mask = pad_batch_data(
batch_obj_token_ids, pretraining_task='nlu', pad_idx=self.pad_id, return_input_mask=True)
padded_obj_sent_ids = pad_batch_data(
batch_obj_sent_ids, pretraining_task='nlu', pad_idx=self.pad_id)
padded_obj_pos_ids = pad_batch_data(
batch_obj_pos_ids, pretraining_task='nlu', pad_idx=self.pad_id)
batch_size = len(batch_image_embedding)
if do_decode:
batch_image_id = [record.image_id for record in batch_records]
image_id = np.array(batch_image_id, dtype='int32').reshape((-1, 1))
tgt_word = np.array([[self.cls_id]] * batch_size,
dtype="int64").reshape([-1, 1, 1])
            tgt_pos_id = np.full_like(tgt_word, 2, dtype="int64").reshape(
                [-1, 1, 1])  # position ids start from 2
init_score = np.zeros_like(tgt_word, dtype="float32").reshape([-1, 1])
            lods = [list(range(tgt_word.shape[0] + 1))] * 2  # LoD levels must be concrete lists in Python 3
init_score = self._to_lodtensor(init_score, place, lods)
tgt_word = self._to_lodtensor(tgt_word, place, lods)
tgt_pos_id = self._to_lodtensor(tgt_pos_id, place, lods)
init_idx = np.array(range(batch_size), dtype="int32")
# (batch_size, max_img_len+max_obj_len, 1)
input_mask = np.concatenate((image_mask, obj_token_mask), axis=1)
# (batch_size, 1, max_img_len+max_obj_len)
tgt_src_attn_bias = np.transpose(input_mask, (0, 2, 1)).astype("float32")
# (batch_size, max_img_len, max_img_len)
input_mask = np.matmul(input_mask, np.transpose(input_mask, (0, 2, 1)))
return_list = [image_embedding, image_loc, input_mask, image_id,
padded_obj_token_id, padded_obj_sent_ids, padded_obj_pos_ids,
tgt_word, tgt_pos_id, init_score, init_idx, tgt_src_attn_bias]
else:
batch_token_ids = [record.token_ids for record in batch_records]
batch_sent_ids = [record.sent_ids for record in batch_records]
batch_position_ids = [record.pos_ids for record in batch_records]
token_ids = pad_batch_data(batch_token_ids, pad_idx=self.pad_id)
sent_ids = pad_batch_data(batch_sent_ids, pad_idx=self.pad_id)
position_ids = pad_batch_data(batch_position_ids, pad_idx=self.pad_id)
max_len = token_ids.shape[1]
tgt_label = []
for i in range(len(batch_token_ids)):
tgt_idxs = range(1, len(batch_token_ids[i]))
tgt_label.extend(batch_token_ids[i][idx] for idx in tgt_idxs)
tgt_label = np.array(tgt_label).astype("int64").reshape([-1, 1])
tgt_pos = sum(list(map(lambda i: list(range(max_len * i,
max_len * i + len(batch_token_ids[i]) - 1)),
range(batch_size))), [])
tgt_pos = np.array(tgt_pos).reshape([-1, 1]).astype('int64')
# This is used to avoid attention on paddings.
token_mask_data = np.array([[1] * len(inst) + [0] *
(max_len - len(inst)) for inst in batch_token_ids])
token_mask = np.expand_dims(token_mask_data, axis=-1).astype("float32") # [batch_size, max_len ,1]
# This is used to avoid attention on paddings and subsequent words.
token_seq_mask_data = np.zeros((batch_size, max_len, max_len))
for index, mask_data in enumerate(token_seq_mask_data):
start = 0
end = len(batch_token_ids[index])
                # Fill the target-to-target block with a lower triangular (causal) matrix
b = np.tril(np.ones([end - start, end - start]), 0)
mask_data[start:end, start:end] = b
token_seq_mask = token_seq_mask_data.astype("float32")
# (batch_size, max_img_len+max_obj_len+max_seq_len, 1)
input_mask = np.concatenate((image_mask, obj_token_mask, token_mask), axis=1)
# (batch_size, max_img_len+max_obj_len+max_seq_len, max_img_len+max_obj_len+max_seq_len)
input_mask = np.matmul(input_mask, np.transpose(input_mask, (0, 2, 1)))
input_mask[:, len(image_mask[0]) + len(obj_token_mask[0]):, len(image_mask[0]) + len(obj_token_mask[0]):] \
= token_seq_mask
input_mask[:, :len(image_mask[0]) + len(obj_token_mask[0]),
len(image_mask[0]) + len(obj_token_mask[0]):] = 0
return_list = [image_embedding, image_loc, input_mask, image_mask, token_mask,
padded_obj_token_id, padded_obj_sent_ids, padded_obj_pos_ids,
token_ids, sent_ids, position_ids, tgt_label, tgt_pos]
return return_list
if __name__ == '__main__':
pass
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""data reader for text classification tasks"""
import os
import csv
import numpy as np
import copy
from collections import namedtuple
from model import tokenization
from reader.batching import pad_batch_data
class RegressionReader(object):
"""RegressionReader"""
def __init__(self, tokenizer, args):
self.tokenizer = tokenizer
self.pad_id = tokenizer.pad_token_id
self.cls_id = tokenizer.cls_token_id
self.sep_id = tokenizer.sep_token_id
self.mask_id = tokenizer.mask_token_id
self.max_seq_len = args.max_seq_len
self.in_tokens = args.in_tokens
self.random_seed = 0
self.global_rng = np.random.RandomState(self.random_seed)
self.trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
self.trainer_nums = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
self.current_example = 0
self.current_epoch = 0
self.num_examples = 0
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_example, self.current_epoch
def _read_tsv(self, input_file, quotechar=None):
"""Reads a tab separated value file."""
with open(input_file, "r") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
headers = next(reader)
text_indices = [
index for index, h in enumerate(headers) if h != "label"
]
Example = namedtuple('Example', headers)
examples = []
for line in reader:
example = Example(*line)
examples.append(example)
return examples
def _pad_batch_records(self, batch_records):
batch_token_ids = [record.token_ids for record in batch_records]
batch_text_type_ids = [record.text_type_ids for record in batch_records]
batch_position_ids = [record.position_ids for record in batch_records]
batch_labels = [record.label_id for record in batch_records]
batch_labels = np.array(batch_labels).astype('float32').reshape([-1, 1])
if batch_records[0].qid:
batch_qids = [record.qid for record in batch_records]
batch_qids = np.array(batch_qids).astype('int64').reshape([-1, 1])
else:
batch_qids = np.array([]).astype('int64').reshape([-1, 1])
# padding
padded_token_ids, input_mask = pad_batch_data(
batch_token_ids, pretraining_task='nlu', pad_idx=self.pad_id, return_input_mask=True)
padded_text_type_ids = pad_batch_data(
batch_text_type_ids, pretraining_task='nlu', pad_idx=self.pad_id)
padded_position_ids = pad_batch_data(
batch_position_ids, pretraining_task='nlu', pad_idx=self.pad_id)
input_mask = np.matmul(input_mask, np.transpose(input_mask, (0, 2, 1)))
return_list = [
padded_token_ids, padded_text_type_ids, padded_position_ids,
input_mask, batch_labels, batch_qids
]
return return_list
def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
def _convert_example_to_record(self, example, max_seq_length, tokenizer):
"""Converts a single `Example` into a single `Record`."""
text_a = tokenization.convert_to_unicode(example.text_a)
tokens_a = tokenizer.tokenize(text_a)
tokens_b = None
if "text_b" in example._fields:
text_b = tokenization.convert_to_unicode(example.text_b)
tokens_b = tokenizer.tokenize(text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[0:(max_seq_length - 2)]
tokens = []
text_type_ids = []
tokens.append("[CLS]")
text_type_ids.append(0)
for token in tokens_a:
tokens.append(token)
text_type_ids.append(0)
tokens.append("[SEP]")
text_type_ids.append(0)
if tokens_b:
for token in tokens_b:
tokens.append(token)
text_type_ids.append(1)
tokens.append("[SEP]")
text_type_ids.append(1)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
position_ids = list(range(2, len(token_ids) + 2))
label_id = example.label
Record = namedtuple(
'Record',
['token_ids', 'text_type_ids', 'position_ids', 'label_id', 'qid'])
qid = None
if "qid" in example._fields:
qid = example.qid
record = Record(
token_ids=token_ids,
text_type_ids=text_type_ids,
position_ids=position_ids,
label_id=label_id,
qid=qid)
return record
def _prepare_batch_data(self, examples, batch_size, phase=None):
"""generate batch records"""
batch_records, max_len = [], 0
for index, example in enumerate(examples):
if phase == "train":
self.current_example = index
record = self._convert_example_to_record(example, self.max_seq_len,
self.tokenizer)
max_len = max(max_len, len(record.token_ids))
if self.in_tokens:
to_append = (len(batch_records) + 1) * max_len <= batch_size
else:
to_append = len(batch_records) < batch_size
if to_append:
batch_records.append(record)
else:
yield self._pad_batch_records(batch_records)
batch_records, max_len = [record], len(record.token_ids)
if batch_records:
yield self._pad_batch_records(batch_records)
def get_num_examples(self, input_file):
"""get_num_examples"""
examples = self._read_tsv(input_file)
return len(examples)
def data_generator(self,
input_file,
batch_size,
epoch,
dev_count=1,
shuffle=True,
phase=None):
"""data_generator"""
examples = self._read_tsv(input_file)
def wrapper():
"""wrapper"""
all_dev_batches = []
trainer_id = 0
for epoch_index in range(epoch):
if phase == "train":
self.current_example = 0
self.current_epoch = epoch_index
self.random_seed = epoch_index
self.global_rng = np.random.RandomState(self.random_seed)
trainer_id = self.trainer_id
else:
trainer_id = 0
                    assert dev_count == 1, "only supports 1 GPU during prediction"
current_examples = copy.deepcopy(examples)
if shuffle:
self.global_rng.shuffle(current_examples)
for batch_data in self._prepare_batch_data(
current_examples, batch_size, phase=phase):
if len(all_dev_batches) < dev_count:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == dev_count:
yield all_dev_batches[trainer_id]
all_dev_batches = []
if phase != "train" and self.trainer_id < len(all_dev_batches):
yield all_dev_batches[self.trainer_id]
return wrapper
if __name__ == '__main__':
pass
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""data reader for image-text retrieval tasks"""
import os
import pickle
import base64
import codecs
import numpy as np
from collections import namedtuple
from reader.batching import pad_feature_data, pad_batch_data
class RetrievalTrainReader(object):
"""RetrievalTrainReader"""
def __init__(self, tokenizer, args, image_feature_dir, image_caption):
self.epoch = args.epoch
self.batch_size = args.batch_size
self.tokenizer = tokenizer
self.pad_id = tokenizer.pad_token_id
self.cls_id = tokenizer.cls_token_id
self.sep_id = tokenizer.sep_token_id
self.mask_id = tokenizer.mask_token_id
self.max_seq_len = args.max_seq_len
self.trainer_id = int(os.getenv("PADDLE_TRAINER_ID", 0))
self.trainers_num = int(os.getenv("PADDLE_TRAINERS_NUM", 1))
self.current_example = 0
self.current_epoch = 0
self._load_image_feature(image_feature_dir)
self._load_caption_dict(image_caption)
self._load_img_id(args.img_id_path)
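        # Hedged note: the schema below presumably pairs each positive
        # image-caption pair with 10 negatives that swap in another image
        # ('ei') and 10 that swap in another caption ('ec'), so samples_num
        # == 20 negatives plus the positive gives self.outs == 21.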
if args.samples_num == 20:
self._negative_schema = ['ei'] * 10 + ['ec'] * 10
self.outs = len(self._negative_schema) + 1
else:
            raise ValueError("samples_num %d is not supported" % args.samples_num)
def _load_caption_dict(self, image_caption):
        '''Parse the caption file derived from Karpathy's dataset_flickr30k.json split.'''
self._caption_ids_dict = {}
self._image_sent_map = {}
with codecs.open(image_caption, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip().split(";")
token_ids, sent_ids, pos_ids, image_name, sent_id = line
token_ids = [int(token) for token in token_ids.split(" ")]
sent_ids = [int(token) for token in sent_ids.split(" ")]
pos_ids = [int(token) for token in pos_ids.split(" ")]
if len(token_ids) > self.max_seq_len:
token_ids = [token_ids[0]] + token_ids[1:self.max_seq_len - 1] + [token_ids[-1]]
sent_ids = sent_ids[:self.max_seq_len]
pos_ids = pos_ids[:self.max_seq_len]
assert len(token_ids) <= self.max_seq_len, \
"token length must be less than max_seq_len"
assert len(token_ids) == len(sent_ids) == len(pos_ids), \
"[Must be true]len(token_ids) == len(sent_ids) == len(pos_ids)"
self._caption_ids_dict[int(sent_id)] = \
[token_ids, sent_ids, pos_ids, int(image_name)]
self._image_sent_map.setdefault(int(image_name), [])
self._image_sent_map[int(image_name)].append(int(sent_id))
self._train_caption_ids = list(self._caption_ids_dict.keys())
self._train_image_list = list(self._image_sent_map.keys())
def _parse_image_line(self, line):
def decode_feature(base64_str, size):
"""decode_feature"""
fea_base64 = base64.b64decode(base64_str)
fea_decode = np.frombuffer(fea_base64, dtype=np.float32)
shape = size, int(fea_decode.shape[0] / size)
features = np.resize(fea_decode, shape)
return features
items = line.strip('\r\n').split('\t')
assert len(items) == 7
img_filename, image_w, image_h, number_box, boxes, image_embeddings, probs = items
number_box = int(number_box)
boxes = decode_feature(boxes, number_box)
probs = decode_feature(probs, number_box)
image_embeddings = decode_feature(image_embeddings, number_box)
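# prepend a global visual token: the mean-pooled region embedding serves as an
# image-level [CLS]; a whole-image location [0, 0, 1, 1, 1] is prepended below.
# Each region gets a 5-d location feature [x1/W, y1/H, x2/W, y2/H, relative_area]
# (assuming boxes are stored as [x1, y1, x2, y2])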
image_embeddings_cls = np.mean(image_embeddings, axis=0, keepdims=True)
image_embeddings = np.concatenate([image_embeddings_cls, image_embeddings], 0)
image_location = np.zeros((boxes.shape[0], 5), dtype=np.float32)
image_location[:, :4] = boxes
image_location[:, 4] = (image_location[:, 3] - image_location[:, 1]) * (
image_location[:, 2] - image_location[:, 0]) / (float(image_w) * float(image_h))
image_location[:, 0] = image_location[:, 0] / float(image_w)
image_location[:, 1] = image_location[:, 1] / float(image_h)
image_location[:, 2] = image_location[:, 2] / float(image_w)
image_location[:, 3] = image_location[:, 3] / float(image_h)
g_location = np.array([0, 0, 1, 1, 1])
image_location = np.concatenate([np.expand_dims(g_location, axis=0), image_location], axis=0)
image_loc = image_location
cls_prob = np.mean(probs, axis=0, keepdims=True)
probs = np.concatenate([cls_prob, probs], 0)
output = namedtuple('output', ["img_filename", "number_box", "image_loc", "probs", "image_embeddings"])
return output(img_filename=img_filename,
number_box=number_box + 1,
image_loc=image_loc,
probs=probs,
image_embeddings=image_embeddings)
def _load_image_feature(self, data_dir):
self._image_feature_dict = {}
for file in os.listdir(data_dir):
file = os.path.join(data_dir, file)
with codecs.open(file, 'r', encoding='utf-8') as fr:
for line in fr.readlines():
items = self._parse_image_line(line)
self._image_feature_dict[int(items[0])] = items[1:]
def _load_img_id(self, img_id_path):
self.imgname2id = {}
self.id2imgname = {}
with codecs.open(img_id_path, 'r', encoding='utf-8') as f:
for line in f.readlines():
items = line.strip('\r\n').split('\t')
self.imgname2id[int(items[0])] = int(items[1])
self.id2imgname[int(items[1])] = int(items[0])
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_example, self.current_epoch
def _prepare_batch_data(self, insts):
"""generate batch and pad"""
batch_src_ids = [inst["token_ids"][out] for inst in insts for out in range(self.outs)]
batch_sent_ids = [inst["sent_ids"][out] for inst in insts for out in range(self.outs)]
batch_pos_ids = [inst["pos_ids"][out] for inst in insts for out in range(self.outs)]
batch_image_loc = [inst["image_loc"][out] for inst in insts for out in range(self.outs)]
batch_image_embedding = [inst["image_embeddings"][out] for inst in insts for out in range(self.outs)]
batch_image_size = [inst["number_box"][out] for inst in insts for out in range(self.outs)]
batch_size = int(len(batch_src_ids) / self.outs)
label = np.array([[0]] * batch_size, dtype="int64")
ids = np.array([[0, 0]] * batch_size, dtype="int64")
padded_token_ids, token_mask = pad_batch_data(
batch_src_ids, pretraining_task='nlu', pad_idx=self.pad_id, return_input_mask=True)
padded_sent_ids = pad_batch_data(
batch_sent_ids, pretraining_task='nlu', pad_idx=self.pad_id)
padded_pos_ids = pad_batch_data(
batch_pos_ids, pretraining_task='nlu', pad_idx=self.pad_id)
padded_image_embedding, image_mask = pad_feature_data(batch_image_embedding,
return_mask=True,
batch_image_size=batch_image_size)
padded_image_loc = pad_feature_data(batch_image_loc)
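# joint attention mask: concatenating the image-region mask and the token mask
# gives a [batch, img_len + txt_len, 1] validity vector; its outer product yields
# the full [batch, L, L] mask letting every valid image/text position attend to
# every other valid position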
input_mask = np.concatenate((image_mask, token_mask), axis=1)
input_mask = np.matmul(input_mask, np.transpose(input_mask, (0, 2, 1)))
return_list = [
padded_token_ids, padded_pos_ids, padded_sent_ids, input_mask,
padded_image_embedding, padded_image_loc, label, ids
]
return return_list
def get_num_examples(self):
"""get_num_examples"""
cap_len = len(self._train_caption_ids)
img_len = len(self._train_image_list)
total_samples = cap_len
return total_samples, cap_len, img_len
def process_vl(self, sent_id):
"""trans the orgin tokens to the wanted tokens"""
captions_pos = self._caption_ids_dict[sent_id]
image_name = captions_pos[-1]
image_id = self.imgname2id[image_name]
number_box, image_loc, _, image_embeddings = self._image_feature_dict[image_name]
images = [[image_embeddings, number_box, image_loc]]
captions = [captions_pos]
for item in self._negative_schema:
if item[0] == "e":
while True:
image_name_neg = self.neg_rng.choice(self._train_image_list)
if image_name_neg != image_name:
break
else:
raise ValueError("invalid negative schema: %s" % item)
if item[1] == "i":
number_box_neg, image_loc_neg, _, image_embeddings_neg = self._image_feature_dict[image_name_neg]
captions.append(self._caption_ids_dict[sent_id])
images.append([image_embeddings_neg, number_box_neg, image_loc_neg])
elif item[1] == "c":
sent_id_neg = self.neg_rng.choice(self._image_sent_map[image_name_neg])
captions.append(self._caption_ids_dict[sent_id_neg])
images.append([image_embeddings, number_box, image_loc])
else:
raise ValueError("invalid negative schema: %s" % item)
token_ids_list, sent_ids_list, pos_ids_list, _ = zip(*captions)
image_embeddings_list, number_box_list, image_loc_list = zip(*images)
sample_json = {
"token_ids": token_ids_list,
"sent_ids": sent_ids_list,
"pos_ids": pos_ids_list,
"image_loc": image_loc_list,
"image_embeddings": image_embeddings_list,
"number_box": number_box_list,
}
return sample_json
def read_caption_id(self):
"""read_caption_id"""
self.global_rng.shuffle(self._train_caption_ids)
for index, item in enumerate(self._train_caption_ids):
if index % self.trainers_num != self.trainer_id:
continue
yield self.process_vl(item)
def shuffle_samples(self, sample_generator, buffer=128):
"""shuffle_samples"""
samples = []
try:
while True:
while len(samples) < buffer:
sample = next(sample_generator)
samples.append(sample)
for sample in samples:
yield sample
samples = []
except StopIteration:
# flush any samples left in the buffer; yielding None here would crash the
# batching loop downstream
for sample in samples:
yield sample
def data_generator(self):
"""data_generator"""
def wrapper():
"""wrapper"""
for epoch_index in range(self.epoch):
self.global_rng = np.random.RandomState(epoch_index)
self.neg_rng = np.random.RandomState(epoch_index)
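# reseed both RNGs with the epoch index so shuffling and negative sampling are
# reproducible and identical across trainers; each trainer then takes its own
# slice of the shuffled stream inside read_caption_id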
self.current_epoch = epoch_index
batch_records = []
self.current_example = 0
for sample in self.shuffle_samples(self.read_caption_id()):
self.current_example = self.current_example + 1
if len(batch_records) < self.batch_size:
batch_records.append(sample)
if len(batch_records) == self.batch_size:
yield self._prepare_batch_data(batch_records)
batch_records = []
if batch_records:
yield self._prepare_batch_data(batch_records)
return wrapper
class RetrievalTestReader(object):
"""RetrievalTrainReader"""
def __init__(self, tokenizer, args, image_feature_dir, image_caption):
self.batch_size = args.test_batch_size
self.tokenizer = tokenizer
self.pad_id = tokenizer.pad_token_id
self.cls_id = tokenizer.cls_token_id
self.sep_id = tokenizer.sep_token_id
self.mask_id = tokenizer.mask_token_id
self.max_seq_len = args.max_seq_len
self.trainer_id = int(os.getenv("PADDLE_TRAINER_ID", 0))
self.trainers_num = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
self.current_example = 0
self._load_image_feature(image_feature_dir)
self._load_caption_dict(image_caption)
def _load_caption_dict(self, image_caption):
'''parse the caption file derived from Karpathy's dataset_flickr30k.json split'''
self._caption_ids_dict = {}
self._image_sent_map = {}
with codecs.open(image_caption, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip().split(";")
token_ids, sent_ids, pos_ids, image_name, sent_id = line
token_ids = [int(token) for token in token_ids.split(" ")]
sent_ids = [int(token) for token in sent_ids.split(" ")]
pos_ids = [int(token) for token in pos_ids.split(" ")]
if len(token_ids) > self.max_seq_len:
token_ids = [token_ids[0]] + token_ids[1:self.max_seq_len - 1] + [token_ids[-1]]
sent_ids = sent_ids[:self.max_seq_len]
pos_ids = pos_ids[:self.max_seq_len]
assert len(token_ids) <= self.max_seq_len, \
"token length must not exceed max_seq_len"
assert len(token_ids) == len(sent_ids) == len(pos_ids), \
"[Must be true]len(token_ids) == len(sent_ids) == len(pos_ids)"
self._caption_ids_dict[int(sent_id)] = \
[token_ids, sent_ids, pos_ids, int(image_name)]
self._image_sent_map.setdefault(int(image_name), [])
self._image_sent_map[int(image_name)].append(int(sent_id))
self._train_caption_ids = list(self._caption_ids_dict.keys())
self._train_image_list = list(self._image_sent_map.keys())
def _parse_image_line(self, line):
def decode_feature(base64_str, size):
"""decode_feature"""
fea_base64 = base64.b64decode(base64_str)
fea_decode = np.frombuffer(fea_base64, dtype=np.float32)
shape = size, int(fea_decode.shape[0] / size)
features = np.resize(fea_decode, shape)
return features
items = line.strip('\r\n').split('\t')
assert len(items) == 7
img_filename, image_w, image_h, number_box, boxes, image_embeddings, probs = items  # same field order as RetrievalTrainReader._parse_image_line
number_box = int(number_box)
boxes = decode_feature(boxes, number_box)
probs = decode_feature(probs, number_box)
image_embeddings = decode_feature(image_embeddings, number_box)
image_embeddings_cls = np.mean(image_embeddings, axis=0, keepdims=True)
image_embeddings = np.concatenate([image_embeddings_cls, image_embeddings], 0)
image_location = np.zeros((boxes.shape[0], 5), dtype=np.float32)
image_location[:, :4] = boxes
image_location[:, 4] = (image_location[:, 3] - image_location[:, 1]) * (
image_location[:, 2] - image_location[:, 0]) / (float(image_w) * float(image_h))
image_location[:, 0] = image_location[:, 0] / float(image_w)
image_location[:, 1] = image_location[:, 1] / float(image_h)
image_location[:, 2] = image_location[:, 2] / float(image_w)
image_location[:, 3] = image_location[:, 3] / float(image_h)
g_location = np.array([0, 0, 1, 1, 1])
image_location = np.concatenate([np.expand_dims(g_location, axis=0), image_location], axis=0)
image_loc = image_location
cls_prob = np.mean(probs, axis=0, keepdims=True)
probs = np.concatenate([cls_prob, probs], 0)
output = namedtuple('output', ["img_filename", "number_box", "image_loc", "probs", "image_embeddings"])
return output(img_filename=img_filename,
number_box=number_box + 1,
image_loc=image_loc,
probs=probs,
image_embeddings=image_embeddings)
def _load_image_feature(self, data_dir):
self._image_feature_dict = {}
for file in os.listdir(data_dir):
file = os.path.join(data_dir, file)
with codecs.open(file, 'r', encoding='utf-8') as fr:
for line in fr.readlines():
items = self._parse_image_line(line)
self._image_feature_dict[int(items[0])] = items[1:]
def _prepare_batch_data(self, insts):
"""generate batch and pad"""
batch_src_ids = [inst["token_ids"] for inst in insts]
batch_sent_ids = [inst["sent_ids"] for inst in insts]
batch_pos_ids = [inst["pos_ids"] for inst in insts]
batch_image_loc = [inst["image_loc"] for inst in insts]
batch_image_embedding = [inst["image_embeddings"] for inst in insts]
batch_image_size = [inst["number_box"] for inst in insts]
batch_ids = [inst["cur_ids"] for inst in insts]
batch_labels = [[0]] * len(insts)
padded_token_ids, token_mask = pad_batch_data(
batch_src_ids, pretraining_task='nlu', pad_idx=self.pad_id, return_input_mask=True)
padded_sent_ids = pad_batch_data(
batch_sent_ids, pretraining_task='nlu', pad_idx=self.pad_id)
padded_pos_ids = pad_batch_data(
batch_pos_ids, pretraining_task='nlu', pad_idx=self.pad_id)
padded_image_embedding, image_mask = pad_feature_data(batch_image_embedding,
return_mask=True,
batch_image_size=batch_image_size)
padded_image_loc = pad_feature_data(batch_image_loc)
ids = np.array(batch_ids, dtype="int64")
label = np.array(batch_labels, dtype="int64")
input_mask = np.concatenate((image_mask, token_mask), axis=1)
input_mask = np.matmul(input_mask, np.transpose(input_mask, (0, 2, 1)))
return_list = [
padded_token_ids, padded_pos_ids, padded_sent_ids, input_mask,
padded_image_embedding, padded_image_loc, label, ids
]
return return_list
def get_num_examples(self):
"""get_num_examples"""
cap_len = len(self._train_caption_ids)
img_len = len(self._train_image_list)
total_samples = cap_len
return total_samples, cap_len, img_len
def process_vl(self, sent_id):
"""trans the orgin tokens to the wanted tokens"""
token_ids, sent_ids, pos_ids, image_name = self._caption_ids_dict[sent_id]
for cur_img_name in self._train_image_list:
number_box, image_loc, _, image_embeddings = self._image_feature_dict[cur_img_name]
sample_json = {
"token_ids": token_ids,
"sent_ids": sent_ids,
"pos_ids": pos_ids,
"image_loc": image_loc,
"image_embeddings": image_embeddings,
"number_box": number_box,
"cur_ids": [cur_img_name, sent_id],
}
yield sample_json
def read_caption_id(self):
"""read_caption_id"""
for item in self._train_caption_ids:
sent_id = item
token_ids, sent_ids, pos_ids, image_name = self._caption_ids_dict[sent_id]
for cur_img_name in self._train_image_list:
number_box, image_loc, _, image_embeddings = self._image_feature_dict[cur_img_name]
sample_json = {
"token_ids": token_ids,
"sent_ids": sent_ids,
"pos_ids": pos_ids,
"image_loc": image_loc,
"image_embeddings": image_embeddings,
"number_box": number_box,
"cur_ids": [cur_img_name, sent_id],
}
yield sample_json
def shuffle_samples(self, sample_generator, buffer=128):
"""shuffle_samples"""
samples = []
try:
while True:
while len(samples) < buffer:
sample = next(sample_generator)
samples.append(sample)
for sample in samples:
yield sample
samples = []
except StopIteration:
# flush any samples left in the buffer; yielding None here would crash the
# batching loop downstream
for sample in samples:
yield sample
def data_generator(self):
"""data_generator"""
def wrapper():
""""wrapper"""
def batch_reader():
"""batch_reader"""
batch_records = []
self.current_example = 0
for sample in self.shuffle_samples(self.read_caption_id()):
self.current_example = self.current_example + 1
if len(batch_records) < self.batch_size:
batch_records.append(sample)
if len(batch_records) == self.batch_size:
yield self._prepare_batch_data(batch_records)
batch_records = []
if batch_records:
yield self._prepare_batch_data(batch_records)
all_dev_batches = []
for batch_data in batch_reader():
if len(all_dev_batches) < self.trainers_num:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == self.trainers_num:
yield all_dev_batches[self.trainer_id]
all_dev_batches = []
if self.trainer_id < len(all_dev_batches):
yield all_dev_batches[self.trainer_id]
return wrapper
if __name__ == '__main__':
pass
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""data reader for seq2seq generation tasks"""
import os
import csv
csv.field_size_limit(1024 * 1024)
import numpy as np
from collections import namedtuple
import model.tokenization as tokenization
from reader.batching import pad_batch_data, gen_seq2seq_mask
import paddle.fluid as fluid
class Seq2SeqReader(object):
"""seq2seq reader"""
def __init__(self, tokenizer, args):
self.tokenizer = tokenizer
self.pad_id = tokenizer.pad_token_id
self.cls_id = tokenizer.cls_token_id
self.sep_id = tokenizer.sep_token_id
self.mask_id = tokenizer.mask_token_id
self.tgt_type_id = args.tgt_type_id
self.max_src_len = args.max_src_len
self.max_tgt_len = args.max_tgt_len
self.max_out_len = args.max_out_len
self.tokenized_input = args.tokenized_input
self.in_tokens = args.in_tokens
self.continuous_position = args.continuous_position
self.is_dialogue_task = (args.task_type == "dialog")
self.turn_type_size = args.turn_type_size
# random_seed must be set for data slicing when using multi-gpu
if args.random_seed:
np.random.seed(args.random_seed)
else:
np.random.seed(0)
self.trainer_id = 0
self.trainer_nums = 1
if os.getenv("PADDLE_TRAINER_ID"):
self.trainer_id = int(os.getenv("PADDLE_TRAINER_ID"))
if os.getenv("PADDLE_TRAINERS_NUM"):
self.trainer_nums = int(os.getenv("PADDLE_TRAINERS_NUM"))
self.current_example = 0
self.current_epoch = 0
self.num_examples = 0
self.features = {}
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_example, self.current_epoch
def get_num_examples(self, input_file):
"""get total number of examples"""
examples = self._read_tsv(input_file)
return len(examples)
def _read_tsv_with_buff(self, input_file, quotechar=None, buff_size=1000, shuffle=False):
"""Reads a tab separated value file."""
data_id = 0
with open(input_file, "r") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
headers = next(reader)
src_indices = [
index for index, h in enumerate(headers) if h != "tgt" and h != "knowledge"
]
assert len(src_indices) <= self.tgt_type_id, "len(src_indices) > self.tgt_type_id"
assert len(src_indices) > 0, "len(src_indices) <= 0"
Example = namedtuple('Example', ["src", "tgt", "knowledge", "data_id"])
examples = []
for line in reader:
src = []
tgt = None
knowledge = None
assert len(line) == len(headers), "len(line) != len(headers)"
for index, text in enumerate(line):
if index in src_indices:
src.append(text)
elif headers[index] == "tgt":
tgt = text
else:
knowledge = text
examples.append(Example(src=src, tgt=tgt, knowledge=knowledge, data_id=data_id))
data_id += 1
if len(examples) >= buff_size:
if shuffle:
np.random.shuffle(examples)
for e in examples:
yield e
examples = []
if shuffle:
np.random.shuffle(examples)
for e in examples:
yield e
def _read_tsv(self, input_file, quotechar=None):
"""Reads a tab separated value file."""
data_id = 0
with open(input_file, "r") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
headers = next(reader)
src_indices = [
index for index, h in enumerate(headers) if h != "tgt" and h != "knowledge"
]
assert len(src_indices) <= self.tgt_type_id, "len(src_indices) > self.tgt_type_id"
assert len(src_indices) > 0, "len(src_indices) <= 0"
Example = namedtuple('Example', ["src", "tgt", "knowledge", "data_id"])
examples = []
for line in reader:
src = []
tgt = None
knowledge = None
assert len(line) == len(headers), "len(line) != len(headers)"
for index, text in enumerate(line):
if index in src_indices:
src.append(text)
elif headers[index] == "tgt":
tgt = text
else:
knowledge = text
examples.append(Example(src=src, tgt=tgt, knowledge=knowledge, data_id=data_id))
data_id += 1
return examples
def _trunc_token_ids(self, token_ids, max_len, trunc_type="right", keep_sep=True):
"""turncate token_ids to max_len"""
if len(token_ids) > max_len:
if trunc_type == "left":
token_ids = token_ids[-max_len:]
elif keep_sep:
token_ids = token_ids[:max_len - 1] + [self.sep_id]
else:
token_ids = token_ids[:max_len]
return token_ids
def _text_to_ids(self, text, tokenizer=None, max_len=None, trunc_type="right", keep_sep=True):
"""convert text to vocab ids"""
max_len = max_len or self.max_src_len - 1
tokenizer = tokenizer or self.tokenizer
text = tokenization.convert_to_unicode(text)
if self.tokenized_input:
tokens = text.split(" ")
else:
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens) + [self.sep_id]
token_ids = self._trunc_token_ids(token_ids, max_len, trunc_type, keep_sep)
pos_ids = range(3, len(token_ids) + 3)  # text positions start from 3; position 2 is reserved for the leading [CLS]
return token_ids, pos_ids
def _convert_dialogue_example_to_record(self, example, do_decode=False):
"""convert dialogue example"""
turn_split = " [SEP] "
srcs = example.src[0].split(turn_split)
if len(srcs) > self.turn_type_size - 1:
srcs = srcs[len(srcs) - (self.turn_type_size - 1):]
cur_role_type = len(srcs) % 2
cur_turn_type = len(srcs)
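# role ids alternate between the two speakers per turn (the response appended
# later uses role 0, knowledge text uses role 2); turn ids count down from
# len(srcs) at the oldest turn to 1 at the most recent utterance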
token_ids = [self.cls_id]
role_type_ids = [cur_role_type]
turn_type_ids = [cur_turn_type]
position_ids = [2]  # positions start from 2
if example.knowledge:
cur_token_ids, cur_pos_ids = self._text_to_ids(example.knowledge)
token_ids += cur_token_ids
position_ids += cur_pos_ids
role_type_ids += [2] * len(cur_token_ids)
turn_type_ids += [0] * len(cur_token_ids)
for text in srcs:
cur_token_ids, cur_pos_ids = self._text_to_ids(text)
token_ids += cur_token_ids
position_ids += cur_pos_ids
role_type_ids += [cur_role_type] * len(cur_token_ids)
turn_type_ids += [cur_turn_type] * len(cur_token_ids)
cur_turn_type -= 1
cur_role_type = (cur_role_type + 1) % 2
if self.continuous_position and len(token_ids) > self.max_src_len:
token_ids = token_ids[-self.max_src_len:]
role_type_ids = role_type_ids[-self.max_src_len:]
turn_type_ids = turn_type_ids[-self.max_src_len:]
tgt_start_idx = len(token_ids)
if not do_decode:
assert example.tgt, "example.tgt is None"
token_ids.append(self.cls_id)
role_type_ids.append(0)
turn_type_ids.append(0)
position_ids.append(2)  # positions start from 2
tgt_token_ids, tgt_pos_ids = self._text_to_ids(example.tgt,
max_len=self.max_tgt_len - 1,
keep_sep=False)
if tgt_token_ids[-1] == self.sep_id:
tgt_token_ids[-1] = self.mask_id # we use [MASK] token as the end token
token_ids += tgt_token_ids
position_ids += tgt_pos_ids
role_type_ids += [0] * len(tgt_token_ids)
turn_type_ids += [0] * len(tgt_token_ids)
if self.continuous_position:
position_ids = range(2, len(token_ids) + 2)  # positions start from 2
assert len(token_ids) == len(position_ids) == len(role_type_ids) == len(turn_type_ids), \
"not len(token_ids) == len(position_ids) == len(role_type_ids) == len(turn_type_ids)"
Record = namedtuple(
'Record',
['token_ids', 'position_ids', 'role_ids', 'turn_ids', 'tgt_start_idx', 'data_id'])
record = Record(
token_ids=token_ids,
position_ids=position_ids,
role_ids=role_type_ids,
turn_ids=turn_type_ids,
tgt_start_idx=tgt_start_idx,
data_id=example.data_id)
return record
def _convert_example_to_record(self, example, do_decode=False):
"""Converts a single `Example` into a single `Record`."""
if self.is_dialogue_task:
return self._convert_dialogue_example_to_record(example, do_decode=do_decode)
token_ids = [self.cls_id]
text_type_ids = [0]
position_ids = [2]  # positions start from 2
text_type = 0
for text in example.src:
cur_token_ids, cur_pos_ids = self._text_to_ids(text)
token_ids += cur_token_ids
position_ids += cur_pos_ids
text_type_ids += [text_type] * len(cur_token_ids)
text_type += 1
if self.continuous_position and len(token_ids) > self.max_src_len:
token_ids = self._trunc_token_ids(token_ids, self.max_src_len)
text_type_ids = text_type_ids[:self.max_src_len]
tgt_start_idx = len(token_ids)
if not do_decode:
assert example.tgt, "example.tgt is None"
token_ids.append(self.cls_id)
text_type_ids.append(self.tgt_type_id)
position_ids.append(2)  # positions start from 2
tgt_token_ids, tgt_pos_ids = self._text_to_ids(example.tgt,
max_len=self.max_tgt_len - 1,
keep_sep=False)
if tgt_token_ids[-1] == self.sep_id:
tgt_token_ids[-1] = self.mask_id # we use [MASK] token as the end token
token_ids += tgt_token_ids
position_ids += tgt_pos_ids
text_type_ids += [self.tgt_type_id] * len(tgt_token_ids)
if self.continuous_position:
position_ids = range(2, len(token_ids) + 2)  # positions start from 2
assert len(token_ids) == len(position_ids) == len(text_type_ids), \
"not len(token_ids) == len(position_ids) == len(text_type_ids)"
Record = namedtuple(
'Record',
['token_ids', 'text_type_ids', 'position_ids', 'tgt_start_idx', 'data_id'])
record = Record(
token_ids=token_ids,
text_type_ids=text_type_ids,
position_ids=position_ids,
tgt_start_idx=tgt_start_idx,
data_id=example.data_id)
return record
def _prepare_batch_data(self, examples, batch_size, phase=None, do_decode=False, place=None):
"""generate batch records"""
batch_records, max_len = [], 0
for index, example in enumerate(examples):
if phase == "train":
self.current_example = index
record = self._convert_example_to_record(example, do_decode)
max_len = max(max_len, len(record.token_ids))
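# with in_tokens batching, batch_size is a token budget: append a record only
# while (num_records + 1) * current_max_len still fits; otherwise batch_size
# simply counts examples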
if self.in_tokens:
to_append = (len(batch_records) + 1) * max_len <= batch_size
else:
to_append = len(batch_records) < batch_size
if to_append:
batch_records.append(record)
else:
yield self._pad_batch_records(batch_records, do_decode, place)
batch_records, max_len = [record], len(record.token_ids)
if batch_records:
yield self._pad_batch_records(batch_records, do_decode, place)
def get_features(self, phase):
"""obtain data features"""
return self.features.get(phase, None)
def data_generator(self,
input_file,
batch_size,
epoch,
dev_count=1,
shuffle=True,
phase=None,
do_decode=False,
place=None):
"""data generator"""
examples = self._read_tsv(input_file)
if do_decode:
features = {}
for example in examples:
features[example.data_id] = example
self.features[phase] = features
def wrapper():
"""wrapper"""
all_dev_batches = []
for epoch_index in range(epoch):
if phase == "train":
self.current_example = 0
self.current_epoch = epoch_index
trainer_id = self.trainer_id
if shuffle:
np.random.shuffle(examples)
for batch_data in self._prepare_batch_data(
examples, batch_size, phase=phase, do_decode=do_decode, place=place):
if len(all_dev_batches) < dev_count:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == dev_count:
yield all_dev_batches[trainer_id]
all_dev_batches = []
if phase != "train":
if trainer_id < len(all_dev_batches):
yield all_dev_batches[trainer_id]
return wrapper
def _to_lodtensor(self, data, place, lod=None):
data_tensor = fluid.LoDTensor()
data_tensor.set(data, place)
if lod is not None:
data_tensor.set_lod(lod)
return data_tensor
def _pad_batch_records(self, batch_records, do_decode, place):
batch_token_ids = [record.token_ids for record in batch_records]
batch_position_ids = [record.position_ids for record in batch_records]
batch_tgt_start_idx = [record.tgt_start_idx for record in batch_records]
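# gen_seq2seq_mask presumably builds the UniLM-style joint mask (an assumption
# from its usage here): positions before tgt_start_idx attend bidirectionally
# over the source, while target positions attend to the source plus earlier
# target positions only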
input_mask = gen_seq2seq_mask(batch_token_ids, batch_tgt_start_idx)
if self.is_dialogue_task:
batch_role_ids = [record.role_ids for record in batch_records]
batch_turn_ids = [record.turn_ids for record in batch_records]
to_pad_list = [batch_token_ids, batch_role_ids, batch_turn_ids, batch_position_ids]
else:
batch_text_type_ids = [record.text_type_ids for record in batch_records]
to_pad_list = [batch_token_ids, batch_text_type_ids, batch_position_ids]
return_list = []
for ids in to_pad_list:
return_list.append(pad_batch_data(ids, pad_idx=self.pad_id))
return_list.append(input_mask)
batch_size = len(batch_tgt_start_idx)
max_len = return_list[0].shape[1]
if do_decode:
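# decode-time inputs for beam search: every example starts from a single [CLS]
# token with a zero initial score; the LoD info marks one candidate sequence
# per example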
batch_data_ids = [record.data_id for record in batch_records]
tgt_word = np.array([[self.cls_id]] * len(batch_token_ids),
dtype="int64").reshape([-1, 1, 1])
if self.continuous_position:
tgt_pos_id = np.array(batch_tgt_start_idx, dtype="int64").reshape([-1, 1, 1])
else:
tgt_pos_id = np.full_like(batch_tgt_start_idx, 2, dtype="int64").reshape([-1, 1, 1])  # positions start from 2
init_score = np.zeros_like(tgt_word, dtype="float32").reshape([-1, 1])
lods = [list(range(tgt_word.shape[0] + 1))] * 2
init_score = self._to_lodtensor(init_score, place, lods)
tgt_word = self._to_lodtensor(tgt_word, place, lods)
tgt_pos_id = self._to_lodtensor(tgt_pos_id, place, lods)
init_idx = np.array(range(len(batch_token_ids)), dtype="int32")
tgt_src_attn_bias = np.tile(input_mask[:, ::max_len, :], [1, 1, 1]).astype("float32")
data_ids = np.array(batch_data_ids).astype("int64").reshape([-1, 1])
return_list += [tgt_word, tgt_pos_id, init_score, init_idx,
tgt_src_attn_bias, data_ids]
else:
tgt_label = []
for i in range(len(batch_token_ids)):
tgt_idxs = range(batch_tgt_start_idx[i] + 1, len(batch_token_ids[i]))
tgt_label.extend(batch_token_ids[i][idx] for idx in tgt_idxs)
tgt_label = np.array(tgt_label).astype("int64").reshape([-1, 1])
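# tgt_pos holds flattened indices (into the [batch_size * max_len] outputs) of
# the decoder positions whose next-token predictions are supervised by tgt_label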
tgt_pos = sum(list(map(lambda i: list(range(max_len * i + batch_tgt_start_idx[i],
max_len * i + len(batch_token_ids[i]) - 1)),
range(batch_size))), [])
tgt_pos = np.array(tgt_pos).reshape([-1, 1]).astype('int64')
return_list += [tgt_label, tgt_pos]
return return_list
if __name__ == '__main__':
pass
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on classification tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import multiprocessing
import paddle.fluid as fluid
import numpy as np
from reader.classification_reader import ClassifyReader
from model.unimo_finetune import UNIMOConfig
from model.tokenization import GptBpeTokenizer
from finetune.classifier import create_model, evaluate, predict
from utils.optimization import optimization
from utils.utils import get_time
from utils.args import print_arguments
from utils.init import init_pretraining_params, init_checkpoint
from args.classification_args import parser
args = parser.parse_args()
def main(args):
"""main"""
model_config = UNIMOConfig(args.unimo_config_path)
model_config.print_config()
gpu_id = 0
gpus = fluid.core.get_cuda_device_count()
if args.is_distributed and os.getenv("FLAGS_selected_gpus") is not None:
gpu_list = os.getenv("FLAGS_selected_gpus").split(",")
gpus = len(gpu_list)
gpu_id = int(gpu_list[0])
if args.use_cuda:
place = fluid.CUDAPlace(gpu_id)
dev_count = gpus
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
tokenizer = GptBpeTokenizer(vocab_file=args.unimo_vocab_file,
encoder_json_file=args.encoder_json_file,
vocab_bpe_file=args.vocab_bpe_file,
do_lower_case=args.do_lower_case)
data_reader = ClassifyReader(tokenizer, args)
if not (args.do_train or args.do_val or args.do_val_hard \
or args.do_test or args.do_test_hard or args.do_diagnostic):
raise ValueError("For args `do_train`, `do_val`, `do_val_hard`, `do_test`," \
" `do_test_hard` and `do_diagnostic`, at least one of them must be True.")
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
trainers_num = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
train_data_generator = data_reader.data_generator(
input_file=args.train_set,
batch_size=args.batch_size,
epoch=args.epoch,
dev_count=trainers_num,
shuffle=True,
phase="train")
num_train_examples = data_reader.get_num_examples(args.train_set)
if args.in_tokens:
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // trainers_num
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // trainers_num
warmup_steps = int(max_train_steps * args.warmup_proportion)
print("Device count: %d, gpu_id: %d" % (dev_count, gpu_id))
print("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps)
print("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, graph_vars = create_model(
args,
pyreader_name='train_reader',
config=model_config)
scheduled_lr, loss_scaling = optimization(
loss=graph_vars["loss"],
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_fp16=args.use_fp16,
use_dynamic_loss_scaling=args.use_dynamic_loss_scaling,
init_loss_scaling=args.init_loss_scaling,
beta1=args.beta1,
beta2=args.beta2,
epsilon=args.epsilon)
if args.verbose:
if args.in_tokens:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program,
batch_size=args.batch_size // args.max_seq_len)
else:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size)
print("Theoretical memory usage in training: %.3f - %.3f %s" %
(lower_mem, upper_mem, unit))
if args.do_val or args.do_val_hard or args.do_test or args.do_test_hard \
or args.do_pred or args.do_pred_hard or args.do_diagnostic:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, graph_vars = create_model(
args,
pyreader_name='test_reader',
config=model_config)
test_prog = test_prog.clone(for_test=True)
nccl2_num_trainers = 1
nccl2_trainer_id = 0
print("args.is_distributed:", args.is_distributed)
if args.is_distributed:
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = worker_endpoints_env.split(",")
trainers_num = len(worker_endpoints)
print("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
trainer_id:{}".format(worker_endpoints, trainers_num,
current_endpoint, trainer_id))
# prepare nccl2 env.
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
if args.nccl_comm_num > 1:
config.nccl_comm_num = args.nccl_comm_num
if args.use_hierarchical_allreduce and trainers_num > args.hierarchical_allreduce_inter_nranks:
config.use_hierarchical_allreduce = args.use_hierarchical_allreduce
config.hierarchical_allreduce_inter_nranks = args.hierarchical_allreduce_inter_nranks
assert config.hierarchical_allreduce_inter_nranks > 1
assert trainers_num % config.hierarchical_allreduce_inter_nranks == 0
config.hierarchical_allreduce_exter_nranks = \
trainers_num // config.hierarchical_allreduce_inter_nranks
t = fluid.DistributeTranspiler(config=config)
t.transpile(
trainer_id,
trainers=worker_endpoints_env,
current_endpoint=current_endpoint,
program=train_program if args.do_train else test_prog,
startup_program=startup_prog)
nccl2_num_trainers = trainers_num
nccl2_trainer_id = trainer_id
exe = fluid.Executor(place)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
print(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
init_checkpoint(
exe,
args.init_checkpoint,
main_program=train_program)
elif args.init_pretraining_params:
init_pretraining_params(
exe,
args.init_pretraining_params,
main_program=train_program)
elif args.do_val or args.do_val_hard or args.do_test or args.do_test_hard \
or args.do_pred or args.do_pred_hard or args.do_diagnostic:
if not args.init_checkpoint:
raise ValueError("args 'init_checkpoint' should be set if"
"only doing validation or testing!")
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog)
if args.do_train:
exec_strategy = fluid.ExecutionStrategy()
if args.use_fast_executor:
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = dev_count
exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
loss_name=graph_vars["loss"].name,
exec_strategy=exec_strategy,
main_program=train_program,
num_trainers=nccl2_num_trainers,
trainer_id=nccl2_trainer_id)
train_pyreader.decorate_tensor_provider(train_data_generator)
else:
train_exe = None
test_exe = exe
if args.do_val or args.do_val_hard or args.do_test or args.do_test_hard \
or args.do_pred or args.do_pred_hard or args.do_diagnostic:
if args.use_multi_gpu_test:
test_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
main_program=test_prog,
share_vars_from=train_exe)
dev_ret_history = [] # (steps, key_eval, eval)
dev_hard_ret_history = [] # (steps, key_eval, eval)
test_ret_history = [] # (steps, key_eval, eval)
test_hard_ret_history = [] # (steps, key_eval, eval)
if args.do_train:
train_pyreader.start()
steps = 0
if warmup_steps > 0:
graph_vars["learning_rate"] = scheduled_lr
time_begin = time.time()
skip_steps = args.skip_steps
while True:
try:
steps += 1
if steps % skip_steps == 0:
train_fetch_list = [
graph_vars["loss"].name,
graph_vars["accuracy"].name,
graph_vars["num_seqs"].name
]
if "learning_rate" in graph_vars:
train_fetch_list.append(graph_vars["learning_rate"].name)
res = train_exe.run(fetch_list=train_fetch_list)
outputs = {"loss": np.mean(res[0])}
if "learning_rate" in graph_vars:
outputs["learning_rate"] = float(res[3][0])
if args.verbose:
verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size(
)
verbose += "learning rate: %f" % (
outputs["learning_rate"]
if warmup_steps > 0 else args.learning_rate)
print(verbose)
current_example, current_epoch = data_reader.get_train_progress()
time_end = time.time()
used_time = time_end - time_begin
print("%s - epoch: %d, progress: %d/%d, step: %d, ave loss: %f, speed: %f steps/s" % \
(get_time(), current_epoch, current_example, num_train_examples, steps, \
outputs["loss"], args.skip_steps / used_time))
time_begin = time.time()
else:
train_exe.run(fetch_list=[])
if nccl2_trainer_id == 0:
if steps % args.save_steps == 0 and args.save_checkpoints:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if steps % args.validation_steps == 0:
# evaluate dev set
if args.do_val:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.dev_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
outputs = evaluate(args, test_exe, test_prog, test_pyreader, graph_vars, "dev")
dev_ret_history.append((steps, outputs['key_eval'], outputs[outputs['key_eval']]))
# evaluate dev_hard set
if args.do_val_hard:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.dev_hard_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
outputs = evaluate(args, test_exe, test_prog, test_pyreader, graph_vars, "dev_hard")
dev_hard_ret_history.append((steps, outputs['key_eval'], outputs[outputs['key_eval']]))
# evaluate test set
if args.do_test:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
outputs = evaluate(args, test_exe, test_prog, test_pyreader, graph_vars, "test")
test_ret_history.append((steps, outputs['key_eval'], outputs[outputs['key_eval']]))
# evaluate test_hard set
if args.do_test_hard:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.test_hard_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
outputs = evaluate(args, test_exe, test_prog, test_pyreader, graph_vars, "test_hard")
test_hard_ret_history.append((steps, outputs['key_eval'], outputs[outputs['key_eval']]))
# pred diagnostic set
if args.do_diagnostic:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.diagnostic_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
qids, preds, probs = predict(test_exe, test_prog, test_pyreader, graph_vars, dev_count=1)
save_path = args.pred_save + '.diagnostic.' + str(steps) + '.txt'
print("testing {}, save to {}".format(args.diagnostic_set, save_path))
with open(save_path, 'w') as f:
for id, s, p in zip(qids, preds, probs):
f.write('{}\t{}\t{}\n'.format(id, s, p))
# pred test set
if args.do_pred:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
qids, preds, probs = predict(test_exe, test_prog, test_pyreader, graph_vars, dev_count=1)
save_path = args.pred_save + '.test.' + str(steps) + '.txt'
print("testing {}, save to {}".format(args.test_set, save_path))
with open(save_path, 'w') as f:
for id, s, p in zip(qids, preds, probs):
f.write('{}\t{}\t{}\n'.format(id, s, p))
# pred test hard set
if args.do_pred_hard:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.test_hard_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
qids, preds, probs = predict(test_exe, test_prog, test_pyreader, graph_vars, dev_count=1)
save_path = args.pred_save + '.test_hard.' + str(steps) + '.txt'
print("testing {}, save to {}".format(args.test_hard_set, save_path))
with open(save_path, 'w') as f:
for id, s, p in zip(qids, preds, probs):
f.write('{}\t{}\t{}\n'.format(id, s, p))
except fluid.core.EOFException:
if args.save_checkpoints:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
train_pyreader.reset()
break
if nccl2_trainer_id == 0:
# final pred on diagnostic set
if args.do_diagnostic:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.diagnostic_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
qids, preds, probs = predict(test_exe, test_prog, test_pyreader, graph_vars, dev_count=1)
save_path = args.pred_save + '.diagnostic.' + str(steps) + '.txt'
print("testing {}, save to {}".format(args.diagnostic_set, save_path))
with open(save_path, 'w') as f:
for id, s, p in zip(qids, preds, probs):
f.write('{}\t{}\t{}\n'.format(id, s, p))
# final pred on test set
if args.do_pred:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
qids, preds, probs = predict(test_exe, test_prog, test_pyreader, graph_vars, dev_count=1)
save_path = args.pred_save + '.test.' + str(steps) + '.txt'
print("testing {}, save to {}".format(args.test_set, save_path))
with open(save_path, 'w') as f:
for id, s, p in zip(qids, preds, probs):
f.write('{}\t{}\t{}\n'.format(id, s, p))
# final pred on test_hard set
if args.do_pred_hard:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.test_hard_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
qids, preds, probs = predict(test_exe, test_prog, test_pyreader, graph_vars, dev_count=1)
save_path = args.pred_save + '.test_hard.' + str(steps) + '.txt'
print("testing {}, save to {}".format(args.test_hard_set, save_path))
with open(save_path, 'w') as f:
for id, s, p in zip(qids, preds, probs):
f.write('{}\t{}\t{}\n'.format(id, s, p))
# final eval on test set
if args.do_test:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
print("Final test result:")
outputs = evaluate(args, test_exe, test_prog, test_pyreader, graph_vars, "test")
test_ret_history.append((steps, outputs['key_eval'], outputs[outputs['key_eval']]))
test_ret_history = sorted(test_ret_history, key=lambda a: a[2], reverse=True)
print("Best testing result: step %d %s %f" % (
test_ret_history[0][0], test_ret_history[0][1], test_ret_history[0][2]))
# final eval on test hard set
if args.do_test_hard:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.test_hard_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
print("Final test_hard result:")
outputs = evaluate(args, test_exe, test_prog, test_pyreader, graph_vars, "test_hard")
test_hard_ret_history.append((steps, outputs['key_eval'], outputs[outputs['key_eval']]))
test_hard_ret_history = sorted(test_hard_ret_history, key=lambda a: a[2], reverse=True)
print("Best testing hard result: step %d %s %f" % (
test_hard_ret_history[0][0], test_hard_ret_history[0][1], test_hard_ret_history[0][2]))
# final eval on dev set
if args.do_val:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.dev_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
print("Final validation result:")
outputs = evaluate(args, test_exe, test_prog, test_pyreader, graph_vars, "dev")
dev_ret_history.append((steps, outputs['key_eval'], outputs[outputs['key_eval']]))
dev_ret_history = sorted(dev_ret_history, key=lambda a: a[2], reverse=True)
print("Best validation result: step %d %s %f" % (
dev_ret_history[0][0], dev_ret_history[0][1], dev_ret_history[0][2]))
# final eval on dev hard set
if args.do_val_hard:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.dev_hard_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
print("Final validation_hard result:")
outputs = evaluate(args, test_exe, test_prog, test_pyreader, graph_vars, "dev_hard")
dev_hard_ret_history.append((steps, outputs['key_eval'], outputs[outputs['key_eval']]))
dev_hard_ret_history = sorted(dev_hard_ret_history, key=lambda a: a[2], reverse=True)
print("Best validation_hard result: step %d %s %f" % (
dev_hard_ret_history[0][0], dev_hard_ret_history[0][1], dev_hard_ret_history[0][2]))
if __name__ == '__main__':
print_arguments(args)
main(args)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on image-to-text generation tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import multiprocessing
import paddle.fluid as fluid
from reader.img2txt_reader import Img2TxtReader
from model.tokenization import GptBpeTokenizer
from model.unimo_finetune import UNIMOConfig
from utils.optimization import optimization
from utils.init import init_model
from utils.args import print_arguments
from utils.utils import visualdl_log
from finetune.img2txt import Img2Txt
from args.img2txt_args import parser
from functools import partial
from collections import OrderedDict
args = parser.parse_args()
def get_time():
"""get time"""
res = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
return res
def evaluate_datasets(pyreader, reader, eval_func, data_generator,
do_pred=False, suffix="out"):
"""evaluate"""
def evaluate_dataset(phase, path):
"""run evaluation"""
pyreader.set_batch_generator(data_generator(filelist=path, phase=phase))
eval_func(eval_phase="%s_%s" % (phase, suffix))
if args.do_val:
evaluate_dataset("dev", args.valid_filelist)
if args.do_test:
evaluate_dataset("test", args.test_filelist)
if args.do_pred and do_pred:
evaluate_dataset("pred", args.test_filelist)
def save_checkpoint(program, exe, suffix):
"""save model checkpoint"""
save_path = os.path.join(args.checkpoints, suffix)
fluid.io.save_persistables(exe, save_path, program)
def main(args):
"""main func"""
unimo_config = UNIMOConfig(args.unimo_config_path)
if args.hidden_dropout_prob >= 0:
unimo_config["hidden_dropout_prob"] = args.hidden_dropout_prob
if args.attention_probs_dropout_prob >= 0:
unimo_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob
unimo_config.print_config()
if args.pred_batch_size <= 0:
args.pred_batch_size = args.batch_size
gpu_id = 0
gpus = fluid.core.get_cuda_device_count()
if args.is_distributed and os.getenv("FLAGS_selected_gpus") is not None:
gpu_list = os.getenv("FLAGS_selected_gpus").split(",")
gpus = len(gpu_list)
gpu_id = int(gpu_list[0])
if args.use_cuda:
place = fluid.CUDAPlace(gpu_id)
dev_count = gpus
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
"""load vocabulary"""
tokenizer = GptBpeTokenizer(vocab_file=args.unimo_vocab_file,
encoder_json_file=args.encoder_json_file,
vocab_bpe_file=args.vocab_bpe_file,
do_lower_case=True)
reader = Img2TxtReader(tokenizer, args)
img2txt = Img2Txt(args, unimo_config, tokenizer)
if not (args.do_train or args.do_val or args.do_test or args.do_pred):
raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
"least one of them must be True.")
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
trainers_num = int(os.getenv("PADDLE_TRAINERS_NUM", 1))
train_data_generator = reader.data_generator(
filelist=args.train_filelist,
batch_size=args.batch_size,
epoch=args.epoch,
dev_count=trainers_num,
shuffle=True,
phase="train")
num_train_examples = reader.get_num_examples(args.train_filelist) # 566747
max_train_steps = args.epoch * num_train_examples // args.batch_size // trainers_num
warmup_steps = int(max_train_steps * args.warmup_proportion)
print("Device count: %d, gpu_id: %d" % (dev_count, gpu_id))
print("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps)
print("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
print("using adv_type is ", args.adv_type)
if args.adv_type == "freelb_text":
train_pyreader, graph_vars = img2txt.create_model_freelb_text()
elif args.adv_type == "freelb_image":
train_pyreader, graph_vars = img2txt.create_model_freelb_image()
elif args.adv_type == "villa":
train_pyreader, graph_vars = img2txt.create_model_villa()
else:
print("Unsupported adv_type, run model without adversial training")
train_pyreader, graph_vars = img2txt.create_model()
scheduled_lr, loss_scaling = optimization(
loss=graph_vars["loss"],
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_fp16=args.use_fp16,
use_dynamic_loss_scaling=args.use_dynamic_loss_scaling,
init_loss_scaling=args.init_loss_scaling,
beta1=args.beta1,
beta2=args.beta2,
epsilon=args.epsilon)
if args.do_val or args.do_test or args.do_pred:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, test_graph_vars = img2txt.create_model(decoding=args.do_decode)
test_prog = test_prog.clone(for_test=True)
nccl2_num_trainers = 1
nccl2_trainer_id = 0
print("args.is_distributed:", args.is_distributed)
if args.is_distributed:
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = worker_endpoints_env.split(",")
trainers_num = len(worker_endpoints)
print("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
trainer_id:{}".format(worker_endpoints, trainers_num,
current_endpoint, trainer_id))
# prepare nccl2 env.
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
if args.nccl_comm_num > 1:
config.nccl_comm_num = args.nccl_comm_num
if args.use_hierarchical_allreduce and trainers_num > args.hierarchical_allreduce_inter_nranks:
config.use_hierarchical_allreduce = args.use_hierarchical_allreduce
config.hierarchical_allreduce_inter_nranks = args.hierarchical_allreduce_inter_nranks
assert config.hierarchical_allreduce_inter_nranks > 1
assert trainers_num % config.hierarchical_allreduce_inter_nranks == 0
config.hierarchical_allreduce_exter_nranks = \
trainers_num // config.hierarchical_allreduce_inter_nranks
t = fluid.DistributeTranspiler(config=config)
t.transpile(
trainer_id,
trainers=worker_endpoints_env,
current_endpoint=current_endpoint,
program=train_program if args.do_train else test_prog,
startup_program=startup_prog)
nccl2_num_trainers = trainers_num
nccl2_trainer_id = trainer_id
exe = fluid.Executor(place)
exe.run(startup_prog)
init_model(args, exe, train_program if args.do_train else test_prog)
if args.do_train:
exec_strategy = fluid.ExecutionStrategy()
if args.use_fast_executor:
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = 4 if args.use_fp16 else 2  # 4 threads for fp16, 2 for fp32
exec_strategy.num_iteration_per_drop_scope = min(args.num_iteration_per_drop_scope, args.skip_steps)
build_strategy = fluid.BuildStrategy()
build_strategy.remove_unnecessary_lock = False
if args.use_fuse:
build_strategy.fuse_all_reduce_ops = True
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
loss_name=graph_vars["loss"].name,
build_strategy=build_strategy,
exec_strategy=exec_strategy,
main_program=train_program,
num_trainers=nccl2_num_trainers,
trainer_id=nccl2_trainer_id)
train_pyreader.set_batch_generator(train_data_generator)
train_resource = {"exe": train_exe,
"program": train_program,
"pyreader": train_pyreader}
save_model = partial(save_checkpoint, program=train_program, exe=exe)
test_dev_count = 1
if args.do_val or args.do_test or args.do_pred:
test_exe = exe
if args.use_multi_gpu_test:
test_dev_count = nccl2_num_trainers
test_resource = {"exe": test_exe,
"program": test_prog,
"pyreader": test_pyreader}
eval_data_generator = partial(reader.data_generator, batch_size=args.pred_batch_size,
epoch=1, dev_count=test_dev_count, shuffle=False, do_decode=args.do_decode,
place=place)
eval_func = partial(img2txt.evaluate, resource=test_resource, graph_vars=test_graph_vars,
dev_count=test_dev_count, output_path=args.checkpoints, gpu_id=nccl2_trainer_id)
evaluate = partial(evaluate_datasets, pyreader=test_pyreader, reader=reader,
eval_func=eval_func, data_generator=eval_data_generator)
if args.do_train:
train_pyreader.start()
steps = 0
last_epoch = 0
if warmup_steps > 0:
graph_vars["learning_rate"] = scheduled_lr
time_begin = time.time()
skip_steps = args.skip_steps
while True:
try:
steps += 1
if args.save_and_valid_by_epoch:
suffix = "epoch_" + str(last_epoch)
else:
suffix = "step_" + str(steps)
if steps % skip_steps == 0:
outputs = img2txt.evaluate(train_resource, "train", graph_vars)
if args.verbose:
verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size()
verbose += "learning rate: %.8f" % (
outputs["learning_rate"] if warmup_steps > 0 else args.learning_rate)
print(verbose)
current_epoch = steps * args.batch_size * trainers_num // num_train_examples
current_example = steps * args.batch_size * trainers_num % num_train_examples
time_end = time.time()
used_time = time_end - time_begin
print("%s - epoch: %d, progress: %d/%d, step: %d, loss: %f, "
"ppl: %f, speed: %f steps/s"
% (get_time(), current_epoch, current_example, num_train_examples,
steps, outputs["loss"], outputs["ppl"],
args.skip_steps / used_time))
time_begin = time.time()
if args.visualdl_log and nccl2_trainer_id == 0:
visuallog_dict = OrderedDict()
visuallog_dict["ppl"] = outputs["ppl"]
visualdl_log(visuallog_dict, outputs["ppl"], steps, phase='train')
else:
train_exe.run(fetch_list=[])
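# only the first test_dev_count trainers handle checkpointing/evaluation below;
# the remaining trainers skip straight to the next training step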
if nccl2_trainer_id >= test_dev_count:
continue
do_save = False
do_eval = False
if not args.save_and_valid_by_epoch:
if steps % args.save_steps == 0 and nccl2_trainer_id == 0:
do_save = True
if steps % args.validation_steps == 0:
do_eval = True
else:
current_epoch = steps * args.batch_size * trainers_num // num_train_examples
if current_epoch != last_epoch:
if nccl2_trainer_id == 0:
do_save = True
do_eval = True
if do_save:
save_model(suffix=suffix)
if do_eval:
if args.do_val or args.do_test or args.do_pred:
evaluate(suffix=suffix)
if args.save_and_valid_by_epoch:
last_epoch = current_epoch
except fluid.core.EOFException:
save_model(suffix=suffix)
train_pyreader.reset()
break
if nccl2_trainer_id >= test_dev_count:
return
if args.do_val or args.do_test or args.do_pred:
suffix = "output"
if args.do_train:
if not args.save_and_valid_by_epoch:
suffix = "step_" + str(steps)
else:
suffix = "epoch_" + str(last_epoch)
evaluate(suffix=suffix, do_pred=True)
if __name__ == '__main__':
print_arguments(args)
main(args)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on regression tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import multiprocessing
import paddle.fluid as fluid
import numpy as np
from reader.regression_reader import RegressionReader
from model.unimo_finetune import UNIMOConfig
from model.tokenization import GptBpeTokenizer
from finetune.regression import create_model, evaluate, predict
from utils.optimization import optimization
from utils.utils import get_time
from utils.args import print_arguments
from utils.init import init_pretraining_params, init_checkpoint
from args.regression_args import parser
args = parser.parse_args()
def main(args):
"""main"""
model_config = UNIMOConfig(args.unimo_config_path)
model_config.print_config()
gpu_id = 0
gpus = fluid.core.get_cuda_device_count()
if args.is_distributed and os.getenv("FLAGS_selected_gpus") is not None:
gpu_list = os.getenv("FLAGS_selected_gpus").split(",")
gpus = len(gpu_list)
gpu_id = int(gpu_list[0])
if args.use_cuda:
place = fluid.CUDAPlace(gpu_id)
dev_count = gpus
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
tokenizer = GptBpeTokenizer(vocab_file=args.unimo_vocab_file,
encoder_json_file=args.encoder_json_file,
vocab_bpe_file=args.vocab_bpe_file,
do_lower_case=args.do_lower_case)
data_reader = RegressionReader(tokenizer, args)
if not (args.do_train or args.do_val or args.do_test):
raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
"least one of them must be True.")
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
trainers_num = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
train_data_generator = data_reader.data_generator(
input_file=args.train_set,
batch_size=args.batch_size,
epoch=args.epoch,
dev_count=trainers_num,
shuffle=True,
phase="train")
num_train_examples = data_reader.get_num_examples(args.train_set)
if args.in_tokens:
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // trainers_num
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // trainers_num
warmup_steps = int(max_train_steps * args.warmup_proportion)
print("Device count: %d, gpu_id: %d" % (dev_count, gpu_id))
print("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps)
print("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, graph_vars = create_model(
args,
pyreader_name='train_reader',
config=model_config)
scheduled_lr, loss_scaling = optimization(
loss=graph_vars["loss"],
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_fp16=args.use_fp16,
use_dynamic_loss_scaling=args.use_dynamic_loss_scaling,
init_loss_scaling=args.init_loss_scaling,
beta1=args.beta1,
beta2=args.beta2,
epsilon=args.epsilon)
if args.verbose:
if args.in_tokens:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program,
batch_size=args.batch_size // args.max_seq_len)
else:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size)
print("Theoretical memory usage in training: %.3f - %.3f %s" %
(lower_mem, upper_mem, unit))
if args.do_val or args.do_test or args.do_pred:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, graph_vars = create_model(
args,
pyreader_name='test_reader',
config=model_config)
test_prog = test_prog.clone(for_test=True)
nccl2_num_trainers = 1
nccl2_trainer_id = 0
print("args.is_distributed:", args.is_distributed)
if args.is_distributed:
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = worker_endpoints_env.split(",")
trainers_num = len(worker_endpoints)
print("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
trainer_id:{}".format(worker_endpoints, trainers_num,
current_endpoint, trainer_id))
# prepare nccl2 env.
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
if args.nccl_comm_num > 1:
config.nccl_comm_num = args.nccl_comm_num
if args.use_hierarchical_allreduce and trainers_num > args.hierarchical_allreduce_inter_nranks:
config.use_hierarchical_allreduce = args.use_hierarchical_allreduce
config.hierarchical_allreduce_inter_nranks = args.hierarchical_allreduce_inter_nranks
assert config.hierarchical_allreduce_inter_nranks > 1
assert trainers_num % config.hierarchical_allreduce_inter_nranks == 0
config.hierarchical_allreduce_exter_nranks = \
trainers_num // config.hierarchical_allreduce_inter_nranks  # integer division; nranks must be an int
t = fluid.DistributeTranspiler(config=config)
t.transpile(
trainer_id,
trainers=worker_endpoints_env,
current_endpoint=current_endpoint,
program=train_program if args.do_train else test_prog,
startup_program=startup_prog)
nccl2_num_trainers = trainers_num
nccl2_trainer_id = trainer_id
exe = fluid.Executor(place)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
print(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
init_checkpoint(
exe,
args.init_checkpoint,
main_program=train_program)
elif args.init_pretraining_params:
init_pretraining_params(
exe,
args.init_pretraining_params,
main_program=train_program)
elif args.do_val or args.do_test or args.do_pred:
if not args.init_checkpoint:
raise ValueError("args 'init_checkpoint' should be set if"
"only doing validation or testing!")
init_checkpoint(
exe,
args.init_checkpoint,
main_program=startup_prog)
if args.do_train:
exec_strategy = fluid.ExecutionStrategy()
if args.use_fast_executor:
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = dev_count
exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
loss_name=graph_vars["loss"].name,
exec_strategy=exec_strategy,
main_program=train_program,
num_trainers=nccl2_num_trainers,
trainer_id=nccl2_trainer_id)
train_pyreader.decorate_tensor_provider(train_data_generator)
else:
train_exe = None
test_exe = exe
if args.do_val or args.do_test or args.do_pred:
if args.use_multi_gpu_test:
test_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
main_program=test_prog,
share_vars_from=train_exe)
dev_ret_history = [] # (steps, key_eval, eval)
if args.do_train:
train_pyreader.start()
steps = 0
if warmup_steps > 0:
graph_vars["learning_rate"] = scheduled_lr
time_begin = time.time()
skip_steps = args.skip_steps
while True:
try:
steps += 1
if steps % skip_steps == 0:
train_fetch_list = [
graph_vars["loss"].name,
]
if "learning_rate" in graph_vars:
train_fetch_list.append(graph_vars["learning_rate"].name)
res = train_exe.run(fetch_list=train_fetch_list)
outputs = {"loss": np.mean(res[0])}
if "learning_rate" in graph_vars:
outputs["learning_rate"] = float(res[1][0])
if args.verbose:
verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size(
)
verbose += "learning rate: %f" % (
outputs["learning_rate"]
if warmup_steps > 0 else args.learning_rate)
print(verbose)
current_example, current_epoch = data_reader.get_train_progress()
time_end = time.time()
used_time = time_end - time_begin
print("%s - epoch: %d, progress: %d/%d, step: %d, ave loss: %f, speed: %f steps/s" % \
(get_time(), current_epoch, current_example, num_train_examples, steps, \
outputs["loss"], args.skip_steps / used_time))
time_begin = time.time()
else:
train_exe.run(fetch_list=[])
if nccl2_trainer_id == 0:
if steps % args.save_steps == 0 and args.save_checkpoints:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if steps % args.validation_steps == 0:
# evaluate dev set
if args.do_val:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.dev_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
outputs = evaluate(args, test_exe, test_prog, test_pyreader, graph_vars, "dev")
dev_ret_history.append((steps, outputs['key_eval'], outputs[outputs['key_eval']]))
# evaluate test set
if args.do_test:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
outputs = evaluate(args, test_exe, test_prog, test_pyreader, graph_vars, "test")
if args.do_pred:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
qids, preds, probs = predict(test_exe, test_prog, test_pyreader, graph_vars, dev_count=1)
save_path = args.pred_save + '.test.' + str(steps) + '.txt'
print("testing {}, save to {}".format(args.test_set, save_path))
with open(save_path, 'w') as f:
for id, s, p in zip(qids, preds, probs):
f.write('{}\t{}\t{}\n'.format(id, s, p))
except fluid.core.EOFException:
if args.save_checkpoints:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
train_pyreader.reset()
break
if nccl2_trainer_id == 0:
# final eval on dev set
if args.do_val:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.dev_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
print("Final validation result:")
outputs = evaluate(args, test_exe, test_prog, test_pyreader, graph_vars, "dev")
dev_ret_history.append((steps, outputs['key_eval'], outputs[outputs['key_eval']]))
dev_ret_history = sorted(dev_ret_history, key=lambda a: a[2], reverse=True)
print("Best validation result: step %d %s %f" \
% (dev_ret_history[0][0], dev_ret_history[0][1], dev_ret_history[0][2]))
# final eval on test set
if args.do_test:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
print("Final test result:")
outputs = evaluate(args, test_exe, test_prog, test_pyreader, graph_vars, "test")
# final eval on test set
if args.do_pred:
test_pyreader.decorate_tensor_provider(
data_reader.data_generator(
args.test_set,
batch_size=args.batch_size,
epoch=1,
dev_count=1,
shuffle=False))
qids, preds, probs = predict(test_exe, test_prog, test_pyreader, graph_vars, dev_count=1)
save_path = args.pred_save + '.' + str(steps) + '.txt'
print("testing {}, save to {}".format(args.test_set, save_path))
with open(save_path, 'w') as f:
for id, s, p in zip(qids, preds, probs):
f.write('{}\t{}\t{}\n'.format(id, s, p))
if __name__ == '__main__':
print_arguments(args)
main(args)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on classification tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import multiprocessing
import paddle.fluid as fluid
import numpy as np
from reader.retrieval_reader import RetrievalTrainReader, RetrievalTestReader
from model.unimo_finetune import UNIMOConfig
from model.tokenization import GptBpeTokenizer
from finetune.retrieval import create_model, evaluate
from utils.optimization import optimization
from utils.args import print_arguments
from utils.utils import get_time
from utils.init import init_pretraining_params, init_checkpoint
from args.retrieval_args import parser
args = parser.parse_args()
def main(args):
"""main"""
model_config = UNIMOConfig(args.unimo_config_path)
model_config.print_config()
gpu_id = 0
gpus = fluid.core.get_cuda_device_count()
if args.is_distributed and os.getenv("FLAGS_selected_gpus") is not None:
gpu_list = os.getenv("FLAGS_selected_gpus").split(",")
gpus = len(gpu_list)
gpu_id = int(gpu_list[0])
if args.use_cuda:
place = fluid.CUDAPlace(gpu_id)
dev_count = gpus
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
tokenizer = GptBpeTokenizer(vocab_file=args.unimo_vocab_file,
encoder_json_file=args.encoder_json_file,
vocab_bpe_file=args.vocab_bpe_file,
do_lower_case=args.do_lower_case)
if not (args.do_train or args.do_val or args.do_test):
raise ValueError("For args `do_train`, `do_val`, `do_test`, at "
"least one of them must be True.")
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
trainers_num = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
if args.do_train:
train_data_reader = RetrievalTrainReader(tokenizer, args, args.train_image_feature_dir,
args.train_image_caption)
train_data_generator = train_data_reader.data_generator()
num_train_examples, captions_num, image_num = train_data_reader.get_num_examples()
step_num_per_epoch = num_train_examples // args.batch_size // trainers_num
max_train_steps = args.epoch * step_num_per_epoch
args.learning_rate_decay_step1 = args.learning_rate_decay_epoch1 * step_num_per_epoch
args.learning_rate_decay_step2 = args.learning_rate_decay_epoch2 * step_num_per_epoch
print("Device count: %d, gpu_id: %d" % (dev_count, gpu_id))
print("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps)
train_program = fluid.Program()
lr_boundaries = [args.learning_rate_decay_step1, args.learning_rate_decay_step2]
lr_value = [args.learning_rate * args.learning_rate_scale ** i for i in range(len(lr_boundaries)+1)]
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, graph_vars = create_model(
args,
phase='train',
config=model_config,
samples_num=args.samples_num + 1)
scheduled_lr, loss_scaling = optimization(
loss=graph_vars["loss"],
warmup_steps=args.warmup_step,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_fp16=args.use_fp16,
use_dynamic_loss_scaling=args.use_dynamic_loss_scaling,
init_loss_scaling=args.init_loss_scaling,
beta1=args.beta1,
beta2=args.beta2,
epsilon=args.epsilon,
boundaries=lr_boundaries,
values=lr_value)
if args.do_val or args.do_test:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, test_graph_vars = create_model(
args,
phase='dev',
config=model_config,
samples_num=1)
test_prog = test_prog.clone(for_test=True)
if args.do_val:
dev_data_reader = RetrievalTestReader(tokenizer, args, \
args.dev_image_feature_dir, args.dev_image_caption)
dev_data_generator = dev_data_reader.data_generator()
if args.do_test:
test_data_reader = RetrievalTestReader(tokenizer, args, \
args.test_image_feature_dir, args.test_image_caption)
test_data_generator = test_data_reader.data_generator()
nccl2_num_trainers = 1
nccl2_trainer_id = 0
print("args.is_distributed:", args.is_distributed)
if args.is_distributed:
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = worker_endpoints_env.split(",")
trainers_num = len(worker_endpoints)
print("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
trainer_id:{}".format(worker_endpoints, trainers_num,
current_endpoint, trainer_id))
# prepare nccl2 env.
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
if args.nccl_comm_num > 1:
config.nccl_comm_num = args.nccl_comm_num
if args.use_hierarchical_allreduce and trainers_num > args.hierarchical_allreduce_inter_nranks:
config.use_hierarchical_allreduce = args.use_hierarchical_allreduce
config.hierarchical_allreduce_inter_nranks = args.hierarchical_allreduce_inter_nranks
assert config.hierarchical_allreduce_inter_nranks > 1
assert trainers_num % config.hierarchical_allreduce_inter_nranks == 0
config.hierarchical_allreduce_exter_nranks = \
trainers_num // config.hierarchical_allreduce_inter_nranks  # integer division; nranks must be an int
t = fluid.DistributeTranspiler(config=config)
t.transpile(
trainer_id,
trainers=worker_endpoints_env,
current_endpoint=current_endpoint,
program=train_program if args.do_train else test_prog,
startup_program=startup_prog)
nccl2_num_trainers = trainers_num
nccl2_trainer_id = trainer_id
exe = fluid.Executor(place)
exe.run(startup_prog)
if args.do_train:
if not args.run_random:
if args.init_checkpoint and args.init_pretraining_params:
print(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
init_checkpoint(
exe,
args.init_checkpoint,
main_program=train_program)
elif args.init_pretraining_params:
init_pretraining_params(
exe,
args.init_pretraining_params,
main_program=train_program)
elif args.do_val or args.do_test:
args.init_checkpoint = args.init_pretraining_params
if not args.init_checkpoint:
raise ValueError("args 'init_checkpoint' should be set if"
"only doing validation or testing!")
init_checkpoint(
exe,
args.init_checkpoint,
main_program=test_prog)
if args.do_train:
exec_strategy = fluid.ExecutionStrategy()
if args.use_fast_executor:
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = 4 if args.use_fp16 else 2
exec_strategy.num_iteration_per_drop_scope = min(args.num_iteration_per_drop_scope, args.skip_steps)
build_strategy = fluid.BuildStrategy()
build_strategy.remove_unnecessary_lock = False
if args.use_fuse:
build_strategy.fuse_all_reduce_ops = True
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
loss_name=graph_vars["loss"].name,
build_strategy=build_strategy,
exec_strategy=exec_strategy,
main_program=train_program,
num_trainers=nccl2_num_trainers,
trainer_id=nccl2_trainer_id)
train_pyreader.set_batch_generator(train_data_generator, places=place)
else:
train_exe = None
if args.do_val or args.do_test:
test_exe = fluid.ParallelExecutor(use_cuda=args.use_cuda,
main_program=test_prog,
share_vars_from=train_exe)
dev_ret_history = [] # (steps, key_eval, eval)
test_ret_history = [] # (steps, key_eval, eval)
steps = 0
if args.do_train:
train_pyreader.start()
time_begin = time.time()
skip_steps = args.skip_steps
while True:
try:
steps += 1
if steps % skip_steps == 0:
train_fetch_list = [graph_vars["loss"].name, scheduled_lr.name]
res = train_exe.run(fetch_list=train_fetch_list)
outputs = {"loss": np.mean(res[0]), 'learning_rate': float(res[1][0])}
if args.verbose:
verbose = "train pyreader queue size: %d, learning_rate: %.10f" % \
(train_pyreader.queue.size(), outputs['learning_rate'])
print(verbose)
current_example, current_epoch = train_data_reader.get_train_progress()
time_end = time.time()
used_time = time_end - time_begin
print("%s - epoch: %d, progress: %d/%d, step: %d, ave loss: %f, speed: %f steps/s" % \
(get_time(), current_epoch, current_example, num_train_examples, \
steps, outputs["loss"], args.skip_steps / used_time))
time_begin = time.time()
else:
train_exe.run(fetch_list=[])
if nccl2_trainer_id == 0:
if steps % args.save_steps == 0 and args.save_checkpoints:
save_path = os.path.join(args.checkpoints,
"step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
if steps % args.validation_steps == 0:
# evaluate dev set
if args.do_val:
test_pyreader.set_batch_generator(dev_data_generator, places=place)
outputs = evaluate(args, test_exe, test_pyreader, test_graph_vars, "dev", \
trainers_num, nccl2_trainer_id, data_reader=dev_data_reader)
if nccl2_trainer_id == 0:
dev_ret_history.append((steps, outputs['key_eval'], outputs[outputs['key_eval']]))
# evaluate test set
if args.do_test:
test_pyreader.set_batch_generator(test_data_generator, places=place)
outputs = evaluate(args, test_exe, test_pyreader, test_graph_vars, "test", \
trainers_num, nccl2_trainer_id, data_reader=test_data_reader)
if nccl2_trainer_id == 0:
test_ret_history.append((steps, outputs['key_eval'], outputs[outputs['key_eval']]))
except fluid.core.EOFException:
if args.save_checkpoints:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, train_program)
train_pyreader.reset()
break
# final eval on dev set
if args.do_val:
test_pyreader.set_batch_generator(dev_data_generator, places=place)
if nccl2_trainer_id == 0:
print("Final validation result:")
outputs = evaluate(args, test_exe, test_pyreader, test_graph_vars, "dev", \
trainers_num, nccl2_trainer_id, data_reader=dev_data_reader)
if nccl2_trainer_id == 0:
dev_ret_history.append((steps, outputs['key_eval'], outputs[outputs['key_eval']]))
dev_ret_history = sorted(dev_ret_history, key=lambda a: a[2], reverse=True)
print("Best validation result: step %d %s %f" % \
(dev_ret_history[0][0], dev_ret_history[0][1], dev_ret_history[0][2]))
# final eval on test set
if args.do_test:
test_pyreader.set_batch_generator(test_data_generator, places=place)
if nccl2_trainer_id == 0:
print("Final test result:")
outputs = evaluate(args, test_exe, test_pyreader, test_graph_vars, "test", \
trainers_num, nccl2_trainer_id, data_reader=test_data_reader)
if __name__ == '__main__':
print_arguments(args)
main(args)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on seq2seq text generation tasks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import multiprocessing
import paddle.fluid as fluid
from reader.seq2seq_reader import Seq2SeqReader
from model.tokenization import GptBpeTokenizer
from model.unimo_finetune import UNIMOConfig
from utils.optimization import optimization
from utils.init import init_model
from utils.args import print_arguments
from utils.utils import visualdl_log
from finetune.seq2seq import Seq2Seq
from args.seq2seq_args import parser
from functools import partial
from collections import OrderedDict
args = parser.parse_args()
def evaluate_datasets(pyreader, reader, eval_func, data_generator,
do_pred=False, suffix="out"):
"""evaluate"""
def evaluate_dataset(phase, path):
"""run evaluation"""
pyreader.set_batch_generator(data_generator(input_file=path, phase=phase))
eval_func(eval_phase="%s_%s" % (phase, suffix),
features=reader.get_features(phase))
if args.do_val:
evaluate_dataset("dev", args.dev_set)
if args.do_test:
evaluate_dataset("test", args.test_set)
if args.do_pred and do_pred:
evaluate_dataset("pred", args.pred_set)
def save_checkpoint(program, exe, suffix):
"""save model checkpoint"""
save_path = os.path.join(args.checkpoints, suffix)
fluid.io.save_persistables(exe, save_path, program)
def main(args):
"""main func"""
unimo_config = UNIMOConfig(args.unimo_config_path)
if args.task_type == "dialog":
unimo_config["role_type_size"] = args.role_type_size
unimo_config["turn_type_size"] = args.turn_type_size
if args.hidden_dropout_prob >= 0:
unimo_config["hidden_dropout_prob"] = args.hidden_dropout_prob
if args.attention_probs_dropout_prob >= 0:
unimo_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob
unimo_config.print_config()
if args.pred_batch_size <= 0:
args.pred_batch_size = args.batch_size
gpu_id = 0
gpus = fluid.core.get_cuda_device_count()
if args.is_distributed and os.getenv("FLAGS_selected_gpus") is not None:
gpu_list = os.getenv("FLAGS_selected_gpus").split(",")
gpus = len(gpu_list)
gpu_id = int(gpu_list[0])
if args.use_cuda:
place = fluid.CUDAPlace(gpu_id)
dev_count = gpus
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
"""load vocabulary"""
tokenizer = GptBpeTokenizer(vocab_file=args.unimo_vocab_file,
encoder_json_file=args.encoder_json_file,
vocab_bpe_file=args.vocab_bpe_file,
do_lower_case=True)
reader = Seq2SeqReader(tokenizer, args)
unimo_seq2seq = Seq2Seq(args, unimo_config, tokenizer)
if not (args.do_train or args.do_val or args.do_test or args.do_pred):
raise ValueError("For args `do_train`, `do_val` and `do_test`, at "
"least one of them must be True.")
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.do_train:
trainers_num = int(os.getenv("PADDLE_TRAINERS_NUM", 1))
train_data_generator = reader.data_generator(
input_file=args.train_set,
batch_size=args.batch_size,
epoch=args.epoch,
dev_count=trainers_num,
shuffle=True,
phase="train")
num_train_examples = reader.get_num_examples(args.train_set)
if args.in_tokens:
max_train_steps = args.epoch * num_train_examples // (
args.batch_size // args.max_seq_len) // trainers_num
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // trainers_num
warmup_steps = int(max_train_steps * args.warmup_proportion)
print("Device count: %d, gpu_id: %d" % (dev_count, gpu_id))
print("Num train examples: %d" % num_train_examples)
print("Max train steps: %d" % max_train_steps)
print("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, graph_vars = unimo_seq2seq.create_model()
scheduled_lr, loss_scaling = optimization(
loss=graph_vars["loss"],
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_fp16=args.use_fp16,
use_dynamic_loss_scaling=args.use_dynamic_loss_scaling,
init_loss_scaling=args.init_loss_scaling,
beta1=args.beta1,
beta2=args.beta2,
epsilon=args.epsilon)
if args.verbose:
if args.in_tokens:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program,
batch_size=args.batch_size // args.max_seq_len)
else:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size)
print("Theoretical memory usage in training: %.3f - %.3f %s" %
(lower_mem, upper_mem, unit))
if args.do_val or args.do_test or args.do_pred:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, test_graph_vars = unimo_seq2seq.create_model(decoding=args.do_decode)
test_prog = test_prog.clone(for_test=True)
nccl2_num_trainers = 1
nccl2_trainer_id = 0
print("args.is_distributed:", args.is_distributed)
if args.is_distributed:
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = worker_endpoints_env.split(",")
trainers_num = len(worker_endpoints)
print("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
trainer_id:{}".format(worker_endpoints, trainers_num,
current_endpoint, trainer_id))
# prepare nccl2 env.
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
if args.nccl_comm_num > 1:
config.nccl_comm_num = args.nccl_comm_num
if args.use_hierarchical_allreduce and trainers_num > args.hierarchical_allreduce_inter_nranks:
config.use_hierarchical_allreduce = args.use_hierarchical_allreduce
config.hierarchical_allreduce_inter_nranks = args.hierarchical_allreduce_inter_nranks
assert config.hierarchical_allreduce_inter_nranks > 1
assert trainers_num % config.hierarchical_allreduce_inter_nranks == 0
config.hierarchical_allreduce_exter_nranks = \
trainers_num // config.hierarchical_allreduce_inter_nranks  # integer division; nranks must be an int
t = fluid.DistributeTranspiler(config=config)
t.transpile(
trainer_id,
trainers=worker_endpoints_env,
current_endpoint=current_endpoint,
program=train_program if args.do_train else test_prog,
startup_program=startup_prog)
nccl2_num_trainers = trainers_num
nccl2_trainer_id = trainer_id
exe = fluid.Executor(place)
exe.run(startup_prog)
init_model(args, exe, train_program if args.do_train else test_prog)
if args.do_train:
exec_strategy = fluid.ExecutionStrategy()
if args.use_fast_executor:
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = 4 if args.use_fp16 else 2  # 2 for fp32, 4 for fp16
exec_strategy.num_iteration_per_drop_scope = min(args.num_iteration_per_drop_scope, args.skip_steps)
build_strategy = fluid.BuildStrategy()
build_strategy.remove_unnecessary_lock = False
if args.use_fuse:
build_strategy.fuse_all_reduce_ops = True
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
loss_name=graph_vars["loss"].name,
build_strategy=build_strategy,
exec_strategy=exec_strategy,
main_program=train_program,
num_trainers=nccl2_num_trainers,
trainer_id=nccl2_trainer_id)
train_pyreader.set_batch_generator(train_data_generator)
train_resource = {"exe": train_exe,
"program": train_program,
"pyreader": train_pyreader}
save_model = partial(save_checkpoint, program=train_program, exe=exe)
test_dev_count = 1
if args.do_val or args.do_test or args.do_pred:
test_exe = exe
if args.use_multi_gpu_test:
test_dev_count = nccl2_num_trainers
test_resource = {"exe": test_exe,
"program": test_prog,
"pyreader": test_pyreader}
eval_data_generator = partial(reader.data_generator, batch_size=args.pred_batch_size,
epoch=1, dev_count=test_dev_count, shuffle=False, do_decode=args.do_decode,
place=place)
eval_func = partial(unimo_seq2seq.evaluate, resource=test_resource, graph_vars=test_graph_vars,
dev_count=test_dev_count, output_path=args.checkpoints, gpu_id=nccl2_trainer_id)
evaluate = partial(evaluate_datasets, pyreader=test_pyreader, reader=reader,
eval_func=eval_func, data_generator=eval_data_generator)
if args.do_train:
train_pyreader.start()
steps = 0
last_epoch = 0
if warmup_steps > 0:
graph_vars["learning_rate"] = scheduled_lr
time_begin = time.time()
skip_steps = args.skip_steps
while True:
try:
steps += 1
if args.save_and_valid_by_epoch:
suffix = "epoch_" + str(last_epoch)
else:
suffix = "step_" + str(steps)
if steps % skip_steps == 0:
outputs = unimo_seq2seq.evaluate(train_resource, "train", graph_vars)
if args.verbose:
verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size()
verbose += "learning rate: %.8f" % (
outputs["learning_rate"] if warmup_steps > 0 else args.learning_rate)
print(verbose)
if args.in_tokens:
current_example, current_epoch = reader.get_train_progress()
else:
current_epoch = steps * args.batch_size * trainers_num // num_train_examples
current_example = steps * args.batch_size * trainers_num % num_train_examples
time_end = time.time()
used_time = time_end - time_begin
print("epoch: %d, progress: %d/%d, step: %d, loss: %f, "
"ppl: %f, speed: %f steps/s"
% (current_epoch, current_example, num_train_examples,
steps, outputs["loss"], outputs["ppl"],
args.skip_steps / used_time))
time_begin = time.time()
if args.visualdl_log and nccl2_trainer_id == 0:
visuallog_dict = OrderedDict()
visuallog_dict["ppl"] = outputs["ppl"]
visualdl_log(visuallog_dict, outputs["ppl"], steps, phase='train')
else:
train_exe.run(fetch_list=[])
if nccl2_trainer_id >= test_dev_count:
continue
do_save = False
do_eval = False
if not args.save_and_valid_by_epoch:
if steps % args.save_steps == 0 and nccl2_trainer_id == 0:
do_save = True
if steps % args.validation_steps == 0:
do_eval = True
else:
if args.in_tokens:
current_example, current_epoch = reader.get_train_progress()
else:
current_epoch = steps * args.batch_size * trainers_num // num_train_examples
if current_epoch != last_epoch:
if nccl2_trainer_id == 0:
do_save = True
do_eval = True
if do_save:
save_model(suffix=suffix)
if do_eval:
evaluate(suffix=suffix)
if args.save_and_valid_by_epoch:
last_epoch = current_epoch
except fluid.core.EOFException:
save_model(suffix=suffix)
train_pyreader.reset()
break
if nccl2_trainer_id >= test_dev_count:
return
if args.do_val or args.do_test or args.do_pred:
suffix = "output"
if args.do_train:
if not args.save_and_valid_by_epoch:
suffix = "step_" + str(steps)
else:
suffix = "epoch_" + str(last_epoch)
evaluate(suffix=suffix, do_pred=True)
if __name__ == '__main__':
print_arguments(args)
main(args)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Arguments for configuration."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import six
import argparse
def str2bool(v):
"""str to bool"""
# argparse cannot parse strings like "True"/"False" into Python booleans
# directly, so map them explicitly
return v.lower() in ("true", "t", "1")
class ArgumentGroup(object):
"""argument group"""
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, type, default, help, positional_arg=False, **kwargs):
"""add argument"""
prefix = "" if positional_arg else "--"
type = str2bool if type == bool else type
self._group.add_argument(
prefix + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
def print_arguments(args):
"""print arguments"""
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
def inv_arguments(args):
"""inverse arguments"""
print('[Warning] Only keyword argument type is supported.')
args_list = []
for arg, value in sorted(six.iteritems(vars(args))):
args_list.extend(['--' + str(arg), str(value)])
return args_list
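# Usage note: inv_arguments flattens a parsed namespace back into a
# ["--key", "value", ...] list, e.g. to relaunch a subprocess with the
# same configuration; as the warning says, positional args are not handled.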
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""extract eval results"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import os
import re
import argparse
from args import ArgumentGroup
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "stat", "stat configuration")
model_g.add_arg("log_dir", str, None, "stat log dir")
model_g.add_arg("file_name", str, "job.log.0", "key words indentify log file")
model_g.add_arg("final_res_file", str, "final_res.txt", "the file to save final stat score")
args = parser.parse_args()
def extract_res(infile):
"""extract eval results"""
res = []
pattern = re.compile(r'\[\w+_step_\d+ evaluation\]')
with open(infile) as fr:
for line in fr.readlines():
line = line.strip('\r\n')
if pattern.match(line):
res.append(line)
return res
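# Note: the matched lines are emitted by print_eval_log in utils, e.g.
# "[dev_step_1000 evaluation] ave loss: 2.1345, ...". Only the
# "[<phase>_step_<N> evaluation]" prefix is matched; the metric tail varies by task.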
eval_res = {}
log_file = os.path.join(args.log_dir, args.file_name)
if os.path.exists(log_file):
eval_res[args.file_name] = extract_res(log_file)
else:
for sub_dir in os.listdir(args.log_dir):
cur_log_dir = os.path.join(args.log_dir, sub_dir)
log_file = os.path.join(cur_log_dir, args.file_name)
if os.path.exists(log_file):
res = extract_res(log_file)
eval_res[sub_dir] = res
sys.stdout = open(os.path.join(args.log_dir, args.final_res_file), 'w')
for name, all_res in eval_res.items():
print(name)
for val in all_res:
print(val)
print('\n')
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""fp16 mixed precision training"""
from __future__ import print_function
import paddle
import paddle.fluid as fluid
def append_cast_op(i, o, prog):
"""
Append a cast op in a given Program to cast input `i` to data type `o.dtype`.
Args:
i (Variable): The input Variable.
o (Variable): The output Variable.
prog (Program): The Program to append cast op.
"""
prog.global_block().append_op(
type="cast",
inputs={"X": i},
outputs={"Out": o},
attrs={"in_dtype": i.dtype,
"out_dtype": o.dtype})
def copy_to_master_param(p, block):
"""copy to master"""
v = block.vars.get(p.name, None)
if v is None:
raise ValueError("no param name %s found!" % p.name)
new_p = fluid.framework.Parameter(
block=block,
shape=v.shape,
dtype=fluid.core.VarDesc.VarType.FP32,
type=v.type,
lod_level=v.lod_level,
stop_gradient=p.stop_gradient,
trainable=p.trainable,
optimize_attr=p.optimize_attr,
regularizer=p.regularizer,
gradient_clip_attr=p.gradient_clip_attr,
error_clip=p.error_clip,
name=v.name + ".master")
return new_p
def apply_dynamic_loss_scaling(loss_scaling, master_params_grads,
incr_every_n_steps, decr_every_n_nan_or_inf,
incr_ratio, decr_ratio):
"""dynamix loss scaling"""
_incr_every_n_steps = fluid.layers.fill_constant(
shape=[1], dtype='int32', value=incr_every_n_steps)
_decr_every_n_nan_or_inf = fluid.layers.fill_constant(
shape=[1], dtype='int32', value=decr_every_n_nan_or_inf)
_num_good_steps = fluid.layers.create_global_var(
name=fluid.unique_name.generate("num_good_steps"),
shape=[1],
value=0,
dtype='int32',
persistable=True)
_num_bad_steps = fluid.layers.create_global_var(
name=fluid.unique_name.generate("num_bad_steps"),
shape=[1],
value=0,
dtype='int32',
persistable=True)
grads = [fluid.layers.reduce_sum(g) for [_, g] in master_params_grads]
all_grads = fluid.layers.concat(grads)
all_grads_sum = fluid.layers.reduce_sum(all_grads)
is_overall_finite = fluid.layers.isfinite(all_grads_sum)
update_loss_scaling(is_overall_finite, loss_scaling, _num_good_steps,
_num_bad_steps, _incr_every_n_steps,
_decr_every_n_nan_or_inf, incr_ratio, decr_ratio)
# apply_gradient append all ops in global block, thus we shouldn't
# apply gradient in the switch branch.
with fluid.layers.Switch() as switch:
with switch.case(is_overall_finite):
pass
with switch.default():
for _, g in master_params_grads:
fluid.layers.assign(fluid.layers.zeros_like(g), g)
def create_master_params_grads(params_grads, main_prog, startup_prog, loss_scaling):
"""create master params grads"""
master_params_grads = []
for p, g in params_grads:
with main_prog._optimized_guard([p, g]):
# create master parameters
# cast fp16->fp32 in main_prog
master_param = copy_to_master_param(p, main_prog.global_block())
startup_master_param = startup_prog.global_block()._clone_variable(
master_param)
startup_p = startup_prog.global_block().var(p.name)
# cast fp16->fp32 in startup_prog
append_cast_op(startup_p, startup_master_param, startup_prog)
# cast fp16 gradients to fp32 before apply gradients
if g.name.find("layer_norm") > -1:
scaled_g = g / loss_scaling
master_params_grads.append([p, scaled_g])
continue
master_grad = fluid.layers.cast(g, "float32")
master_grad = master_grad / loss_scaling
master_params_grads.append([master_param, master_grad])
return master_params_grads
def master_param_to_train_param(master_params_grads, params_grads, main_prog):
"""convert master param to train param"""
for idx, m_p_g in enumerate(master_params_grads):
train_p, _ = params_grads[idx]
# if train_p.name.find("layer_norm") > -1:
# continue
with main_prog._optimized_guard([m_p_g[0], m_p_g[1]]):
# append_cast_op(m_p_g[0], train_p, main_prog)
if train_p.name.find("layer_norm") > -1:
fluid.layers.assign(m_p_g[0], train_p)
else:
append_cast_op(m_p_g[0], train_p, main_prog)
def update_loss_scaling(is_overall_finite, prev_loss_scaling, num_good_steps,
num_bad_steps, incr_every_n_steps,
decr_every_n_nan_or_inf, incr_ratio, decr_ratio):
"""
Update loss scaling according to the overall gradients. If all gradients are
finite for incr_every_n_steps consecutive steps, loss scaling increases by
incr_ratio. Otherwise, loss scaling decreases by decr_ratio once
decr_every_n_nan_or_inf steps with nan or inf gradients have accumulated.
Args:
is_overall_finite (Variable): A boolean variable indicates whether
all gradients are finite.
prev_loss_scaling (Variable): Previous loss scaling.
num_good_steps (Variable): A variable accumulates good steps in which
all gradients are finite.
num_bad_steps (Variable): A variable accumulates bad steps in which
some gradients are infinite.
incr_every_n_steps (Variable): A variable represents increasing loss
scaling every n consecutive steps with
finite gradients.
decr_every_n_nan_or_inf (Variable): A variable represents decreasing
loss scaling every n accumulated
steps with nan or inf gradients.
incr_ratio(float): The multiplier to use when increasing the loss
scaling.
decr_ratio(float): The less-than-one-multiplier to use when decreasing
loss scaling.
"""
zero_steps = fluid.layers.fill_constant(shape=[1], dtype='int32', value=0)
with fluid.layers.Switch() as switch:
with switch.case(is_overall_finite):
should_incr_loss_scaling = fluid.layers.less_than(
incr_every_n_steps, num_good_steps + 1)
with fluid.layers.Switch() as switch1:
with switch1.case(should_incr_loss_scaling):
new_loss_scaling = prev_loss_scaling * incr_ratio
loss_scaling_is_finite = fluid.layers.isfinite(
new_loss_scaling)
with fluid.layers.Switch() as switch2:
with switch2.case(loss_scaling_is_finite):
fluid.layers.assign(new_loss_scaling,
prev_loss_scaling)
with switch2.default():
pass
fluid.layers.assign(zero_steps, num_good_steps)
fluid.layers.assign(zero_steps, num_bad_steps)
with switch1.default():
fluid.layers.increment(num_good_steps)
fluid.layers.assign(zero_steps, num_bad_steps)
with switch.default():
should_decr_loss_scaling = fluid.layers.less_than(
decr_every_n_nan_or_inf, num_bad_steps + 1)
with fluid.layers.Switch() as switch3:
with switch3.case(should_decr_loss_scaling):
new_loss_scaling = prev_loss_scaling * decr_ratio
static_loss_scaling = \
fluid.layers.fill_constant(shape=[1],
dtype='float32',
value=1.0)
less_than_one = fluid.layers.less_than(new_loss_scaling,
static_loss_scaling)
with fluid.layers.Switch() as switch4:
with switch4.case(less_than_one):
fluid.layers.assign(static_loss_scaling,
prev_loss_scaling)
with switch4.default():
fluid.layers.assign(new_loss_scaling,
prev_loss_scaling)
fluid.layers.assign(zero_steps, num_good_steps)
fluid.layers.assign(zero_steps, num_bad_steps)
with switch3.default():
fluid.layers.assign(zero_steps, num_good_steps)
fluid.layers.increment(num_bad_steps)
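# Illustrative trace (hypothetical hyper-parameters): with
# incr_every_n_steps=1000, incr_ratio=2.0, decr_every_n_nan_or_inf=2 and
# decr_ratio=0.5, a loss scaling of 128.0 grows to 256.0 after 1000
# consecutive finite-gradient steps, while two accumulated overflow steps
# halve it to 64.0; the less_than_one guard keeps it from dropping below 1.0.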
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""init model params"""
from __future__ import print_function
import os
import six
import ast
import copy
import numpy as np
import paddle.fluid as fluid
def cast_fp32_to_fp16(exe, main_program):
"""cast fp32 to fp16"""
print("Cast parameters to float16 data format.")
for param in main_program.global_block().all_parameters():
if not param.name.endswith(".master"):
#load fp16
param_t = fluid.global_scope().find_var(param.name).get_tensor()
data = np.array(param_t)
if param.name.startswith("encoder_layer") \
and "layer_norm" not in param.name:
print(param.name)
param_t.set(np.float16(data).view(np.uint16), exe.place)
#load fp32
master_param_var = fluid.global_scope().find_var(param.name +
".master")
if master_param_var is not None:
master_param_var.get_tensor().set(data, exe.place)
def init_checkpoint(exe, init_checkpoint_path, main_program):
"""init model checkpoint"""
assert os.path.exists(
init_checkpoint_path), "[%s] cannot be found." % init_checkpoint_path
def existed_persistables(var):
if not fluid.io.is_persistable(var):
return False
return os.path.exists(os.path.join(init_checkpoint_path, var.name))
fluid.io.load_vars(
exe,
init_checkpoint_path,
main_program=main_program,
predicate=existed_persistables)
print("Load model from {}".format(init_checkpoint_path))
def init_pretraining_params(exe,
pretraining_params_path,
main_program):
"""init pretraining params"""
assert os.path.exists(pretraining_params_path
), "[%s] cannot be found." % pretraining_params_path
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(pretraining_params_path, var.name))
fluid.io.load_vars(
exe,
pretraining_params_path,
main_program=main_program,
predicate=existed_params)
print("Load pretraining parameters from {}.".format(
pretraining_params_path))
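# Note: init_checkpoint restores every persistable variable (parameters plus
# optimizer state) to resume training, whereas init_pretraining_params loads
# only model Parameters, which is what warm-starting finetuning from a
# pre-trained UNIMO model requires.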
def init_model(args, exe, startup_prog):
"""init model params"""
init_func, init_path = None, None
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
print(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid.")
if args.init_checkpoint:
init_func = init_checkpoint
init_path = args.init_checkpoint
elif args.init_pretraining_params:
init_func = init_pretraining_params
init_path = args.init_pretraining_params
elif args.do_val or args.do_test or args.do_pred:
init_path = args.init_checkpoint or args.init_pretraining_params
if not init_path:
raise ValueError("args 'init_checkpoint' should be set if"
"only doing validation or testing!")
init_func = init_checkpoint
if init_path:
init_func(exe, init_path, main_program=startup_prog)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimization and learning rate scheduling."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle.fluid as fluid
from utils.fp16 import create_master_params_grads, master_param_to_train_param, apply_dynamic_loss_scaling
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
""" Applies linear warmup of learning rate from 0 and decay to 0."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
value=0.0,
dtype='float32',
persistable=True,
name="scheduled_learning_rate")
global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()
with fluid.layers.control_flow.Switch() as switch:
with switch.case(global_step < warmup_steps):
warmup_lr = learning_rate * (global_step / warmup_steps)
fluid.layers.tensor.assign(warmup_lr, lr)
with switch.default():
decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
learning_rate=learning_rate,
decay_steps=num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
fluid.layers.tensor.assign(decayed_lr, lr)
return lr
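# Schedule sketch (hypothetical values): with learning_rate=5e-5,
# warmup_steps=4000 and num_train_steps=40000, lr climbs linearly from 0
# toward 5e-5 over the first 4000 steps, then polynomial_decay with
# power=1.0 takes it linearly back to 0 by step 40000.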
def optimization(loss,
warmup_steps,
num_train_steps,
learning_rate,
train_program,
weight_decay,
scheduler='linear_warmup_decay',
use_fp16=False,
use_dynamic_loss_scaling=False,
init_loss_scaling=1.0,
beta1=0.9,
beta2=0.98,
epsilon=1e-06,
boundaries=None,
values=None):
"""optimization funxtion"""
def exclude_from_weight_decay(name):
"""exclude from weight decay"""
# str.rstrip strips a set of characters, not a suffix; remove the
# ".master" suffix explicitly
if name.endswith('.master'):
name = name[:-len('.master')]
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
if warmup_steps > 0:
if scheduler == 'noam_decay':
scheduled_lr = fluid.layers.learning_rate_scheduler \
.noam_decay(1 / (warmup_steps * (learning_rate ** 2)),
warmup_steps)
elif scheduler == 'linear_warmup_decay':
scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps,
num_train_steps)
elif scheduler == 'scale_by_epoch_decay':
if boundaries is None:
boundaries = [10000, 20000]
if values is None:
values = [5e-6, 5e-7, 5e-8]
scheduled_lr = fluid.layers.piecewise_decay(boundaries=boundaries, values=values)
else:
raise ValueError("Unkown learning rate scheduler, should be "
"'noam_decay' or 'linear_warmup_decay'")
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr, beta1=beta1, beta2=beta2, epsilon=epsilon)
else:
scheduled_lr = fluid.layers.create_global_var(
name=fluid.unique_name.generate("learning_rate"),
shape=[1],
value=learning_rate,
dtype='float32',
persistable=True)
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr, beta1=beta1, beta2=beta2, epsilon=epsilon)
optimizer._learning_rate_map[fluid.default_main_program()] = scheduled_lr
if use_fp16:
optimizer = fluid.contrib.mixed_precision.decorator.decorate(optimizer,
amp_lists=fluid.contrib.mixed_precision.AutoMixedPrecisionLists(
custom_black_varnames={'loss'},
custom_black_list={'layer_norm', 'arg_max', 'argmax'}),
init_loss_scaling=init_loss_scaling,
use_dynamic_loss_scaling=use_dynamic_loss_scaling)
loss_scaling = optimizer.get_loss_scaling()
else:
loss_scaling = None
fluid.clip.set_gradient_clip(
clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0))
param_list = dict()
for param in train_program.global_block().all_parameters():
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
_, param_grads = optimizer.minimize(loss)
if weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param.name):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
return scheduled_lr, loss_scaling
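# Note on weight decay: the loop above implements decoupled (AdamW-style)
# weight decay, i.e. param -= prev_param * weight_decay * scheduled_lr is
# applied after the Adam update rather than folding an L2 term into the
# gradients; layer norm and bias parameters are skipped via
# exclude_from_weight_decay.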
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""results statistics"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import os
import numpy as np
import argparse
from args import ArgumentGroup
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "stat", "stat configuration")
model_g.add_arg("log_dir", str, None, "stat log dir")
model_g.add_arg("random_slot", int, 2, "random slot in log file")
model_g.add_arg("key_words", str, "lanch.log", "key words indentify log file")
model_g.add_arg("line_prefix", str, "Best validation result:", "key words indentify final score to stat")
model_g.add_arg("score_slot", int, -1, "score slot in stat line")
model_g.add_arg("final_res_file", str, "final_res.txt", "the file to save final stat score")
args = parser.parse_args()
def get_res(infile):
"""get results"""
acc = 0
with open(infile) as fr:
for line in fr.readlines():
line = line.strip('\r\n')
if line.startswith(args.line_prefix):
acc = float(line.split(' ')[args.score_slot])
return acc
def print_stat(score_files):
"""print statistics"""
nums = len(score_files)
max_score, max_score_file = score_files[0]
min_score, min_score_file = score_files[-1]
median_score, median_score_file = score_files[int(nums / 2)]
mean_score = np.average([s for s, f in score_files])
log = 'tot_random_seed %d\nmax_score %f max_file %s\nmin_score %f min_file %s' \
'\nmedian_score %f median_file %s\navg_score %f' % \
(nums, max_score, max_score_file, min_score, min_score_file,
median_score, median_score_file, mean_score)
print(log)
score_file = {}
for file in os.listdir(args.log_dir):
if args.key_words in file:
randint = file.split('_')[args.random_slot]
acc = get_res(os.path.join(args.log_dir, file))
if randint in score_file:
score_file[randint].append((acc, file))
else:
score_file[randint] = [(acc, file)]
best_randint_score_file = []
for randint, s_f in score_file.items():
sorted_s_f = sorted(s_f, key=lambda a: a[0], reverse=True)
best_randint_score_file.append(sorted_s_f[0])
best_randint_score_file = sorted(best_randint_score_file, key=lambda a: a[0], reverse=True)
sys.stdout = open(os.path.join(args.log_dir, args.final_res_file), 'w')
print_stat(best_randint_score_file)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""utils"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import time
def visualdl_log(metrics_output, train_loss, steps, phase):
"""log visualization
"""
print("{phase} log: steps {steps}, loss {loss}, metrics: {metrics}".format(
phase=phase, steps=steps, loss=train_loss, metrics=metrics_output))
def print_eval_log(ret):
"""print log"""
prefix_log = "[%s evaluation] ave loss: %.4f," % (ret['phase'], ret['loss'])
postfix_log = "data_num: %d, elapsed time: %.4f s" % (ret['data_num'], ret['used_time'])
mid_log = " "
for k, v in ret.items():
if k not in ['phase', 'loss', 'data_num', 'used_time', 'key_eval']:
mid_log = mid_log + "%s: %.4f, " % (k, round(v, 4))
log = prefix_log + mid_log + postfix_log
print(log)
def get_time():
"""get time"""
res = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
return res
#!/bin/bash
function check_iplist() {
if [ ${iplist:-} ]; then
export PADDLE_PSERVER_PORT=9184
export PADDLE_TRAINER_IPS=${iplist}
export PADDLE_CURRENT_IP=`hostname -i`
iparray=(${iplist//,/ })
for i in "${!iparray[@]}"; do
if [ ${iparray[$i]} == ${PADDLE_CURRENT_IP} ]; then
export PADDLE_TRAINER_ID=$i
fi
done
export TRAINING_ROLE=TRAINER
export PADDLE_INIT_TRAINER_COUNT=${#iparray[@]}
export PADDLE_PORT=${PADDLE_PSERVER_PORT}
export PADDLE_TRAINERS=${PADDLE_TRAINER_IPS}
export POD_IP=${PADDLE_CURRENT_IP}
export PADDLE_TRAINERS_NUM=${PADDLE_INIT_TRAINER_COUNT}
export PADDLE_IS_LOCAL=0
export GLOG_v=0
export GLOG_logtostderr=1
export NCCL_DEBUG=INFO
export NCCL_IB_GID_INDEX=3
fi
}
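# Minimal usage sketch (hypothetical two-node setup; the IPs and script
# path are placeholders):
#   export iplist=10.0.0.1,10.0.0.2
#   source ./env.sh && check_iplist
# Each node then derives PADDLE_TRAINER_ID from its own position in iplist.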