Unverified commit 69a9e2fa authored by tangjiji and committed by GitHub

fix paddle install (#574)

* add ernie-vil

* fix requirements.txt

* Update README.md

* Update requirements.txt

* Update requirements.txt

* Update requirements.txt

* Update requirements.txt
Parent da820fc1
English| [简体中文](./README_zh.md)
## _ERNIE-ViL_: Knowledge Enhanced Vision-Language Representations Through Scene Graph
- [Framework](#framework)
- [Pre-trained models](#pre-trained-models)
- [Downstream tasks](#downstream-tasks)
* [VCR](#VCR)
- [Usage](#usage)
* [Install PaddlePaddle](#install-paddlepaddle)
* [Fine-tuning on ERNIE-ViL](#fine-tuning-on-ernie-vil)
* [Inference](#inference)
- [Citation](#citation)
For a technical description of the algorithm, please see our paper:
>[_**ERNIE-ViL:Knowledge Enhanced Vision-Language Representations Through Scene Graph**_](https://arxiv.org/abs/2006.16934)
>
>Fei Yu\*, Jiji Tang\*, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
>
>Preprint June 2020
>
![ERNIE-ViL](https://img.shields.io/badge/Pretraining-vision_and_language_joint_representions-green)
![VQA](https://img.shields.io/badge/VQA-Visual_Question_Answering-yellow)
![VCR](https://img.shields.io/badge/VCR-Visual_Commonsense_Reasoning-blue) ![RefCOCO+](https://img.shields.io/badge/RefCOCO+-Region_to_Phrase_Grounding-green)
![IRTR](https://img.shields.io/badge/IR_&TR-Image_Retrieval&_Text_Retrieval-yellowgreen)
**[ERNIE-ViL](https://arxiv.org/abs/2006.16934) provides knowledge-enhanced joint representations for vision-language tasks** and is the first work to **introduce structured knowledge to enhance vision-language pre-training**. Utilizing structured knowledge obtained
from scene graphs, ERNIE-ViL constructs three **Scene Graph Prediction tasks**, i.e., **Object Prediction**, **Attribute Prediction** and **Relationship Prediction**.
ERNIE-ViL can thus learn better joint vision-language representations that characterize the alignment of detailed semantics across vision and language.
## Framework
Based on the scene graph parsed from the text using a Scene Graph Parser, we construct Object Prediction, Attribute Prediction and Relationship Prediction tasks (a toy masking sketch follows the figure below):
- **Object Prediction:** We randomly select a set of the objects in the scene graph, then mask and predict the corresponding words in the sentence.
- **Attribute Prediction:** For the object-attribute pairs in the scene graph, we randomly select a part of them to mask and predict the words related to the attribute nodes in the sentence.
- **Relationship Prediction:** For the object-relationship-object triplets in the scene graph, we randomly select a part of the relationship nodes to mask and predict them.
![ernie_vil_struct](.meta/ernie_vil_struct.png)
**Model Architecture of ERNIE-ViL**
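The toy sketch below (not part of the released code) illustrates the masking idea behind the three tasks: given a caption and the object, attribute and relationship words extracted from its scene graph, the corresponding tokens are masked so that the model has to recover them. The scene-graph nodes here are written by hand purely for illustration; in ERNIE-ViL they are produced automatically by the Scene Graph Parser.
```python
import random

# Hand-written toy scene graph for the caption below (illustration only).
caption = "a black dog chasing a small white cat".split()
objects = ["dog", "cat"]
attributes = ["black", "small", "white"]
relationships = ["chasing"]

def mask_nodes(tokens, node_words, mask_ratio=0.5, mask_token="[MASK]"):
    """Randomly mask a subset of scene-graph node words in the caption."""
    chosen = {w for w in node_words if random.random() < mask_ratio}
    return [mask_token if t in chosen else t for t in tokens]

print("Object Prediction:      ", mask_nodes(caption, objects))
print("Attribute Prediction:   ", mask_nodes(caption, attributes))
print("Relationship Prediction:", mask_nodes(caption, relationships))
```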
## Pre-trained Models
ERNIE-ViL adopts large-scale image-text aligned datasets as the pre-training data. We provide ERNIE-ViL models of two scale settings which are pretrained on [**Conceptual Captions**](https://www.aclweb.org/anthology/P18-1238.pdf) and [**SBU Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio).
- [**ERNIE-ViL _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-en.1.tar.gz) (_lowercased | 12-text-stream-layer, 6-visual-stream-layer_)
- [**ERNIE-ViL _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)
## Downstream tasks
We fine-tune ERNIE-ViL on five vision-language downstream tasks, i.e., Visual Commonsense Reasoning([**VCR**](https://openaccess.thecvf.com/content_CVPR_2019/papers/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.pdf)),
Visual Question Answering([**VQA**](https://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf)),
Cross-modal Image Retrieval([**IR**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166)),
Cross-modal Text Retrieval([**TR**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166)) and
Region-to-Phrase Grounding([**RefCOCO+**](https://www.aclweb.org/anthology/D14-1086.pdf)).
_Code and pre-trained models for the VCR task are now publicly available; those for more downstream tasks are planned to be released._
### VCR
* datasets
    * The training, validation and test data of the VCR task are provided by the [**VCR website**](https://visualcommonsense.com/download/).
    * The organization of visual features follows [**ViLBERT**](https://github.com/jiasenlu/vilbert_beta); we directly use its data, which can be downloaded [here](https://github.com/jiasenlu/vilbert_beta/tree/master/data).
    * Put all downloaded files under the directory "data/vcr".
* Task pre-training: We perform task pre-training on the VCR task, also known as task-specific pre-training. The trained models are as follows:
* [**ERNIE-ViL-VCR-task-pretrain _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-VCR-task-pre-en.1.tar.gz)
* [**ERNIE-ViL-VCR-task-pretrain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-VCR-task-pre-en.1.tar.gz)
* Performance: Results of ERNIE-ViL on the VCR task, compared with the previous state-of-the-art pre-trained model ([**VILLA**](https://arxiv.org/pdf/2006.06195.pdf)).
| Models | <strong>Q->A</strong> | <strong>QA->R</strong> | <strong>Q->AR</strong> |
| :--------------------------------------| :---------------------------: | :----------------------------: | :-----------------------------: |
| VILLA (task-pretrain) _base_ | 75.54(76.4) | 78.78(79.1) | 59.75(60.6) |
| ERNIE-ViL (task-pretrain) _base_ | 76.37(77.0) | 79.65(80.3) | 61.24(62.1) |
| VILLA (task-pretrain) _large_ | 78.45(78.9) | 82.57(82.8) | 65.18(65.7) |
| ERNIE-ViL (task-pretrain) _large_    | <strong>78.52(79.2)</strong>  | <strong>83.37(83.5)</strong>   | <strong>65.81(66.3)</strong>    |
_Numbers outside and inside parentheses denote the dev and test performance on VCR, respectively.
Test results are obtained from the [**VCR leaderboard**](https://visualcommonsense.com/leaderboard/)._
## Usage
### Install PaddlePaddle
This code has been tested with Paddle Fluid 1.8 under Python 2.7. Other dependencies of ERNIE-ViL are listed in `requirements.txt`; you can install them by
```script
pip install -r requirements.txt
```
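If you want to verify the installation before fine-tuning, a quick sanity check such as the following can be used (it assumes a PaddlePaddle 1.8.x build is installed; `fluid.install_check.run_check()` is the installation self-test shipped with Paddle 1.x):
```python
import paddle
import paddle.fluid as fluid

print(paddle.__version__)        # expected to start with "1.8"
fluid.install_check.run_check()  # Paddle's built-in installation self-test
```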
### Fine-tuning on ERNIE-ViL
Please add the CUDA, cuDNN and NCCL2 library paths to LD_LIBRARY_PATH before fine-tuning. You can easily run fine-tuning through
configuration files. For example, you can fine-tune the ERNIE-ViL model on the VCR task by
```script
sh run_finetuning.sh vcr conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $pretrain_models
```
The files needed for fine-tuning can be found in the download links given above, including the vocabulary file, the configuration
file and the pre-trained parameters. Note that our fine-tuning experiments on VCR were carried out on 4 NVIDIA V100 (32GB) GPUs.
If your GPU memory is not enough, you can reduce the batch size in the corresponding configuration file, e.g., "conf/vcr/model_conf_vcr".
### Inference
You can use the following commands to run inference with fine-tuned models. For example, you can evaluate the VCR models on the different sub-tasks as follows:
**Task Q->A**
```script
sh run_inference.sh vcr qa $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```
**Task QA->R**
```script
sh run_inference.sh vcr qar $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```
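Each line of `$res_file` written by the inference script contains the question id, the predicted choice, the gold label and the softmax scores, separated by tabs (see the `format_result` helper in the fine-tuning code). A minimal sketch for re-computing the accuracy from such a file; the path below is just the default value of `--result_file` and is only an example:
```python
def accuracy_from_result_file(path):
    """Recompute accuracy from a tab-separated result file (qid, pred, label, scores)."""
    total, correct = 0, 0
    with open(path) as f:
        for line in f:
            qid, pred, label, _scores = line.rstrip("\n").split("\t")
            total += 1
            correct += int(pred == label)
    return float(correct) / total

print(accuracy_from_result_file("./res_tmp"))
```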
## Citation
You can cite the paper as below:
```
@article{yu2020ernie,
title={ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph},
author={Yu, Fei and Tang, Jiji and Yin, Weichong and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2006.16934},
year={2020}
}
```
[English](./README.md) | Simplified Chinese
## _ERNIE-ViL_: Knowledge Enhanced Vision-Language Representations Through Scene Graph
- [Framework](#framework)
- [Pre-trained Models](#pre-trained-models)
- [Downstream Tasks](#downstream-tasks)
  * [Visual Commonsense Reasoning](#visual-commonsense-reasoning)
- [Usage](#usage)
  * [Install PaddlePaddle](#install-paddlepaddle)
  * [Fine-tuning](#fine-tuning)
  * [Inference](#inference)
- [Citation](#citation)
For a detailed description of the algorithm, please refer to our paper:
>[_**ERNIE-ViL:Knowledge Enhanced Vision-Language Representations Through Scene Graph**_](https://arxiv.org/abs/2006.16934)
>
>Fei Yu\*, Jiji Tang\*, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
>
>Preprint June 2020
>
![ERNIE-ViL](https://img.shields.io/badge/预训练-视觉语言联合表示-green)![VQA](https://img.shields.io/badge/视觉问答-VQA-yellow) ![VCR](https://img.shields.io/badge/视觉常识推理-VCR-blue) ![RefCOCO](https://img.shields.io/badge/引用表达式理解-RefCOCO+-green) ![IRTR](https://img.shields.io/badge/跨模态检索-IR&TR-yellowgreen)
---
**ERNIE-ViL is a knowledge-enhanced pre-training framework for vision-language tasks** and is the first work to introduce structured knowledge into vision-language pre-training. Utilizing the structured knowledge in scene graphs, ERNIE-ViL constructs three pre-training tasks, **Object Prediction**, **Attribute Prediction** and **Relationship Prediction**, which precisely characterize the alignment of fine-grained semantics between vision and language and thus yield better joint vision-language representations.
## Framework
Based on the scene graph parsed from the text, ERNIE-ViL proposes three multimodal scene graph prediction tasks:
- **Object Prediction**: randomly select a set of objects in the scene graph, then mask and predict the corresponding words in the sentence;
- **Attribute Prediction**: for the attribute-object pairs in the scene graph, randomly select a part of them and mask and predict the attribute words in the sentence;
- **Relationship Prediction**: for the object-relationship-object triplets in the scene graph, mask and predict the relationship words.
![ernie_vil_struct](.meta/ernie_vil_struct.png)
Structure of the ERNIE-ViL scene graph pre-training tasks
## Pre-trained Models
ERNIE-ViL uses large-scale image-text aligned datasets as pre-training data. Two models of different sizes, trained on the [**Conceptual Captions**](https://www.aclweb.org/anthology/P18-1238.pdf) and [**SBU Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio) datasets, are released:
- [**ERNIE-ViL _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-en.1.tar.gz) (_lowercased | 12-text-stream-layer, 6-visual-stream-layer_)
- [**ERNIE-ViL _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)
## Downstream Tasks
ERNIE-ViL has been evaluated on five vision-language downstream tasks: [**Visual Commonsense Reasoning**](https://openaccess.thecvf.com/content_CVPR_2019/papers/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.pdf),
[**Visual Question Answering**](https://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf),
[**Cross-modal Image Retrieval**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166),
[**Cross-modal Text Retrieval**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166) and
[**Region-to-Phrase Grounding**](https://www.aclweb.org/anthology/D14-1086.pdf).
_Currently only the model and code for the Visual Commonsense Reasoning task are open-sourced; models and code for more downstream tasks are planned to be released._
### **Visual Commonsense Reasoning**
* Datasets
    * The training, validation and test data are provided by the [**VCR website**](http://visualcommonsense.com/download/);
    * The organization of visual features follows [**ViLBERT**](https://github.com/jiasenlu/vilbert_beta), so this project directly uses the data from **ViLBERT**, which can be downloaded [here](https://github.com/jiasenlu/vilbert_beta/tree/master/data);
    * Put all downloaded files under the directory data/vcr;
* Task pre-training: task pre-training is performed on the VCR task; the resulting models are
* [**ERNIE-ViL-VCR-task-pretrain _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-VCR-task-pre-en.1.tar.gz)
* [**ERNIE-ViL-VCR-task-pretrain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-VCR-task-pre-en.1.tar.gz)
* Performance: the comparison between ERNIE-ViL and the previous state-of-the-art pre-trained model [**VILLA**](https://arxiv.org/pdf/2006.06195.pdf) on Visual Commonsense Reasoning is as follows:
| Models                              |  <strong>Q->A</strong>        | <strong>QA->R</strong>         | <strong>Q->AR</strong>        |
| :---------------------------------- | :---------------------------: | :----------------------------: | :---------------------------: |
| VILLA (task-pretrain) _base_ | 75.54(76.4) | 78.78(79.1) | 59.75(60.6) |
| ERNIE-ViL (task-pretrain) _base_ | 76.37(77.0) | 79.65(80.3) | 61.24(62.1) |
| VILLA (task-pretrain) _large_ | 78.45(78.9) | 82.57(82.8) | 65.18(65.7) |
| ERNIE-ViL (task-pretrain) _large_   | <strong>78.52(79.2)</strong>  | <strong>83.37(83.5)</strong>   | <strong>65.81(66.3)</strong>  |
_Note: numbers outside parentheses are dev set results and numbers inside parentheses are test set results; test results are taken from the [VCR leaderboard](https://visualcommonsense.com/leaderboard/)._
## Usage
### Install PaddlePaddle
The ERNIE-ViL code has been tested with Paddle Fluid 1.8 and Python 2.7. The other dependencies are listed in requirements.txt and can be installed with:
```script
pip install -r requirements.txt
```
### Fine-tuning
Before running ERNIE-ViL, add the CUDA, cuDNN and NCCL2 dynamic library paths to LD_LIBRARY_PATH. The parameter configuration files of the downstream tasks are placed under conf/, so fine-tuning can be run simply through a configuration file. For example, you can fine-tune on the VCR task with:
```script
sh run_finetuning.sh vcr conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $pretrain_models_params
```
The model links given above contain all the required files, including the vocabulary file, the configuration file and the pre-trained parameters. The VCR fine-tuning experiments were run on 4 NVIDIA V100 (32 GB) GPUs; if your GPU memory is not enough, you can run on eight cards or reduce the batch_size in the configuration.
_We currently release the pre-trained models and the VCR task code; other downstream tasks can be tried by referring to this task's implementation._
### Inference
Based on an already fine-tuned model, you can evaluate VCR with the following commands:
**Task Q->A**
```script
sh run_inference.sh vcr qa $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```
**Task QA->R**
```script
sh run_inference.sh vcr qar $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```
The VCR evaluation can be run on a single 32 GB NVIDIA V100 GPU. The reported results cover the Q->A, QA->R and Q->AR tasks, where the Q->AR result is obtained by merging the results of the former two tasks, as sketched below.
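A minimal sketch of that merge, assuming the two inference runs wrote result files in the tab-separated `qid  prediction  label  scores` format produced by the `format_result` helper in the fine-tuning code, and that both runs share the same question ids; the file names below are placeholders:
```python
def load_correct(path):
    """Map each question id to whether the prediction matches the label."""
    correct = {}
    with open(path) as f:
        for line in f:
            qid, pred, label, _scores = line.rstrip("\n").split("\t")
            correct[qid] = (pred == label)
    return correct

qa = load_correct("res_qa")    # placeholder name for the Q->A result file
qar = load_correct("res_qar")  # placeholder name for the QA->R result file
joint = [qa[qid] and qar[qid] for qid in qa if qid in qar]
print("Q->AR accuracy:", float(sum(joint)) / len(joint))
```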
## Citation
You can cite the paper as below:
```
@article{yu2020ernie,
title={ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph},
author={Yu, Fei and Tang, Jiji and Yin, Weichong and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2006.16934},
year={2020}
}
```
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" args defination and default value """
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import argparse
from utils.args import ArgumentGroup, print_arguments
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("ernie_config_path", str, "./config/ernie_config.json", "json file path for ernie model config.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
model_g.add_arg("task_name", str, "vcr", "Task to finetune on ERNIE-ViL")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 100, "Number of epoches for training.")
train_g.add_arg("learning_rate", float, 0.0001, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay', 'manual_warmup_decay'])
train_g.add_arg("decay_steps", str, "", "learning rate decay steps, list with ;")
train_g.add_arg("lr_decay_ratio", float, 0.1, "learning rate decay ratio, used with manual_warmup_decay")
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("num_train_steps", int, 1000000, "Total steps to perform pretraining.")
train_g.add_arg("warmup_steps", int, 0, "Total steps to perform warmup when pretraining.")
train_g.add_arg("save_steps", int, 100, "The steps interval to save checkpoints.")
train_g.add_arg("validation_steps", int, 6000, "The steps interval to evaluate model performance.")
train_g.add_arg("use_fuse", bool, False, "Whether to use fuse_allreduce_ops.")
train_g.add_arg("nccl_comm_num", int, 1, "NCCL comm num.")
train_g.add_arg("hierarchical_allreduce_inter_nranks", int, 8, "Hierarchical allreduce inter ranks.")
train_g.add_arg("use_hierarchical_allreduce", bool, False, "Use hierarchical allreduce or not.")
train_g.add_arg("use_gpu", bool, True, "Whether to gpu.")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("result_file", str, "./res_tmp", "file to storage results")
data_g.add_arg("lr_decay_dict_file", str, "", "learning rate decay files.")
data_g.add_arg("train_filelist", str, "", "Path to training filelist.")
data_g.add_arg("valid_filelist", str, "", "Path to valid filelist.")
data_g.add_arg("test_filelist", str, "", "Path to test filelist.")
data_g.add_arg("vocab_path", str, "./config/vocab.txt", "Vocabulary path.")
data_g.add_arg("test_split", str, "val", "test of sub tasks, val or test")
data_g.add_arg("max_seq_len", int, 128, "Number of words of the longest seqence.")
data_g.add_arg("max_img_len", int, 100, "Number of image rois of the longest seqence.")
data_g.add_arg("feature_size", int, 2048, "Number of roi feature size of image.")
data_g.add_arg("fusion_method", str, "sum", "Number of roi feature size of image.")
data_g.add_arg("batch_size", int, 16, "Total examples' number in batch for training. see also --in_tokens.")
data_g.add_arg("task_group_json", str, "", "Path to task json")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("do_train", bool, False, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("do_test", bool, False, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("output_file", str, "", "The output file to save model output.")
# yapf: enable
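# A minimal usage sketch (illustrative, not part of the original file): the
# parser defined above can be exercised with an explicit argument list, which
# is a convenient way to inspect the effective defaults; the values below are
# placeholders only.
if __name__ == "__main__":
    demo_args = parser.parse_args([
        "--task_name", "vcr",
        "--batch_size", "16",
        "--learning_rate", "2e-5",
    ])
    print_arguments(demo_args)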
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" prepare data format for finetuning tasks """
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
from six.moves import xrange
def prepare_batch_data(batch_records, num_choice, pad_id, task_index, task_num):
"""
prepare batch data for finetuning tasks
"""
batch_input_ids = []
batch_input_pos = []
batch_seg_ids = []
batch_input_masks = []
num_sample = len(batch_records)
batch_lens = [record["input_lens"] for record in batch_records]
batch_labels = [record["target"] for record in batch_records]
binary_labels = np.zeros([num_choice * num_sample, 1], dtype='float32')
for i, l in enumerate(batch_labels):
binary_labels[i * num_choice + l] = 1.0
labels = np.array(batch_labels).astype("int64").reshape([-1, 1])
image_features = [record["features"] for record in batch_records]
image_boxes = [record["boxes"] for record in batch_records]
batch_anno_ids = np.array([record["anno_id"] for record in batch_records]).astype("int64").reshape([-1, 1])
max_len = max([max(lens) for lens in batch_lens])
for i in range(len(batch_records)):
batch_input_ids.append([inst + list([pad_id] * (max_len - len(inst))) \
for inst in batch_records[i]["input_ids"]])
batch_input_pos.append([inst + list([pad_id] * (max_len - len(inst))) \
for inst in batch_records[i]["input_pos"]])
batch_seg_ids.append([inst + list([pad_id] * (max_len - len(inst))) \
for inst in batch_records[i]["segment_ids"]])
batch_input_masks.append([[1] * len(inst) + [0] * (max_len - len(inst)) \
for inst in batch_records[i]["input_ids"]])
image_embedding, image_mask = pad_feature_data(image_features, return_mask=True)
image_loc = pad_feature_data(image_boxes)
src_ids = np.array(batch_input_ids).astype("int64").reshape([num_choice * num_sample, max_len, 1])
src_pos = np.array(batch_input_pos).astype("int64").reshape([num_choice * num_sample, max_len, 1])
src_seg = np.array(batch_seg_ids).astype("int64").reshape([num_choice * num_sample, max_len, 1])
src_masks = np.array(batch_input_masks).astype("float32").reshape([num_choice * num_sample, max_len, 1])
src_task = np.zeros(src_ids.shape, dtype="int64")
batch, seq_len, fea_len = image_embedding.shape
image_embedding = np.tile(np.expand_dims(image_embedding, axis=1), \
(1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, fea_len])
image_mask = np.tile(np.expand_dims(image_mask, axis=1), \
(1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, 1])
image_loc = np.tile(np.expand_dims(image_loc, axis=1), \
(1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, 5])
return_list = [src_ids, src_pos, src_seg, src_task, src_masks, \
image_embedding, image_loc, image_mask, labels, batch_anno_ids]
return_list.append(np.array([task_index]).astype('int64'))
return_list.append(binary_labels)
for i in xrange(task_num):
if i == task_index:
return_list.append(np.array([1.0]).astype("float32"))
else:
return_list.append(np.array([0.0]).astype("float32"))
return return_list
def pad_feature_data(data, pad_value=0.0, dtype="float32", return_mask=False):
"""
pad visual features with given pad value
"""
    max_length = max([len(item) for item in data])
    data_width = len(data[0][0])
    out_data = np.ones((len(data), max_length, data_width), dtype=dtype) * pad_value
    out_mask = np.zeros((len(data), max_length, 1), dtype=dtype)
for i in range(len(data)):
out_data[i, 0: len(data[i]), :] = data[i]
if return_mask:
out_mask[i, 0:len(data[i]):] = 1.0
if return_mask:
return out_data, out_mask
else:
return out_data
if __name__ == "__main__":
pass
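    # Illustrative sanity check (not part of the original file): two fabricated
    # records with 4 answer choices each and 2048-d features for 3 regions,
    # padded and reshaped by prepare_batch_data; all values are toy data.
    num_choice, feature_size = 4, 2048
    toy_records = []
    for anno_id in range(2):
        input_ids = [[101, 7, 8, 102], [101, 7, 102], [101, 9, 10, 11, 102], [101, 12, 102]]
        toy_records.append({
            "input_ids": input_ids,
            "input_pos": [list(range(len(ids))) for ids in input_ids],
            "segment_ids": [[0] * len(ids) for ids in input_ids],
            "input_lens": [len(ids) for ids in input_ids],
            "target": 1,
            "anno_id": anno_id,
            "features": np.random.rand(3, feature_size).astype("float32"),
            "boxes": np.random.rand(3, 5).astype("float32"),
        })
    outputs = prepare_batch_data(toy_records, num_choice, pad_id=0, task_index=0, task_num=1)
    print([getattr(o, "shape", o) for o in outputs])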
output_model_path="output_vcr"
lr_scheduler="manual_warmup_decay"
decay_steps="13308;19962"
lr_decay_ratio=0.1
num_train_steps=26640
SAVE_STEPS=6660
WARMUP_STEPS=6654
BATCH_SIZE=64
VALID_STEPS=20000
LR_RATE=2e-5
WEIGHT_DECAY=0.01
MAX_LEN=80
[
{
"task": "VCR_Q-A",
"num_choice": 4,
"annotations_jsonpath_train": "./data/vcr/train.jsonl",
"annotations_jsonpath_val": "./data/vcr/val.jsonl",
"annotations_jsonpath_test": "./data/vcr/test.jsonl",
"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb",
"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb",
"unisex_names_table" : "./data/vcr/unisex_names_table.csv",
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"fusion_method" : "mul",
"dropout_rate" : 0.1,
"max_seq_len" : 60,
"use_gt_fea" : true,
"shufflekeep_across_task": true,
"shuffle_every_epoch": true,
"task_weight": 1.0,
"task_prefix": "vcr_qa"
},
{
"task": "VCR_QA-R",
"num_choice": 4,
"annotations_jsonpath_train": "./data/vcr/train.jsonl",
"annotations_jsonpath_val": "./data/vcr/val.jsonl",
"annotations_jsonpath_test": "./data/vcr/test.jsonl",
"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb",
"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb",
"unisex_names_table" : "./data/vcr/unisex_names_table.csv",
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"fusion_method" : "mul",
"dropout_rate" : 0.1,
"max_seq_len" : 80,
"use_gt_fea" : true,
"shufflekeep_across_task": true,
"shuffle_every_epoch" : true,
"task_weight": 1.0,
"task_prefix": "vcr_qar"
}
]
[
{
"task": "VCR_Q-A",
"num_choice": 4,
"annotations_jsonpath_train": "./data/vcr/train.jsonl",
"annotations_jsonpath_val": "./data/vcr/val.jsonl",
"annotations_jsonpath_test": "./data/vcr/test.jsonl",
"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb",
"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb",
"unisex_names_table" : "./data/vcr/unisex_names_table.csv",
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"tagger_path" : "./script/ntc.pickle",
"nltk_data_path" : "./nltk_data",
"fusion_method" : "mul",
"dropout_rate" : 0.1,
"max_seq_len" : 60,
"use_gt_fea" : true,
"task_prefix" : "vcr_qa"
}
]
[
{
"task": "VCR_QA-R",
"num_choice": 4,
"annotations_jsonpath_train": "./data/vcr/train.jsonl",
"annotations_jsonpath_val": "./data/vcr/val.jsonl",
"annotations_jsonpath_test": "./data/vcr/test.jsonl",
"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb",
"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb",
"unisex_names_table" : "./data/vcr/unisex_names_table.csv",
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"vocab_path" : "./package/vocab.txt",
"tagger_path" : "./script/ntc.pickle",
"nltk_data_path" : "./nltk_data",
"fusion_method" : "mul",
"dropout_rate" : 0.1,
"max_seq_len" : 80,
"use_gt_fea" : true,
"task_prefix" : "vcr_qa"
}
]
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" finetuning vison-language task """
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
import time
import datetime
import argparse
import numpy as np
import multiprocessing
import json
from reader.vcr_finetuning import VCRDataJointReader
from model.ernie_vil import ErnieVilModel, ErnieVilConfig
from optim.optimization import optimization
from utils.args import print_arguments
from utils.init import init_checkpoint, init_pretraining_params
from args.finetune_args import parser
import paddle.fluid as fluid
args = parser.parse_args()
# yapf: enable.
#READERS = {"vcr": VCRDataJointReader, "vqa": VQADataReader, "refcoco+": RefcocoReader, "flickr": FlickrReader}
READERS = {"vcr": VCRDataJointReader}
def format_result(res_arr, qids, pred, labels, scores):
"""
trans batch results into json format
"""
for i in range(len(qids)):
res="\t".join([str(qids[i]), str(pred[i]), str(labels[i]), " ".join(["%.5f" % s for s in scores[i]])])
res_arr.append(res)
return res_arr
def create_vcr_model(pyreader_name, ernie_config, task_group, is_prediction=False):
"""
    create the model architecture for VCR tasks
"""
shapes = [[-1, args.max_seq_len, 1], #src_id
[-1, args.max_seq_len, 1], #pos_id
[-1, args.max_seq_len, 1], #sent_id
[-1, args.max_seq_len, 1], #task_id
[-1, args.max_seq_len, 1], #input_mask
[-1, args.max_img_len, args.feature_size], #image_embedding
[-1, args.max_img_len, 5], #image_loc
[-1, args.max_img_len, 1], #image_mask
[-1, 1], #labels
[-1, 1], #qids
[], #task_index
[-1, 1], #binary_labels
]
dtypes = ['int64', 'int64', 'int64', 'int64', 'float32', 'float32', 'float32', 'float32',
'int64', 'int64', 'int64', 'float32']
lod_levels = [0] * len(dtypes)
for _ in task_group:
shapes.append([])
dtypes.append('float')
lod_levels.append(0)
pyreader = fluid.layers.py_reader(
capacity=30,
shapes=shapes,
dtypes=dtypes,
lod_levels=lod_levels,
name=pyreader_name,
use_double_buffer=False)
inputs = fluid.layers.read_file(pyreader)
src_ids, pos_ids, sent_ids, task_ids, input_mask, image_embeddings, \
image_loc, image_mask, labels, q_ids, task_index, binary_labels = inputs[: 12]
ernie_vil = ErnieVilModel(
src_ids=src_ids,
position_ids=pos_ids,
sentence_ids=sent_ids,
task_ids=task_ids,
input_mask=input_mask,
image_embeddings=image_embeddings,
image_loc=image_loc,
input_image_mask=image_mask,
config=ernie_config
)
h_cls, h_img = ernie_vil.get_pooled_output()
task_conf = task_group[0]
fusion_method = task_conf["fusion_method"]
fusion_fea = ernie_vil.get_match_score(text=h_cls, image=h_img, \
dropout_rate=task_conf["dropout_rate"],
mode=fusion_method)
if is_prediction:
num_choice = int(task_conf['num_choice'])
task_name = task_conf.get('task_prefix', 'vcr')
score = fluid.layers.fc(fusion_fea, 1,
param_attr = fluid.ParamAttr(name = task_name + "_fc.w_0",
initializer = fluid.initializer.TruncatedNormal(scale = 0.02)),
bias_attr = task_name + "_fc.b_0")
score = fluid.layers.reshape(score, shape = [-1, num_choice])
_loss, _softmax = fluid.layers.softmax_with_cross_entropy(logits = score,
label = labels, return_softmax = True)
_acc = fluid.layers.accuracy(input = _softmax, label = labels)
pred = fluid.layers.argmax(score, axis = 1)
mean_loss = fluid.layers.mean(_loss)
task_vars = [mean_loss, _acc, pred, q_ids, labels, _softmax]
for var in task_vars:
var.persistable = True
return pyreader, task_vars
else:
start_ind = 12
mean_loss = fluid.layers.zeros(shape = [1], dtype = 'float32')
mean_acc = fluid.layers.zeros(shape = [1], dtype = 'float32')
for task_conf in task_group:
task_weight = inputs[start_ind]
start_ind += 1
num_choice = int(task_conf['num_choice'])
task_name = task_conf.get('task_prefix', 'vcr')
score = fluid.layers.fc(fusion_fea, 1,
param_attr = fluid.ParamAttr(name = task_name + "_fc.w_0",
initializer = fluid.initializer.TruncatedNormal(scale = 0.02)),
bias_attr = task_name + "_fc.b_0")
_loss = fluid.layers.sigmoid_cross_entropy_with_logits(score,
binary_labels, name = "cross_entropy_loss")
tmp_score = fluid.layers.reshape(score, shape = [-1, num_choice])
_softmax = fluid.layers.softmax(tmp_score)
_acc = fluid.layers.accuracy(input = _softmax, label = labels)
_mean_loss = fluid.layers.mean(_loss)
mean_loss += _mean_loss * task_weight
mean_acc += _acc * task_weight
task_vars = [fluid.layers.reduce_mean(mean_loss), mean_acc]
for var in task_vars:
var.persistable = True
return pyreader, task_vars
#MODELS = {"vcr": create_vcr_model, "vqa": create_vqa_model, "refcoco+": create_refcoco_model}
MODELS = {"vcr": create_vcr_model}
def predict_wrapper(args,
exe,
ernie_config,
task_group,
test_prog=None,
pyreader=None,
graph_vars=None):
"""Context to do validation.
"""
reader_name = READERS[args.task_name]
data_reader = reader_name(
task_group,
split=args.test_split,
vocab_path=args.vocab_path,
is_test=True,
shuffle=False,
batch_size=args.batch_size,
epoch=args.epoch)
if args.do_test:
assert args.init_checkpoint is not None, "[FATAL] Please use --init_checkpoint '/path/to/checkpoints' \
to specify your pretrained model checkpoints"
init_pretraining_params(exe, args.init_checkpoint, test_prog)
print(("testing on %s %s split") % (args.task_name, args.test_split))
def predict(exe=exe, pyreader=pyreader):
"""
inference for downstream tasks
"""
pyreader.decorate_tensor_provider(data_reader.data_generator())
pyreader.start()
cost = 0
appear_step = 0
task_acc = {}
task_steps = {}
steps = 0
case_f1 = 0
appear_f1 = 0
time_begin = time.time()
task_name_list = [v.name for v in graph_vars]
fetch_list = task_name_list
print('task name list : ', task_name_list)
sum_acc = 0
res_arr = []
while True:
try:
outputs = exe.run(fetch_list=fetch_list, program=test_prog)
each_acc = outputs[1][0]
preds = np.reshape(outputs[2], [-1])
qids = np.reshape(outputs[3], [-1])
labels = np.reshape(outputs[4], [-1])
scores = np.reshape(outputs[5], [-1, 4])
sum_acc += each_acc
steps += 1
if steps % 10 == 0:
print('cur_step:', steps, 'cur_acc:', sum_acc / steps)
format_result(res_arr, qids.tolist(), preds.tolist(), labels.tolist(), scores.tolist())
except fluid.core.EOFException:
pyreader.reset()
break
used_time = time.time() - time_begin
with open(args.result_file, "w") as f:
for r in res_arr:
f.write(r + "\n")
print("average_acc:", sum_acc / steps)
ret = {}
ret["acc"] = "acc: %f" % (sum_acc / steps)
for item in ret:
try:
ret[item] = ret[item].split(':')[-1]
except:
pass
return ret
return predict
def get_optimizer(total_loss, train_program, startup_prog, args):
"""
optimization func
"""
decay_steps_str=args.decay_steps
if decay_steps_str == "":
decay_steps = []
else:
decay_steps = [int(s) for s in decay_steps_str.split(";")]
scheduled_lr = optimization(
loss=total_loss,
warmup_steps=args.warmup_steps,
num_train_steps=args.num_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
decay_steps=decay_steps,
lr_decay_ratio=args.lr_decay_ratio)
return scheduled_lr
def main(args):
"""
Main func for downstream tasks
"""
print("finetuning tasks start")
ernie_config = ErnieVilConfig(args.ernie_config_path)
ernie_config.print_config()
with open(args.task_group_json) as f:
task_group = json.load(f)
print('task: ', task_group)
startup_prog = fluid.Program()
if args.do_train and args.do_test:
print("can not set both do_train and do_test as True")
return
model_name = MODELS[args.task_name]
if args.do_train:
train_program = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, model_outputs = model_name(
pyreader_name='train_reader', ernie_config=ernie_config, task_group=task_group)
total_loss = model_outputs[0]
scheduled_lr = get_optimizer(total_loss, train_program, startup_prog, args)
if args.do_test:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, model_outputs = model_name(
pyreader_name='test_reader', ernie_config=ernie_config, task_group=task_group, is_prediction=True)
total_loss = model_outputs[0]
test_prog = test_prog.clone(for_test=True)
if args.use_gpu:
gpu_id = 0
if os.getenv("FLAGS_selected_gpus"):
gpu_id = int(os.getenv("FLAGS_selected_gpus"))
place = fluid.CUDAPlace(gpu_id) if args.use_gpu else fluid.CPUPlace()
print("theoretical memory usage: ")
if args.do_train:
print(fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size))
if args.do_test:
print(fluid.contrib.memory_usage(
program=test_prog, batch_size=args.batch_size))
nccl2_num_trainers = 1
nccl2_trainer_id = 0
print("args.is_distributed:", args.is_distributed)
trainer_id = 0
if args.is_distributed:
trainer_id = int(os.getenv("PADDLE_TRAINER_ID"))
worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
worker_endpoints = worker_endpoints_env.split(",")
trainers_num = len(worker_endpoints)
print("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
trainer_id:{}".format(worker_endpoints, trainers_num,
current_endpoint, trainer_id))
# prepare nccl2 env.
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
if args.nccl_comm_num > 1:
config.nccl_comm_num = args.nccl_comm_num
if args.use_hierarchical_allreduce and trainers_num > args.hierarchical_allreduce_inter_nranks:
config.use_hierarchical_allreduce=args.use_hierarchical_allreduce
config.hierarchical_allreduce_inter_nranks=args.hierarchical_allreduce_inter_nranks
assert config.hierarchical_allreduce_inter_nranks > 1
assert trainers_num % config.hierarchical_allreduce_inter_nranks == 0
config.hierarchical_allreduce_exter_nranks = \
trainers_num / config.hierarchical_allreduce_inter_nranks
t = fluid.DistributeTranspiler(config=config)
t.transpile(
trainer_id,
trainers=worker_endpoints_env,
current_endpoint=current_endpoint,
program=train_program,
startup_program=startup_prog)
nccl2_num_trainers = trainers_num
nccl2_trainer_id = trainer_id
exe = fluid.Executor(place)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_checkpoint != "":
sys.stderr.write('############################WARNING############################')
sys.stderr.write('####### using init_pretraining_params, not init_checkpoint ####')
sys.stderr.write('## meaning hyper param e.g. lr won\'t inherit from checkpoint##')
sys.stderr.write('###############################################################')
init_pretraining_params(exe, args.init_checkpoint, train_program)
reader_name=READERS[args.task_name]
data_reader = reader_name(
task_group,
split="train",
vocab_path=args.vocab_path,
batch_size=args.batch_size,
epoch=args.epoch,)
exec_strategy = fluid.ExecutionStrategy()
if args.use_fast_executor:
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = 2
exec_strategy.num_iteration_per_drop_scope = min(10, args.skip_steps)
build_strategy = fluid.compiler.BuildStrategy()
build_strategy.fuse_all_reduce_ops = False
if args.use_fuse:
build_strategy.fuse_all_reduce_ops = True
if args.do_train:
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
loss_name=total_loss.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy,
main_program=train_program,
num_trainers=nccl2_num_trainers,
trainer_id=nccl2_trainer_id)
if args.do_test:
predict = predict_wrapper(
args,
exe,
ernie_config,
task_group,
test_prog=test_prog,
pyreader=test_pyreader,
graph_vars=model_outputs)
result = predict()
if args.do_train:
train_pyreader.decorate_tensor_provider(data_reader.data_generator())
train_pyreader.start()
steps = 0
time_begin = time.time()
node_nums = 1 #int(os.getenv("PADDLE_NODES_NUM"))
used_time_all = 0
while steps < args.num_train_steps:
try:
steps += node_nums
skip_steps = args.skip_steps * node_nums
fetch_list = []
if nccl2_trainer_id == 0 and steps % skip_steps == 0:
task_name_list = [v.name for v in model_outputs]
fetch_list = task_name_list
fetch_list.append(scheduled_lr.name)
time_begin = time.time()
outputs = train_exe.run(fetch_list=fetch_list)
if outputs:
print("feed_queue size", train_pyreader.queue.size())
progress_file = data_reader.get_progress()
epoch = progress_file["current_epoch"]
current_file_index = progress_file["current_file_index"]
total_file = progress_file["total_file"]
current_file = progress_file["current_file"]
print(
"epoch: %d, progress: %d/%d, step: %d, loss: %f, "
"acc: %f"
% (epoch, current_file_index, total_file, steps,
outputs[0][0],
outputs[1][0]))
print("steps:", steps)
print("save_steps:", args.save_steps)
np_lr = outputs[-1:]
date_str = datetime.datetime.now().strftime("%Y%m%d %H:%M:%S")
np_lr = float(np.mean(np_lr[0]))
print("%s current learning_rate:%.8f" % (date_str, np_lr))
if steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
print("save_path:", save_path)
fluid.io.save_persistables(exe, save_path, train_program)
time_end = time.time()
used_time = time_end - time_begin
time_end = time_begin
print("used_time:", used_time)
except fluid.core.EOFException:
train_pyreader.reset()
break
if __name__ == '__main__':
print_arguments(args)
main(args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""ERNIE-ViL model"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import json
import six
import paddle.fluid as fluid
from model.vl_transformer_encoder import encoder, pre_process_layer
class ErnieVilConfig(object):
"""
configuration for ernie-vil
"""
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path) as json_file:
config_dict = json.load(json_file)
except Exception:
raise IOError("Error in parsing Ernie model config file '%s'" %
config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict[key]
def print_config(self):
"""
print configuration value
"""
for arg, value in sorted(six.iteritems(self._config_dict)):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
class ErnieVilModel(object):
"""
main class for ERNIE-ViL model
"""
def __init__(self,
src_ids,
position_ids,
sentence_ids,
task_ids,
input_mask,
image_embeddings,
image_loc,
input_image_mask,
config,
predict_feature=False,
predict_class=True,
use_attr=False,
use_soft_label=True):
self._emb_size = config['hidden_size']
self._n_layer = config['num_hidden_layers']
self._n_head = config['num_attention_heads']
self._v_head = config['v_num_attention_heads']
self._v_emb_size = config['v_hidden_size']
self._v_inter_hid = config['v_intermediate_size']
self._co_head = config['co_num_attention_heads']
self._co_emb_size = config['co_hidden_size']
self._co_inter_hid = config['co_intermediate_size']
self._voc_size = config['vocab_size']
self._class_size = config['class_size']
self._class_attr_size = config['class_attr_size']
self._max_position_seq_len = config['max_position_embeddings']
self._sent_types = config['sent_type_vocab_size']
self._task_types = config['task_type_vocab_size']
self._hidden_act = config['hidden_act']
self._prepostprocess_dropout = config['hidden_dropout_prob']
self._attention_dropout = config['attention_probs_dropout_prob']
self._v_biattention_id = config['v_biattention_id']
self._t_biattention_id = config['t_biattention_id']
self._predict_feature = predict_feature
self._predict_class = predict_class
self._use_attr = use_attr
self._use_soft_label = use_soft_label
self._word_emb_name = "word_embedding"
self._pos_emb_name = "pos_embedding"
self._sent_emb_name = "sent_embedding"
self._image_emb_name = "image_embedding"
self._loc_emb_name = "loc_embedding"
self._dtype = "float32"
self._emb_dtype = "float32"
        # Initialize all weights by the truncated normal initializer, and all biases
# will be initialized by constant zero by default.
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=config['initializer_range'])
self._build_model(src_ids, position_ids, sentence_ids, task_ids, input_mask, \
image_embeddings, image_loc, input_image_mask)
def _build_model(self, src_ids, position_ids, sentence_ids, task_ids, input_mask, \
image_embeddings, image_loc, input_image_mask):
# padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding(
input=src_ids,
size=[self._voc_size, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._word_emb_name, initializer=self._param_initializer),
is_sparse=False)
position_emb_out = fluid.layers.embedding(
input=position_ids,
size=[self._max_position_seq_len, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._pos_emb_name, initializer=self._param_initializer))
sent_emb_out = fluid.layers.embedding(
sentence_ids,
size=[self._sent_types, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(
name=self._sent_emb_name, initializer=self._param_initializer))
emb_out = emb_out + position_emb_out
emb_out = emb_out + sent_emb_out
emb_out = pre_process_layer(
emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
self_attn_mask = fluid.layers.matmul(
x=input_mask, y=input_mask, transpose_y=True)
self_attn_mask = fluid.layers.scale(
x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
n_head_self_attn_mask = fluid.layers.stack(
x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True
image_embeddings = fluid.layers.fc(image_embeddings,
self._v_emb_size,
param_attr=fluid.ParamAttr(
name="image_emb.w_0",
initializer=self._param_initializer),
bias_attr = "image_emb.b_0",
num_flatten_dims = 2)
loc_emb_out = fluid.layers.fc(image_loc,
self._v_emb_size,
param_attr=fluid.ParamAttr(
name="image_loc.w_0",
initializer=self._param_initializer),
bias_attr = "image_loc.b_0",
num_flatten_dims = 2)
emb_vl_out = image_embeddings + loc_emb_out
emb_vl_out = pre_process_layer(
emb_vl_out, 'nd', self._prepostprocess_dropout, name='vl_pre_encoder')
self_attn_image_mask = fluid.layers.matmul(
x=input_image_mask, y=input_image_mask, transpose_y=True)
self_attn_image_mask = fluid.layers.scale(
x=self_attn_image_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
n_head_self_attn_image_mask = fluid.layers.stack(
x=[self_attn_image_mask] * self._v_head, axis=1)
n_head_self_attn_image_mask.stop_gradient = True
self_attn_vl_mask = fluid.layers.matmul(
x=input_image_mask, y=input_mask, transpose_y=True)
self_attn_vl_mask = fluid.layers.scale(
x=self_attn_vl_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
n_head_self_attn_vl_mask = fluid.layers.stack(
x=[self_attn_vl_mask] * self._co_head, axis=1)
n_head_self_attn_vl_mask.stop_gradient = True
self._enc_out, self._enc_vl_out = encoder(
enc_input=emb_out,
enc_vl_input=emb_vl_out,
attn_bias=n_head_self_attn_mask,
attn_image_bias=n_head_self_attn_image_mask,
attn_vl_bias=n_head_self_attn_vl_mask,
n_layer=self._n_layer,
n_head=self._n_head,
d_key=self._emb_size // self._n_head,
d_value=self._emb_size // self._n_head,
d_model=self._emb_size,
d_inner_hid=self._emb_size * 4,
v_head=self._v_head,
v_key=self._v_emb_size // self._v_head,
v_value=self._v_emb_size // self._v_head,
v_model=self._v_emb_size,
v_inner_hid=self._v_inter_hid,
co_head=self._co_head,
co_key=self._co_emb_size // self._co_head,
co_value=self._co_emb_size // self._co_head,
co_model=self._co_emb_size,
co_inner_hid=self._co_inter_hid,
prepostprocess_dropout=self._prepostprocess_dropout,
attention_dropout=self._attention_dropout,
relu_dropout=0,
hidden_act=self._hidden_act,
preprocess_cmd="",
postprocess_cmd="dan",
param_initializer=self._param_initializer,
v_biattention_id = self._v_biattention_id,
t_biattention_id = self._t_biattention_id,
name='encoder')
def get_sequence_output(self):
"""
Return sequence output of all text and img tokens
"""
return self._enc_out, self._enc_vl_out
def get_pooled_output(self):
"""
Get the first feature of each sequence for classification
"""
text_cls_feat = fluid.layers.slice(
input=self._enc_out, axes=[1], starts=[0], ends=[1])
text_cls_feat = fluid.layers.cast(
x=text_cls_feat, dtype=self._emb_dtype)
text_cls_feat = fluid.layers.fc(
input=text_cls_feat,
size=self._co_emb_size,
act="relu",
param_attr=fluid.ParamAttr(
name="pooled_fc_text.w_0", initializer=self._param_initializer),
bias_attr="pooled_fc_text.b_0")
image_cls_feat = fluid.layers.slice(
input=self._enc_vl_out, axes=[1], starts=[0], ends=[1])
image_cls_feat = fluid.layers.cast(
x=image_cls_feat, dtype=self._emb_dtype)
image_cls_feat = fluid.layers.fc(
input=image_cls_feat,
size=self._co_emb_size,
act="relu",
param_attr=fluid.ParamAttr(
name="pooled_fc_image.w_0", initializer=self._param_initializer),
bias_attr="pooled_fc_image.b_0")
return text_cls_feat, image_cls_feat
def get_match_score(self, text, image, dropout_rate=0.0, mode="mul"):
"""
match score for text [cls] and image [img] tokens
"""
if mode == "sum":
emb_fuse = text + image
elif mode == "mul":
emb_fuse = text * image
        else:
            raise ValueError("current mode %s is not supported" % mode)
if dropout_rate > 0.0:
emb_fuse = fluid.layers.dropout(emb_fuse,
self._attention_dropout,
dropout_implementation="upscale_in_train")
return emb_fuse
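# Illustrative only (not part of the original file): the "sum" and "mul" modes
# of get_match_score combine the pooled text ([CLS]) feature and the pooled
# image ([IMG]) feature element-wise; the toy vectors below stand in for them.
if __name__ == "__main__":
    import numpy as np
    text_feat = np.array([0.2, -0.5, 1.0], dtype="float32")
    image_feat = np.array([1.5, 0.4, -2.0], dtype="float32")
    print("sum fusion:", text_feat + image_feat)
    print("mul fusion:", text_feat * image_feat)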
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""two-stream Transformer encoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import partial
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.,
cache=None,
param_initializer=None,
name='multi_head_att'):
"""
    Multi-Head Attention. Note that attn_bias is added to the logits before
    computing the softmax activation to mask certain selected positions so
    that they will not be considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError(
"Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_query_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_query_fc.b_0')
k = layers.fc(input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_key_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_key_fc.b_0')
v = layers.fc(input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_value_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_value_fc.b_0')
return q, k, v
def __split_heads(x, n_head):
"""
        Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(
x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
        # permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
        Transpose and then reshape the last two dimensions of input tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(
x=trans_x,
shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key ** -0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat(
[layers.reshape(
cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat(
[layers.reshape(
cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_output_fc.w_0',
initializer=param_initializer),
bias_attr=name + '_output_fc.b_0')
return proj_out
def positionwise_feed_forward(x,
d_inner_hid,
d_hid,
dropout_rate,
hidden_act,
param_initializer=None,
name='ffn'):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(
name=name + '_fc_0.w_0',
initializer=param_initializer),
bias_attr=name + '_fc_0.b_0')
if dropout_rate:
hidden = layers.dropout(
hidden,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
out = layers.fc(input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(
name=name + '_fc_1.w_0', initializer=param_initializer),
bias_attr=name + '_fc_1.b_0')
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
name=''):
"""
    Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(
name=name + '_layer_norm_scale',
initializer=fluid.initializer.Constant(1.)),
bias_attr=fluid.ParamAttr(
name=name + '_layer_norm_bias',
initializer=fluid.initializer.Constant(0.)))
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out,
dropout_prob=dropout_rate,
dropout_implementation="upscale_in_train",
is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def encoder_co_layer(enc_input,
enc_vl_input,
attn_vl_bias,
co_head,
co_key,
co_value,
co_model,
d_model,
d_inner_hid,
v_model,
v_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""
Co_layer to perform co-attention from visual to language or from language to visual
"""
enc_input_pre = pre_process_layer(
enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att')
enc_input_vl_pre = pre_process_layer(
enc_vl_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_vl_pre_att')
attn_output = multi_head_attention(
enc_input_pre,
enc_input_vl_pre,
enc_input_vl_pre,
layers.transpose(attn_vl_bias, perm=[0, 1, 3, 2]),
co_key,
co_value,
d_model,
co_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_vl_output = multi_head_attention(
enc_input_vl_pre,
enc_input_pre,
enc_input_pre,
attn_vl_bias,
co_key,
co_value,
v_model,
co_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_vl_multi_head_att')
attn_output = post_process_layer(
enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
attn_vl_output = post_process_layer(
enc_vl_input,
attn_vl_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_vl_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(
attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
ffd_vl_output = positionwise_feed_forward(
pre_process_layer(
attn_vl_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_vl_ffn'),
v_inner_hid,
v_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_vl_ffn')
enc_output = post_process_layer(
attn_output,
ffd_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_ffn')
enc_vl_output = post_process_layer(
attn_vl_output,
ffd_vl_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_vl_post_ffn')
return enc_output, enc_vl_output
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name=''):
"""The encoder layers that can be stacked to form a deep encoder.
    This module consists of a multi-head (self-)attention sublayer followed by
    position-wise feed-forward networks, both of which are wrapped with
    post_process_layer to add residual connection, layer normalization
    and dropout.
"""
attn_output = multi_head_attention(
pre_process_layer(
enc_input,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_att'),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + '_multi_head_att')
attn_output = post_process_layer(
enc_input,
attn_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_att')
ffd_output = positionwise_feed_forward(
pre_process_layer(
attn_output,
preprocess_cmd,
prepostprocess_dropout,
name=name + '_pre_ffn'),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + '_ffn')
return post_process_layer(
attn_output,
ffd_output,
postprocess_cmd,
prepostprocess_dropout,
name=name + '_post_ffn')
def encoder(enc_input,
enc_vl_input,
attn_bias,
attn_image_bias,
attn_vl_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
v_head,
v_key,
v_value,
v_model,
v_inner_hid,
co_head,
co_key,
co_value,
co_model,
co_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
v_biattention_id=[0, 1, 2, 3, 4, 5],
t_biattention_id=[18, 19, 20, 21, 22, 23],
name=''):
"""
The two-stream encoder is composed of single-modal layers (encoder_layer) for the
text and visual streams, interleaved with co-attention blocks (encoder_co_layer)
at the layer indices given by t_biattention_id and v_biattention_id
"""
v_start = 0
t_start = 0
block = 0
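# Run the remaining single-modal layers of each stream up to the next
# bi-attention index, then couple the two streams with a co-attention block.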
for v_layer_id, t_layer_id in zip(v_biattention_id, t_biattention_id):
v_end = v_layer_id
t_end = t_layer_id
for idx in range(t_start, t_end):
enc_output = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(idx))
enc_input = enc_output
for idx in range(v_start, v_end):
enc_vl_output = encoder_layer(
enc_vl_input,
attn_image_bias,
v_head,
v_key,
v_value,
v_model,
v_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_vlayer_' + str(idx))
enc_vl_input = enc_vl_output
enc_output, enc_vl_output = encoder_co_layer(
enc_input,
enc_vl_input,
attn_vl_bias,
co_head,
co_key,
co_value,
co_model,
d_model,
d_inner_hid,
v_model,
v_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_colayer_' + str(block))
enc_input, enc_vl_input = enc_output, enc_vl_output
block += 1
v_start = v_end
t_start = t_end
enc_output = encoder_layer(
enc_output,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_layer_' + str(t_end))
enc_vl_output = encoder_layer(
enc_vl_output,
attn_image_bias,
v_head,
v_key,
v_value,
v_model,
v_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + '_vlayer_' + str(v_end))
enc_output = pre_process_layer(
enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
enc_vl_output = pre_process_layer(
enc_vl_output, preprocess_cmd, prepostprocess_dropout, name="vl_post_encoder")
return enc_output, enc_vl_output
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" text preprocess """
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import paddle.fluid as fluid
def manual_warmup_decay(learning_rate, warmup_steps, num_train_steps, decay_steps=[], lr_decay_ratio=0.1):
"""
Applies linear warmup of the learning rate from 0, then decays it by lr_decay_ratio at each step in decay_steps and keeps it constant afterwards.
"""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
value=0.0,
dtype='float32',
persistable=True,
name="scheduled_learning_rate")
global_step = fluid.layers.learning_rate_scheduler._decay_step_counter(
)
with fluid.layers.control_flow.Switch() as switch:
with switch.case(global_step < warmup_steps):
warmup_lr = learning_rate * (global_step / warmup_steps)
fluid.layers.tensor.assign(warmup_lr, lr)
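# global_step / global_step evaluates to a tensor equal to 1.0; multiplying by it
# turns the python float learning rate into a tensor that can be assigned to lr.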
for i, step in enumerate(decay_steps):
with switch.case(global_step < step):
decayed_lr = learning_rate * (global_step / global_step) * pow(lr_decay_ratio, i)
fluid.layers.tensor.assign(decayed_lr, lr)
with switch.default():
constant_lr = learning_rate * (global_step / global_step) * pow(lr_decay_ratio, len(decay_steps))
fluid.layers.tensor.assign(constant_lr, lr)
return lr
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
"""
Applies linear warmup of the learning rate from 0, followed by linear decay to 0 over num_train_steps.
"""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
value=0.0,
dtype='float32',
persistable=True,
name="scheduled_learning_rate")
global_step = fluid.layers.learning_rate_scheduler._decay_step_counter(
)
with fluid.layers.control_flow.Switch() as switch:
with switch.case(global_step < warmup_steps):
warmup_lr = learning_rate * (global_step / warmup_steps)
fluid.layers.tensor.assign(warmup_lr, lr)
with switch.default():
decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
learning_rate=learning_rate,
decay_steps=num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
fluid.layers.tensor.assign(decayed_lr, lr)
return lr
def optimization(loss,
warmup_steps,
num_train_steps,
learning_rate,
train_program,
startup_prog,
weight_decay,
scheduler='linear_warmup_decay',
decay_steps=[],
lr_decay_dict_file="",
lr_decay_ratio=0.1):
"""
optimization implementation
"""
if warmup_steps > 0:
if scheduler == 'noam_decay':
scheduled_lr = fluid.layers.learning_rate_scheduler \
.noam_decay(1 / (warmup_steps * (learning_rate ** 2)),
warmup_steps)
elif scheduler == 'linear_warmup_decay':
scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps,
num_train_steps)
elif scheduler == 'manual_warmup_decay':
scheduled_lr = manual_warmup_decay(learning_rate, warmup_steps,
num_train_steps, decay_steps, lr_decay_ratio)
else:
raise ValueError("Unkown learning rate scheduler, should be "
"'noam_decay' or 'linear_warmup_decay' or 'manual_warmup_decay'")
else:
scheduled_lr = fluid.layers.create_global_var(
name=fluid.unique_name.generate("learning_rate"),
shape=[1],
value=learning_rate,
dtype='float32',
persistable=True)
lr_decay_dict = {}
if lr_decay_dict_file != "":
with open(lr_decay_dict_file) as f:
for line in f:
param, decay_rate = line.strip().split('\t')
lr_decay_dict[param] = float(decay_rate)
for param in fluid.default_main_program().block(0).all_parameters():
if param.name in lr_decay_dict:
print (param.name, lr_decay_dict[param.name])
param.optimize_attr['learning_rate'] = lr_decay_dict[param.name]
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
optimizer._learning_rate_map[fluid.default_main_program(
)] = scheduled_lr
fluid.clip.set_gradient_clip(
clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0))
def exclude_from_weight_decay(name):
"""
Parameters that are excluded from weight decay (layer norm and bias parameters)
"""
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
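# Weight decay is applied manually after optimizer.minimize: a frozen copy of each
# parameter is kept and param -= param_copy * weight_decay * lr, i.e. decoupled
# (AdamW-style) decay rather than L2 regularization inside Adam.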
param_list = dict()
for param in train_program.global_block().all_parameters():
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
_, param_grads = optimizer.minimize(loss)
if weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param.name):
continue
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[
param.name] * weight_decay * scheduled_lr * param.optimize_attr['learning_rate']
fluid.layers.assign(output=param, input=updated_param)
return scheduled_lr
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" text preprocess """
import random
import sys
import os
import base64
import numpy as np
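# Python 2 only: reload(sys) restores setdefaultencoding so that byte strings
# decode as utf-8 by default.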
reload(sys)
sys.setdefaultencoding("utf-8")
from preprocess import tokenization
class PreprocessorBasic(object):
"""
Main class for text preprocessing
"""
def __init__(self,
tokenizer_name,
vocab_path,
tagger_path="",
nltk_data_path="",
do_lower_case=True):
self.do_lower_case = do_lower_case
self.tokenizer = getattr(tokenization, tokenizer_name)(vocab_file=vocab_path, do_lower_case=do_lower_case)
self.vocab = self.tokenizer.vocab
def convert_sentence_to_ids_without_cls(self, sentence):
"""
Convert sentence to ids without cls
"""
tokens = self.tokenizer.tokenize(sentence)
ids = self.tokenizer.convert_tokens_to_ids(tokens)
return ids
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" tokenization implemnet """
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import unicodedata
import six
from functools import reduce
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def printable_text(text):
"""Returns text encoded in a way suitable for print or `tf.logging`."""
# These functions want `str` for both Python2 and Python3, but in one case
# it's a Unicode string and in the other it's a byte string.
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text
elif isinstance(text, unicode):
return text.encode("utf-8")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
fin = open(vocab_file)
for num, line in enumerate(fin):
items = convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
index = items[1] if len(items) == 2 else num
token = token.strip()
vocab[token] = int(index)
return vocab
def convert_by_vocab(vocab, items):
"""Converts a sequence of [tokens|ids] using the vocab."""
output = []
for item in items:
output.append(vocab[item])
return output
def convert_tokens_to_ids(vocab, tokens):
"""
Converts tokens to ids
"""
return convert_by_vocab(vocab, tokens)
def convert_ids_to_tokens(inv_vocab, ids):
"""
Converts ids to tokens
"""
return convert_by_vocab(inv_vocab, ids)
def whitespace_tokenize(text):
"""Runs basic whitespace cleaning and splitting on a peice of text."""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens
class FullTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
"""
turn text into tokens
"""
split_tokens = []
for token in self.basic_tokenizer.tokenize(text):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def tokenize_case(self, text):
"""
tokenize case
"""
split_tokens = []
case_indexs = []
basic_tokens, case_index = self.basic_tokenizer.tokenize_case(text)
case_indexs += case_index
case_indexs = [[i] for i in case_indexs]
for token_index, token in enumerate(basic_tokens):
wordpiece_tokens = self.wordpiece_tokenizer.tokenize(token)
if len(wordpiece_tokens) > 1:
case_indexs[token_index] = case_indexs[token_index]*(len(wordpiece_tokens))
for sub_token in wordpiece_tokens:
split_tokens.append(sub_token)
if case_indexs:
case_indexs = reduce(lambda x, y: x + y, case_indexs)
return split_tokens, case_indexs
def convert_tokens_to_ids(self, tokens):
"""
Converts tokens to ids
"""
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
"""
Converts ids to tokens
"""
return convert_by_vocab(self.inv_vocab, ids)
class CharTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
"""
Convert text to tokens
"""
split_tokens = []
for token in text.lower().split(" "):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
"""
Convert tokens to ids
"""
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
"""
Convert ids to tokens
"""
return convert_by_vocab(self.inv_vocab, ids)
class BasicTokenizer(object):
"""Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
def __init__(self, do_lower_case=True):
"""Constructs a BasicTokenizer.
Args:
do_lower_case: Whether to lower case the input.
"""
self.do_lower_case = do_lower_case
def tokenize(self, text):
"""Tokenizes a piece of text."""
text = convert_to_unicode(text)
text = self._clean_text(text)
# This was added on November 1st, 2018 for the multilingual and Chinese
# models. This is also applied to the English models now, but it doesn't
# matter since the English models were not trained on any Chinese data
# and generally don't have any Chinese data in them (there are Chinese
# characters in the vocabulary because Wikipedia does have some Chinese
# words in the English Wikipedia.).
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if self.do_lower_case:
token = token.lower()
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token))
output_tokens = whitespace_tokenize(" ".join(split_tokens))
return output_tokens
def tokenize_case(self, text):
"""
tokenize case
"""
text = convert_to_unicode(text)
text = self._clean_text(text)
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
case_index = []
for token in orig_tokens:
if self.do_lower_case:
if token.istitle():
case_index.append(1)
else:
case_index.append(0)
token = token.lower()
token = self._run_strip_accents(token)
if token == '':
case_index.pop()
tmpsplit_tokens, case_index = self._run_split_on_punc_case(token, case_index)
split_tokens.extend(tmpsplit_tokens)
output_tokens = whitespace_tokenize(" ".join(split_tokens))
return output_tokens, case_index
def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _run_split_on_punc(self, text):
"""Splits punctuation on a piece of text."""
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
return ["".join(x) for x in output]
def _run_split_on_punc_case(self, text, case_index):
"""Splits punctuation on a piece of text."""
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
if len(output) > 1:
case_index.extend([case_index[-1]]*(len(output)-1))
return ["".join(x) for x in output], case_index
def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(" ")
output.append(char)
output.append(" ")
else:
output.append(char)
return "".join(output)
def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xfffd or _is_control(char):
continue
if _is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
class WordpieceTokenizer(object):
"""Runs WordPiece tokenziation."""
def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization
using the given vocabulary.
For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
Args:
text: A single token or whitespace separated tokens. This should have
already been passed through `BasicTokenizer`.
Returns:
A list of wordpiece tokens.
"""
text = convert_to_unicode(text)
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue
is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
def _is_whitespace(char):
"""Checks whether `chars` is a whitespace character."""
# \t, \n, and \r are technically control characters but we treat them
# as whitespace since they are generally considered as such.
if char == " " or char == "\t" or char == "\n" or char == "\r":
return True
cat = unicodedata.category(char)
if cat == "Zs":
return True
return False
def _is_control(char):
"""Checks whether `chars` is a control character."""
# These are technically control characters but we count them as whitespace
# characters.
if char == "\t" or char == "\n" or char == "\r":
return False
cat = unicodedata.category(char)
if cat.startswith("C"):
return True
return False
def _is_punctuation(char):
"""Checks whether `chars` is a punctuation character."""
cp = ord(char)
# We treat all non-letter/number ASCII as punctuation.
# Characters such as "^", "$", and "`" are not in the Unicode
# Punctuation class but we treat them as punctuation anyways, for
# consistency.
if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
(cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
return True
cat = unicodedata.category(char)
if cat.startswith("P"):
return True
return False
"""
Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
import numpy as np
import copy
import pickle
import lmdb # install lmdb by "pip install lmdb"
import base64
class ImageFeaturesH5Reader(object):
"""
Reader for precomputed image region features stored in an LMDB file
"""
def __init__(self, features_path):
self.features_path = features_path
self.env = lmdb.open(self.features_path, max_readers=1, readonly=True,
lock=False, readahead=False, meminit=False)
with self.env.begin(write=False) as txn:
self._image_ids = pickle.loads(txn.get('keys'.encode()))
self.features = [None] * len(self._image_ids)
self.num_boxes = [None] * len(self._image_ids)
self.boxes = [None] * len(self._image_ids)
self.boxes_ori = [None] * len(self._image_ids)
def __len__(self):
return len(self._image_ids)
def __getitem__(self, image_id):
image_id = str(image_id).encode()
index = self._image_ids.index(image_id)
# Read the record from the LMDB file every time; features are not cached in memory.
with self.env.begin(write=False) as txn:
item = pickle.loads(txn.get(image_id))
image_id = item['image_id']
image_h = int(item['image_h'])
image_w = int(item['image_w'])
num_boxes = int(item['num_boxes'])
features = np.frombuffer(base64.b64decode(item["features"]), dtype=np.float32).reshape(num_boxes, 2048)
boxes = np.frombuffer(base64.b64decode(item['boxes']), dtype=np.float32).reshape(num_boxes, 4)
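# Prepend a mean-pooled global image feature as an extra region at index 0.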
g_feat = np.sum(features, axis=0) / num_boxes
num_boxes = num_boxes + 1
features = np.concatenate([np.expand_dims(g_feat, axis=0), features], axis=0)
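# Each region location is [x1, y1, x2, y2, relative_area]; coordinates are
# normalized by image width / height below, while *_ori keeps pixel coordinates.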
image_location = np.zeros((boxes.shape[0], 5), dtype=np.float32)
image_location[:, :4] = boxes
image_location[:, 4] = (image_location[:, 3] - image_location[:, 1]) * \
(image_location[:, 2] - image_location[:, 0]) / (float(image_w) * float(image_h))
image_location_ori = copy.deepcopy(image_location)
image_location[:, 0] = image_location[:, 0] / float(image_w)
image_location[:, 1] = image_location[:, 1] / float(image_h)
image_location[:, 2] = image_location[:, 2] / float(image_w)
image_location[:, 3] = image_location[:, 3] / float(image_h)
g_location = np.array([0, 0, 1, 1, 1])
image_location = np.concatenate([np.expand_dims(g_location, axis=0), image_location], axis=0)
g_location_ori = np.array([0, 0, image_w, image_h, image_w * image_h])
image_location_ori = np.concatenate([np.expand_dims(g_location_ori, axis=0), image_location_ori], axis=0)
data_json = {"features": features,
"num_boxes": num_boxes,
"image_location": image_location,
"image_location_ori": image_location_ori
}
return data_json
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" VCR Data Reader implementation """
from __future__ import print_function
from __future__ import division
import os
import base64
import numpy as np
import re
import random
import json
import json_lines
import csv
import sys
import itertools
from reader._image_features_reader import ImageFeaturesH5Reader
from preprocess import preprocessor
from batching.finetune_batching import prepare_batch_data
import paddle.fluid as fluid
def _converId(img_id):
"""
Convert a VCR image id such as "train-1234" to a unique integer id
"""
img_id = img_id.split('-')
if 'train' in img_id[0]:
new_id = int(img_id[1])
elif 'val' in img_id[0]:
new_id = int(img_id[1]) + 1000000
elif 'test' in img_id[0]:
new_id = int(img_id[1]) + 2000000
else:
raise ValueError("unknown split in image id: %s" % img_id[0])
return new_id
def _load_annotationsQ_A(annotations_jsonpath, split):
"""
Build an index of VCR Q->A annotations, mapping each image id to its question, answer choices and label.
"""
entries = []
with open(annotations_jsonpath) as f:
for annotation in json_lines.reader(f):
det_names = ""
question = annotation["question"]
if split == 'test':
ans_label = 0
else:
ans_label = annotation["answer_label"]
img_id = _converId(annotation["img_id"])
anno_id = int(annotation["annot_id"].split('-')[1])
entries.append(
{"question": question,
"answers": annotation["answer_choices"],
"metadata_fn": annotation["metadata_fn"],
"target": ans_label,
"img_id": img_id,
"anno_id": anno_id,
"det_names": annotation['objects']
})
return entries
def _load_annotationsQA_R(annotations_jsonpath, split):
"""
Build an index of VCR QA->R annotations, mapping each image id to its question (with the chosen answer), rationale choices and label.
"""
entries = []
with open(annotations_jsonpath, 'rb') as f:
for annotation in json_lines.reader(f):
if split == 'test':
for answer in annotation["answer_choices"]:
question = annotation["question"] + ["[MARK]"] + answer
img_id = _converId(annotation["img_id"])
ans_label = 0
anno_id = int(annotation["annot_id"].split('-')[1])
entries.append(
{"question": question,
"answers": annotation["rationale_choices"],
"metadata_fn": annotation["metadata_fn"],
"target": ans_label,
"img_id": img_id,
"anno_id": anno_id,
"det_names": annotation['objects']
})
else:
det_names = ""
question = annotation["question"] + ["[MARK]"] + \
annotation["answer_choices"][annotation['answer_label']]
ans_label = annotation["rationale_label"]
img_id = _converId(annotation["img_id"])
anno_id = int(annotation["annot_id"].split('-')[1])
entries.append(
{"question": question,
"answers": annotation["rationale_choices"],
"metadata_fn": annotation["metadata_fn"],
"target": ans_label,
"img_id": img_id,
"anno_id": anno_id,
"det_names": annotation['objects']})
return entries
class VCRDataReader(object):
"""
Data reader for a single VCR sub-task (VCR_Q-A or VCR_QA-R)
"""
def __init__(self,
task_conf,
split,
vocab_path=None,
batch_size=4096,
shuffle=True,
epoch=100,
is_test=False,
feature_reader_dict={},
random_seed=None,
task_index=0,
task_num=1):
self.task_conf = task_conf
self.processor = getattr(preprocessor,
task_conf["Proprocessor"])(tokenizer_name=self.task_conf["tokenizer_name"],
vocab_path=vocab_path)
self.vocab = self.processor.vocab
self.batch_size = batch_size
self.shuffle = shuffle
self.epoch = epoch
self.current_epoch = 0
self.current_file_index = 0
self.total_file = 0
self.current_file = None
self.random_seed = random_seed
self.max_seq_len = self.task_conf['max_seq_len']
self.pad_id = self.vocab["[PAD]"]
self.cls_id = self.vocab["[CLS]"]
self.sep_id = self.vocab["[SEP]"]
self.mask_id = self.vocab["[MASK]"]
self.is_test = is_test
self.task_index = task_index
self.task_num = task_num
if self.is_test:
self.epoch = 1
self.shuffle_files = False
if self.shuffle:
shufflekeep_across_task = self.task_conf.get('shufflekeep_across_task', True)
if shufflekeep_across_task:
self.global_rng = np.random.RandomState(random_seed)
else:
self.global_rng = np.random.RandomState()
self.shuffle_every_epoch = self.task_conf.get('shuffle_every_epoch', False)
task=self.task_conf['task']
annotations_jsonpath=self.task_conf['annotations_jsonpath_' + split]
self.num_choice = int(self.task_conf['num_choice'])
if task == 'VCR_Q-A':
self._entries = _load_annotationsQ_A(annotations_jsonpath, split)
elif task == "VCR_QA-R":
self._entries = _load_annotationsQA_R(annotations_jsonpath, split)
else:
assert False
self._split = split
self._names = []
with open(self.task_conf['unisex_names_table']) as csv_file:
csv_reader = csv.reader(csv_file, delimiter=',')
for row in csv_reader:
if row[1] != 'name':
self._names.append(row[1])
self._feature_reader = feature_reader_dict[self.task_conf['feature_lmdb_path']]
self.use_gt_fea = task_conf.get('use_gt_fea', False)
if self.use_gt_fea:
self._gt_feature_reader = feature_reader_dict[self.task_conf['gt_feature_lmdb_path']]
self._max_region_num = self.task_conf.get('max_region_num', 100)
print("use gt featurre")
else:
self._max_region_num = self.task_conf.get('max_region_num', 37)
print("only butd feature")
self.tokenize()
def generate_random_name(self, det_names):
"""
Replace "person" with a random name
"""
random_name = []
for name in det_names:
if name == 'person':
word = random.choice(self._names)
else:
word = name
random_name.append(word)
return random_name
def replace_det_with_name(self, inputs, random_names):
"""
Replace detection tags with the corresponding (possibly randomized) object names
"""
tokens = []
mask = []
for w in inputs:
if isinstance(w, list):
for idx in w:
word = random_names[idx]
tokens.append(word)
else:
word = w.encode('utf-8')
tokens.append(word)
return tokens, mask
def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
"""
Truncates a sequence pair in place to the maximum length.
"""
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
def get_progress(self):
"""
Return current progress of the training data
"""
progress_dict = {"current_epoch": self.current_epoch,
"current_file_index": self.current_file_index,
"total_file": self.total_file,
"current_file": self.current_file
}
return progress_dict
def tokenize(self):
"""
Tokenizes the captions.
"""
# This will add caption_tokens in each entry of the dataset.
# -1 represents nil, and should be treated as padding_idx in embedding.
count = 0
for entry in self._entries:
det_names = entry["det_names"]
random_names = self.generate_random_name(det_names)
# replace with name
tokens_a, mask_a = self.replace_det_with_name(entry["question"], random_names)
q_str = " ".join(tokens_a)
ids_a = []
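# For the QA->R task the question already contains the chosen answer after a
# "[MARK]" token; split on it and join the two parts with a [SEP] id.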
for i, q in enumerate(q_str.split(" [MARK] ")):
if i == 1:
ids_a.append(self.vocab["[SEP]"])
ids_a = ids_a + self.processor.convert_sentence_to_ids_without_cls(q)
input_ids_all = []
segment_ids_all = []
input_poss_all = []
input_len_all = []
for answer in entry["answers"]:
tokens_b, mask_b = self.replace_det_with_name(answer, random_names)
ids_b = self.processor.convert_sentence_to_ids_without_cls(" ".join(tokens_b))
self._truncate_seq_pair(ids_a, ids_b, self.max_seq_len - 3)
input_ids = []
segment_ids = []
input_ids.append(self.vocab["[CLS]"])
segment_ids.append(0)
for id in ids_a:
input_ids.append(id)
segment_ids.append(0)
input_ids.append(self.vocab["[SEP]"])
segment_ids.append(0)
assert len(ids_b) > 0
for id in ids_b:
input_ids.append(id)
segment_ids.append(1)
input_ids.append(self.vocab["[SEP]"])
segment_ids.append(1)
input_ids_all.append(input_ids)
segment_ids_all.append(segment_ids)
input_poss = [str(pos) for pos in range(len(input_ids))]
input_poss_all.append(input_poss)
input_len_all.append(len(input_ids))
entry["input_ids"] = input_ids_all
entry["input_poss"] = input_poss_all
entry["segment_ids"] = segment_ids_all
entry["input_lens"] = input_len_all
sys.stdout.write('%d/%d\r' % (count, len(self._entries)))
sys.stdout.flush()
count += 1
def parse_line(self, s_index):
"""
Form slot info with the line information
"""
entry = self._entries[s_index]
image_id = entry["img_id"]
image_fea_json = self._feature_reader[image_id]
features = image_fea_json["features"]
num_boxes = image_fea_json["num_boxes"]
boxes = image_fea_json["image_location"]
if not self.use_gt_fea:
num_boxes = min(num_boxes, self._max_region_num)
boxes = boxes[:num_boxes]
features = features[:num_boxes]
else:
boxes = boxes[:num_boxes]
features = features[:num_boxes]
image_fea_json = self._gt_feature_reader[image_id]
gt_features = image_fea_json["features"]
gt_num_boxes = image_fea_json["num_boxes"]
gt_boxes = image_fea_json["image_location"]
features[0] = (features[0] * num_boxes + gt_features[0] * gt_num_boxes) / (num_boxes + gt_num_boxes)
gt_boxes = gt_boxes[1: gt_num_boxes]
gt_features = gt_features[1: gt_num_boxes]
gt_num_boxes = gt_num_boxes - 1
gt_box_preserve = min(self._max_region_num - 1, gt_num_boxes)
gt_boxes = gt_boxes[:gt_box_preserve]
gt_features = gt_features[:gt_box_preserve]
gt_num_boxes = gt_box_preserve
num_box_preserve = min(self._max_region_num - int(gt_num_boxes), int(num_boxes))
boxes = boxes[:num_box_preserve]
features = features[:num_box_preserve]
# concatenate the boxes
mix_boxes = np.concatenate((boxes, gt_boxes), axis=0)
mix_features = np.concatenate((features, gt_features), axis=0)
mix_num_boxes = num_box_preserve + int(gt_num_boxes)
num_boxes = min(mix_num_boxes, self._max_region_num)
boxes = mix_boxes[:num_boxes]
features = mix_features[:num_boxes]
record = {
"input_ids": entry["input_ids"],
"input_pos": entry["input_poss"],
"segment_ids": entry["segment_ids"],
"input_lens": entry["input_lens"],
"target": int(entry["target"]),
"features": features,
"boxes": boxes,
"anno_id": entry["anno_id"]
}
return record
def data_generator(self):
"""
Data_generator
"""
sample_indice = range(len(self._entries))
def wrapper():
"""
Wrapper
"""
for epoch_index in range(self.epoch):
if self._split == "train":
self.current_example = 0
self.current_epoch = epoch_index
if self.shuffle:
if epoch_index == 0:
self.global_rng.shuffle(sample_indice)
print("shuffle epoch %d" % epoch_index)
elif self.shuffle_every_epoch:
self.global_rng.shuffle(sample_indice)
print("shuffle epoch %d" % epoch_index)
batch_records = []
for index in sample_indice:
batch_records.append(self.parse_line(index))
if len(batch_records) == self.batch_size:
yield prepare_batch_data(
batch_records, self.num_choice, self.pad_id, \
self.task_index, self.task_num), self.task_conf['task']
batch_records = []
if len(batch_records) > 0:
yield prepare_batch_data(
batch_records, self.num_choice, self.pad_id, \
self.task_index, self.task_num), self.task_conf['task']
return wrapper
class VCRDataJointReader(object):
"""
Joint data reader for the VCR Q->A and QA->R tasks
"""
def __init__(self,
task_conf_group,
split,
batch_size=4096,
shuffle=True,
epoch=100,
vocab_path=None,
is_test=False):
self.task_readers = []
feature_reader_dict = {}
self.task_dup_cnt = []
for task_conf in task_conf_group:
if 'feature_lmdb_path' in task_conf:
if task_conf['feature_lmdb_path'] not in feature_reader_dict:
feature_reader_dict[task_conf['feature_lmdb_path']] = \
ImageFeaturesH5Reader(task_conf['feature_lmdb_path'])
if 'gt_feature_lmdb_path' in task_conf and task_conf.get('use_gt_fea', False):
if task_conf['gt_feature_lmdb_path'] not in feature_reader_dict:
feature_reader_dict[task_conf['gt_feature_lmdb_path']] = \
ImageFeaturesH5Reader(task_conf['gt_feature_lmdb_path'])
task_batch_size = task_conf.get('batch_size', 64)
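# Express each task's configured batch size as a multiple of the shared base
# batch size; tasks with larger batch sizes contribute proportionally more
# batches per round.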
self.task_dup_cnt.append(max(int(task_batch_size / batch_size), 1))
random_seed=np.random.randint(1000)
for task_index, task_conf in enumerate(task_conf_group):
self.task_readers.append(VCRDataReader(task_conf, split, vocab_path, batch_size, shuffle,
epoch, is_test, feature_reader_dict, random_seed, task_index, len(task_conf_group)))
self.task_generators = [reader.data_generator() for reader in self.task_readers]
def get_progress(self):
"""
Return current progress of the training data
"""
current_epoch = max([reader.current_epoch for reader in self.task_readers])
current_file_index = max([reader.current_file_index for reader in self.task_readers])
total_file = max([reader.total_file for reader in self.task_readers])
current_file = ""
self.progress_dict = {"current_epoch": current_epoch,
"current_file_index": current_file_index,
"total_file": total_file,
"current_file": current_file
}
return self.progress_dict
def data_generator(self):
"""
Data_generator
"""
def wrapper():
"""
Wrapper
"""
task_buffer = [[] for i in range(len(self.task_dup_cnt))]
for data in itertools.izip(*[generator() for generator in self.task_generators]):
for i, d in enumerate(data):
task_buffer[i].append(d)
if len(task_buffer[i]) >= self.task_dup_cnt[i]:
for t in task_buffer[i]:
yield t[0]
task_buffer[i] = []
return wrapper
if __name__ == "__main__":
pass
nltk==3.2.4
numpy==1.14.3
scipy==1.2.1
six==1.11.0
json_lines==0.5.0
lmdb==0.97
opencv-python==3.2.0.8
paddlepaddle-gpu==1.8.3.post97
set -eu
set -x
#bash -x ./env.sh
TASK_NAME=$1
CONF_FILE=$2
VOCAB_PATH=$3
ERNIE_VIL_CONFIG=$4
PRETRAIN_MODELS=$5
source $CONF_FILE
#configure your cuda and cudnn
#configure nccl
export FLAGS_fast_eager_deletion_mode=1
export FLAGS_eager_delete_tensor_gb=0.0
export FLAGS_fraction_of_gpu_memory_to_use=0.98
e_executor=$(echo ${use_experimental_executor-'True'} | tr '[A-Z]' '[a-z]')
use_fuse=$(echo ${use_fuse-'False'} | tr '[A-Z]' '[a-z]')
if [[ ${use_fuse} == "true" ]]; then
export FLAGS_fuse_parameter_memory_size=131072
export FLAGS_fuse_parameter_groups_size=10
fi
TASK_GROUP_JSON=./conf/$TASK_NAME/task_${TASK_NAME}.json
gpu_cnt=`echo $CUDA_VISIBLE_DEVICES | awk -F"\t" '{len=split($0,vec,",");print len}'`
echo "gpu_cnt", $gpu_cnt
python finetune.py --use_cuda "True" \
--is_distributed "False" \
--use_fast_executor ${e_executor-"True"} \
--nccl_comm_num ${nccl_comm_num:-"1"} \
--batch_size $((BATCH_SIZE/gpu_cnt)) \
--do_train "True" \
--do_test "False" \
--task_name ${TASK_NAME} \
--vocab_path ${VOCAB_PATH} \
--task_group_json ${TASK_GROUP_JSON} \
--lr_scheduler ${lr_scheduler} \
--decay_steps ${decay_steps-""} \
--lr_decay_ratio ${lr_decay_ratio-0.1} \
--num_train_steps ${num_train_steps} \
--checkpoints $output_model_path \
--save_steps ${SAVE_STEPS} \
--init_checkpoint ${PRETRAIN_MODELS} \
--ernie_config_path ${ERNIE_VIL_CONFIG} \
--learning_rate ${LR_RATE} \
--warmup_steps ${WARMUP_STEPS} \
--weight_decay ${WEIGHT_DECAY:-0} \
--max_seq_len ${MAX_LEN} \
--validation_steps ${VALID_STEPS} \
--skip_steps 10
set -eu
#bash -x ./env.sh
TASK_NAME=$1
SUB_TASK_NAME=$2
TEST_SPLIT=$3
CONF_FILE=$4
VOCAB_PATH=$5
ERNIE_VIL_CONFIG=$6
MODEL_PATH=$7
RES_FILE=$8
source $CONF_FILE
#configure your cuda and cudnn
#configure nccl
export FLAGS_eager_delete_tensor_gb=2.0
export FLAGS_fraction_of_gpu_memory_to_use=0.01
export FLAGS_sync_nccl_allreduce=1
e_executor=$(echo ${use_experimental_executor-'True'} | tr '[A-Z]' '[a-z]')
use_fuse=$(echo ${use_fuse-'False'} | tr '[A-Z]' '[a-z]')
if [[ ${use_fuse} == "true" ]]; then
export FLAGS_fuse_parameter_memory_size=131072
export FLAGS_fuse_parameter_groups_size=10
fi
TASK_GROUP_JSON=./conf/$TASK_NAME/task_${TASK_NAME}_${SUB_TASK_NAME}.json
python finetune.py --use_cuda "True" \
--use_fast_executor ${e_executor-"True"} \
--batch_size ${BATCH_SIZE} \
--do_train "False" \
--do_test "True" \
--test_split ${TEST_SPLIT} \
--task_name $TASK_NAME \
--vocab_path ${VOCAB_PATH} \
--task_group_json ${TASK_GROUP_JSON} \
--result_file "$RES_FILE" \
--init_checkpoint "$MODEL_PATH" \
--ernie_config_path ${ERNIE_VIL_CONFIG} \
--max_seq_len ${MAX_LEN} \
--skip_steps 10
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Arguments for configuration."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import six
import argparse
def str2bool(v):
"""
because argparse does not support parsing strings such as "True" / "False"
into Python booleans directly
"""
return v.lower() in ("true", "t", "1")
class ArgumentGroup(object):
"""
group of arguments
"""
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, type, default, help, positional_arg=False, **kwargs):
"""
add arg
"""
prefix = "" if positional_arg else "--"
type = str2bool if type == bool else type
self._group.add_argument(
prefix + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
def print_arguments(args):
"""
Arguments print function
"""
print('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""parameters init function implementations"""
from __future__ import print_function
import os
import six
import numpy as np
import paddle.fluid as fluid
def init_checkpoint(exe, init_checkpoint_path, main_program):
"""
init checkpoint params with lr and step info
"""
assert os.path.exists(
init_checkpoint_path), "[%s] cann't be found." % init_checkpoint_path
def existed_persitables(var):
"""
Check whether the variable is persistable and exists in the checkpoint directory
"""
if not fluid.io.is_persistable(var):
return False
return os.path.exists(os.path.join(init_checkpoint_path, var.name))
fluid.io.load_vars(
exe,
init_checkpoint_path,
main_program=main_program,
predicate=existed_persitables)
print("Load model from {}".format(init_checkpoint_path))
def init_pretraining_params(exe, pretraining_params_path, main_program):
"""
init pretraining params without lr and step info
"""
assert os.path.exists(pretraining_params_path
), "[%s] cann't be found." % pretraining_params_path
def existed_params(var):
"""
Check whether the parameter exists in the pretraining params directory
"""
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(pretraining_params_path, var.name))
fluid.io.load_vars(
exe,
pretraining_params_path,
main_program=main_program,
predicate=existed_params)
print("Load pretraining parameters from {}.".format(
pretraining_params_path))
scipy==1.2.1
six==1.11.0
sklearn==0.0
sentencepiece==0.1.8
opencv-python==3.4.2.17
paddlepaddle-gpu==1.6.3.post107