diff --git a/ernie-vil/.meta/ernie_vil_struct.png b/ernie-vil/.meta/ernie_vil_struct.png
new file mode 100644
index 0000000000000000000000000000000000000000..cfa72e6116a2d2393f2bf12f25c98a66545c0698
Binary files /dev/null and b/ernie-vil/.meta/ernie_vil_struct.png differ
diff --git a/ernie-vil/README.md b/ernie-vil/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..c14475630ca7596dd638a839a6b7f42d27ff5bfa
--- /dev/null
+++ b/ernie-vil/README.md
@@ -0,0 +1,136 @@
+English | [简体中文](./README_zh.md)
+
+## _ERNIE-ViL_: Knowledge Enhanced Vision-Language Representations Through Scene Graph
+- [Framework](#framework)
+- [Pre-trained models](#pre-trained-models)
+- [Downstream tasks](#downstream-tasks)
+ * [VCR](#VCR)
+- [Usage](#usage)
+ * [Install PaddlePaddle](#install-paddlepaddle)
+ * [Fine-tuning on ERNIE-ViL](#fine-tuning-on-ernie-vil)
+ * [Inference](#inference)
+- [Citation](#citation)
+
+For a technical description of the algorithm, please see our paper:
+
+>[_**ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph**_](https://arxiv.org/abs/2006.16934)
+>
+>Fei Yu\*, Jiji Tang\*, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
+>
+>Preprint June 2020
+>
+
+
+
+ 
+
+
+**[ERNIE-ViL](https://arxiv.org/abs/2006.16934) is a knowledge-enhanced joint representation model for vision-language tasks**, and the first work to **introduce structured knowledge into vision-language pre-training**. Utilizing structured knowledge obtained
+from scene graphs, ERNIE-ViL constructs three **Scene Graph Prediction tasks**, i.e., the **Object Prediction**, **Attribute Prediction** and **Relationship Prediction** tasks.
+Thus, ERNIE-ViL learns better joint vision-language representations that characterize the alignment of detailed semantics across vision and language.
+
+
+
+## Framework
+
+Based on the scene graph parsed from the text with a Scene Graph Parser, we construct the Object Prediction, Attribute Prediction and Relationship Prediction tasks (a toy masking sketch follows the list):
+- **Object Prediction:** We randomly select a subset of the objects in the scene graph, then mask and predict the corresponding words in the sentence.
+- **Attribute Prediction:** For the object-attribute pairs in the scene graph, we randomly select a part of them, then mask and predict the words corresponding to the attribute nodes in the sentence.
+- **Relationship Prediction:** For the object-relationship-object triplets in the scene graph, we randomly select a part of the relationship nodes, then mask and predict them.
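+
+The snippet below is only a toy illustration of this masking idea, not the released pre-training code; the example caption, the node-to-token index and the `[MASK]` convention are assumptions made for the sketch.
+
+```python
+import random
+
+# Hypothetical caption and scene-graph nodes (token positions are 0-based).
+caption = "a brown dog is chasing a white cat".split()
+objects = {"dog": [2], "cat": [7]}            # object node -> token positions
+attributes = {"brown": [1], "white": [6]}     # attribute node -> token positions
+relationships = {"chasing": [4]}              # relationship node -> token positions
+
+def mask_nodes(tokens, node_index, ratio=0.3):
+    """Mask the tokens of randomly selected scene-graph nodes."""
+    masked = list(tokens)
+    for positions in node_index.values():
+        if random.random() < ratio:
+            for p in positions:
+                masked[p] = "[MASK]"
+    return masked
+
+print(mask_nodes(caption, objects))        # Object Prediction targets
+print(mask_nodes(caption, attributes))     # Attribute Prediction targets
+print(mask_nodes(caption, relationships))  # Relationship Prediction targets
+```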
+
+
+Model Architecture of ERNIE-ViL
+
+
+## Pre-trained Models
+ERNIE-ViL is pre-trained on large-scale image-text aligned datasets. We provide models of two scale settings, pre-trained on [**Conceptual Captions**](https://www.aclweb.org/anthology/P18-1238.pdf) and [**SBU Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio) (a small download sketch follows the list below).
+
+- [**ERNIE-ViL _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-en.1.tar.gz) (_lowercased | 12-text-stream-layer, 6-visual-stream-layer_)
+- [**ERNIE-ViL _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)
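+
+If you prefer to script the download, a minimal sketch is shown below; any HTTP client or `wget` works equally well, and the target directory name is only an example.
+
+```python
+import tarfile
+
+try:                                  # Python 3
+    from urllib.request import urlretrieve
+except ImportError:                   # Python 2.7, the version this repo targets
+    from urllib import urlretrieve
+
+url = "https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-en.1.tar.gz"
+archive = "model-ernie-vil-base-en.1.tar.gz"
+urlretrieve(url, archive)                        # download ERNIE-ViL base
+with tarfile.open(archive, "r:gz") as tar:
+    tar.extractall("pretrained-ernie-vil-base")  # extraction path is illustrative
+```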
+
+## Downstream tasks
+We fine-tune ERNIE-ViL on five vision-language downstream tasks, i.e., Visual Commonsense Reasoning ([**VCR**](https://openaccess.thecvf.com/content_CVPR_2019/papers/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.pdf)),
+Visual Question Answering ([**VQA**](https://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf)),
+Cross-modal Image Retrieval ([**IR**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166)),
+Cross-modal Text Retrieval ([**TR**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166)) and
+Region-to-Phrase Grounding ([**RefCOCO+**](https://www.aclweb.org/anthology/D14-1086.pdf)).
+
+_Code and pre-trained models for the VCR task are publicly available now; those for more downstream tasks are planned for release._
+
+### VCR
+ * Datasets
+    * The training, validation and testing data of the VCR task are provided by the [**VCR website**](https://visualcommonsense.com/download/).
+    * The organization of visual features follows [**ViLBERT**](https://github.com/jiasenlu/vilbert_beta); we directly use its data, which can be downloaded [here](https://github.com/jiasenlu/vilbert_beta/tree/master/data).
+    * Put all downloaded files under the directory `data/vcr` (a quick layout check is sketched at the end of this section).
+
+
+ * Task pre-training: We perform task pre-training (also known as task-specific pre-training) on the VCR task. The resulting models are:
+ * [**ERNIE-ViL-VCR-task-pretrain _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-VCR-task-pre-en.1.tar.gz)
+ * [**ERNIE-ViL-VCR-task-pretrain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-VCR-task-pre-en.1.tar.gz)
+ * Performance: Results of ERNIE-ViL on the VCR task, compared with the previous state-of-the-art pre-trained model [**VILLA**](https://arxiv.org/pdf/2006.06195.pdf):
+
+ | Models | Q->A | QA->R | Q->AR |
+ | :--------------------------------------| :---------------------------: | :----------------------------: | :-----------------------------: |
+ | VILLA (task-pretrain) _base_ | 75.54(76.4) | 78.78(79.1) | 59.75(60.6) |
+ | ERNIE-ViL (task-pretrain) _base_ | 76.37(77.0) | 79.65(80.3) | 61.24(62.1) |
+ | VILLA (task-pretrain) _large_ | 78.45(78.9) | 82.57(82.8) | 65.18(65.7) |
+ | ERNIE-ViL (task-pretrain) _large_ | 78.52(79.2) | 83.37(83.5) | 65.81(66.3) |
+
+ _Numerical results outside and inside parentheses represent the dev and test performance of the VCR task, respectively.
+ Test results are obtained from the [**VCR leaderboard**](https://visualcommonsense.com/leaderboard/)._
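+
+ Before fine-tuning, you can quickly check that the VCR data is in place. Below is a minimal sketch; the file names are taken from `conf/vcr/task_vcr.json`, so adjust the paths if you store the data elsewhere.
+
+ ```python
+ import os
+
+ expected = [
+     "train.jsonl", "val.jsonl", "test.jsonl",
+     "VCR_resnet101_faster_rcnn_genome_pickle2.lmdb",
+     "VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb",
+     "unisex_names_table.csv",
+ ]
+ for name in expected:
+     path = os.path.join("data/vcr", name)
+     print("%-55s %s" % (name, "OK" if os.path.exists(path) else "MISSING"))
+ ```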
+
+
+
+## Usage
+
+### Install PaddlePaddle
+
+This code has been tested with Paddle Fluid 1.8 and Python 2.7. The other dependencies of ERNIE-ViL are listed in `requirements.txt`; you can install them with
+ ```script
+ pip install -r requirements.txt
+ ```
+
+### Fine-tuning on ERNIE-ViL
+Please add the CUDA, cuDNN and NCCL2 library paths to LD_LIBRARY_PATH before fine-tuning. You can easily run fine-tuning through
+configuration files. For example, you can fine-tune the ERNIE-ViL model on the VCR task by
+```script
+ sh run_finetuning.sh vcr conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $pretrain_models
+```
+The files needed for fine-tuning, including the vocabulary, the model configuration file and the pre-trained parameters, are included in the download links above.
+Note that our fine-tuning experiments on VCR are carried out on 4 NVIDIA V100 (32GB) GPUs.
+If your GPU memory is not enough, you can reduce the batch size in the corresponding configuration file, e.g., `conf/vcr/model_conf_vcr`.
+
+
+
+### Inference
+
+ You can use the following commands to run inference with fine-tuned models. For example, for the different VCR sub-tasks (a sketch for post-processing the result file follows the commands):
+
+ **Task Q->A**
+
+ ```script
+ sh run_inference.sh vcr qa $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
+ ```
+ **Task QA->R**
+
+ ```script
+ sh run_inference.sh vcr qar $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
+ ```
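+
+ The result file written by `finetune.py` contains one line per example: the question id, the prediction, the label and the softmax scores (see `format_result` in `finetune.py`). A minimal sketch for recomputing accuracy from such a file is given below; the file name is illustrative, use whatever you passed as `$res_file`.
+
+ ```python
+ def accuracy_from_result_file(path):
+     """Recompute accuracy from a result file with lines: qid, pred, label, scores."""
+     correct, total = 0, 0
+     with open(path) as f:
+         for line in f:
+             qid, pred, label, scores = line.rstrip("\n").split("\t")
+             correct += int(pred == label)
+             total += 1
+     return float(correct) / total
+
+ print(accuracy_from_result_file("res_vcr_qa_val"))  # illustrative file name
+ ```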
+
+
+
+
+## Citation
+
+You can cite the paper as below:
+
+```
+@article{yu2020ernie,
+ title={ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph},
+ author={Yu, Fei and Tang, Jiji and Yin, Weichong and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
+ journal={arXiv preprint arXiv:2006.16934},
+ year={2020}
+}
+
+```
+
diff --git a/ernie-vil/README_zh.md b/ernie-vil/README_zh.md
new file mode 100644
index 0000000000000000000000000000000000000000..149fc6a93d71d078c62eef141091ce986420c97e
--- /dev/null
+++ b/ernie-vil/README_zh.md
@@ -0,0 +1,132 @@
+
+[English](./README.md) | 简体中文
+
+## _ERNIE-ViL_: Knowledge Enhanced Vision-Language Representations Through Scene Graph
+- [模型框架](#模型框架)
+- [预训练模型](#预训练模型)
+- [下游任务](#下游任务)
+ * [视觉推理](#视觉推理)
+- [使用说明](#使用说明)
+ * [安装飞桨](#安装飞桨)
+ * [运行微调](#运行微调)
+ * [预测](#预测)
+- [引用](#引用)
+
+关于算法的详细描述,请参见我们的论文
+
+>[_**ERNIE-ViL:Knowledge Enhanced Vision-Language Representations Through Scene Graph**_](https://arxiv.org/abs/2006.16934)
+>
+>Fei Yu\*, Jiji Tang\*, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
+>
+>Preprint June 2020
+>
+   
+
+
+---
+**ERNIE-ViL
+是面向视觉-语言任务的知识增强预训练框架**,首次在视觉-语言预训练中引入了结构化的知识。ERNIE-ViL利用场景图中的结构化知识,构建了**物体预测,属性预测,关系预测**三种预训练任务,精细地刻画了视觉-语言模态之间细粒度语义的对齐,从而获得了更好的视觉-语言联合表示。
+
+## 模型框架
+
+基于文本中解析出的场景图,ERNIE-ViL提出了三个多模态场景图预测任务:
+- **物体预测**:随机选取图中的一部分物体,然后对其在句子中对应的词进行掩码和预测;
+- **属性预测**:对于场景图中的属性-物体组合,随机选取一部分词对其中属性词进行掩码和预测;
+- **关系预测**:对于场景图中的物体-关系-物体三元组,对其中的关系词进行掩码和预测。
+
+
+
+ERNIE-ViL 场景图预训练任务结构
+
+## 预训练模型
+
+
+ERNIE-ViL使用大规模图文对齐数据集作为预训练数据,基于[**Conceptual
+Captions**](https://www.aclweb.org/anthology/P18-1238.pdf)和[**SBU
+Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio)数据集,训练和发布了两种参数规模的模型:
+
+- [**ERNIE-ViL _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-en.1.tar.gz) (_lowercased | 12-text-stream-layer, 6-visual-stream-layer_)
+- [**ERNIE-ViL _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)
+
+## 下游任务
+
+ERNIE-ViL在五个视觉语言下游任务进行了实验,包括[**视觉常识推理**](https://openaccess.thecvf.com/content_CVPR_2019/papers/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.pdf),
+[**视觉问答**](https://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf),
+[**跨模态图片检索**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166),
+[**跨模态文本检索**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166),
+[**引用式理解**](https://www.aclweb.org/anthology/D14-1086.pdf)。
+
+_当前仅开源视觉常识推理任务相关模型和代码,后续计划开源更多下游任务的模型和代码。_
+
+
+### **视觉常识推理**
+ * 数据集合
+ * 训练、验证和测试集合相关数据由[**视觉常识推理官网**](http://visualcommonsense.com/download/)提供;
+ * 视觉端特征的组织方式借鉴[**ViLBERT**](https://github.com/jiasenlu/vilbert_beta), 因此项目直接使用**ViLBERT**中的数据,数据[下载地址](https://github.com/jiasenlu/vilbert_beta/tree/master/data);
+ * 将所有获取的文件放在 data/vcr 目录下;
+
+
+ * 任务预训练: 在视觉推理任务中进行了任务预训练,预训练获得模型如下
+ * [**ERNIE-ViL-VCR-task-pretrain _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-VCR-task-pre-en.1.tar.gz)
+ * [**ERNIE-ViL-VCR-task-pretrain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-VCR-task-pre-en.1.tar.gz)
+ * 效果: ERNIE-ViL与之前最优预训练模型[**VILLA**](https://arxiv.org/pdf/2006.06195.pdf)在视觉常识推理任务上的效果对比如下:
+
+ | 模型 | Q->A | QA->R | Q->AR |
+ | :---------------------------------- | :---------------------------: | :----------------------------: | :---------------------------: |
+ | VILLA (task-pretrain) _base_ | 75.54(76.4) | 78.78(79.1) | 59.75(60.6) |
+ | ERNIE-ViL (task-pretrain) _base_ | 76.37(77.0) | 79.65(80.3) | 61.24(62.1) |
+ | VILLA (task-pretrain) _large_ | 78.45(78.9) | 82.57(82.8) | 65.18(65.7) |
+ | ERNIE-ViL (task-pretrain) _large_ | 78.52(79.2) | 83.37(83.5) | 65.81(66.3) |
+
+ _注:括号外表示验证集效果,括号内表示测试集效果,测试集效果由[VCR榜单](https://visualcommonsense.com/leaderboard/)提供。_
+
+
+## 使用说明
+
+### 安装飞桨
+
+ERNIE-ViL代码基于Paddle Fluid 1.8 和 Python 2.7, 依赖的其他模块也列举在 requirements.txt,可以通过下面的指令安装:
+ ```script
+ pip install -r requirements.txt
+ ```
+### 运行微调
+在运行 ERNIE-ViL 前,需要将 CUDA 、cuDNN 、NCCL2 的动态库路径添加到 LD_LIBRARY_PATH 。 我们把下游任务的参数配置文件放到了 conf/ ,可以简单地通过配置文件运行。 例如,您可以通过下面的指令在VCR上任务上进行微调:
+```script
+ sh run_finetuning.sh vcr conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $pretrain_models_params
+```
+前面提供的模型链接中包含了所有需要的文件, 包含词表文件,配置文件和预训练参数。VCR任务的微调实验是在 4 张32 GB 的英伟达V100 GPU上运行,如果您的GPU显存不够,可以考虑八张卡运行或者减小配置中的batch_size。
+_我们目前开放了预训练模型和VCR的任务代码,其他的下游任务可以参考任务自主尝试。_
+
+### 预测
+基于已经训练的模型,您可以通过下面的命令测试VCR的效果:
+
+ **Task Q->A**
+
+ ```script
+ sh run_inference.sh vcr qa $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
+ ```
+ **Task QA->R**
+
+ ```script
+ sh run_inference.sh vcr qar $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
+ ```
+
+
+ VCR的测试可以在一张32GB的英伟达V100 GPU上运行,测试的结果包含Q->A 任务、QA->R任务和Q->AR任务,其中Q->AR任务由前两个任务结果合并所得。
+
+
+
+## 引用
+
+可以按下面的格式引用我们的论文:
+
+```
+@article{yu2020ernie,
+ title={ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph},
+ author={Yu, Fei and Tang, Jiji and Yin, Weichong and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
+ journal={arXiv preprint arXiv:2006.16934},
+ year={2020}
+}
+
+```
+
diff --git a/ernie-vil/args/__init__.py b/ernie-vil/args/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/ernie-vil/args/finetune_args.py b/ernie-vil/args/finetune_args.py
new file mode 100644
index 0000000000000000000000000000000000000000..dd034c673bfe0bceee053f293eff9fc8fba36c15
--- /dev/null
+++ b/ernie-vil/args/finetune_args.py
@@ -0,0 +1,79 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" args defination and default value """
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import time
+import argparse
+
+from utils.args import ArgumentGroup, print_arguments
+
+# yapf: disable
+parser = argparse.ArgumentParser(__doc__)
+model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
+model_g.add_arg("ernie_config_path", str, "./config/ernie_config.json", "json file path for ernie model config.")
+model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
+model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
+model_g.add_arg("task_name", str, "vcr", "Task to finetune on ERNIE-ViL")
+
+train_g = ArgumentGroup(parser, "training", "training options.")
+train_g.add_arg("epoch", int, 100, "Number of epoches for training.")
+train_g.add_arg("learning_rate", float, 0.0001, "Learning rate used to train with warmup.")
+train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
+ "scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay', 'manual_warmup_decay'])
+train_g.add_arg("decay_steps", str, "", "learning rate decay steps, list with ;")
+train_g.add_arg("lr_decay_ratio", float, 0.1, "learning rate decay ratio, used with manual_warmup_decay")
+train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
+train_g.add_arg("num_train_steps", int, 1000000, "Total steps to perform pretraining.")
+train_g.add_arg("warmup_steps", int, 0, "Total steps to perform warmup when pretraining.")
+train_g.add_arg("save_steps", int, 100, "The steps interval to save checkpoints.")
+train_g.add_arg("validation_steps", int, 6000, "The steps interval to evaluate model performance.")
+train_g.add_arg("use_fuse", bool, False, "Whether to use fuse_allreduce_ops.")
+train_g.add_arg("nccl_comm_num", int, 1, "NCCL comm num.")
+train_g.add_arg("hierarchical_allreduce_inter_nranks", int, 8, "Hierarchical allreduce inter ranks.")
+train_g.add_arg("use_hierarchical_allreduce", bool, False, "Use hierarchical allreduce or not.")
+train_g.add_arg("use_gpu", bool, True, "Whether to gpu.")
+
+log_g = ArgumentGroup(parser, "logging", "logging related.")
+log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
+log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
+
+data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
+data_g.add_arg("result_file", str, "./res_tmp", "file to storage results")
+data_g.add_arg("lr_decay_dict_file", str, "", "learning rate decay files.")
+data_g.add_arg("train_filelist", str, "", "Path to training filelist.")
+data_g.add_arg("valid_filelist", str, "", "Path to valid filelist.")
+data_g.add_arg("test_filelist", str, "", "Path to test filelist.")
+data_g.add_arg("vocab_path", str, "./config/vocab.txt", "Vocabulary path.")
+data_g.add_arg("test_split", str, "val", "test of sub tasks, val or test")
+data_g.add_arg("max_seq_len", int, 128, "Number of words of the longest seqence.")
+data_g.add_arg("max_img_len", int, 100, "Number of image rois of the longest seqence.")
+data_g.add_arg("feature_size", int, 2048, "Number of roi feature size of image.")
+data_g.add_arg("fusion_method", str, "sum", "Number of roi feature size of image.")
+data_g.add_arg("batch_size", int, 16, "Total examples' number in batch for training. see also --in_tokens.")
+data_g.add_arg("task_group_json", str, "", "Path to task json")
+
+run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
+run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.")
+run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
+run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
+run_type_g.add_arg("do_train", bool, False, "Whether to perform evaluation on test data set.")
+run_type_g.add_arg("do_test", bool, False, "Whether to perform evaluation on test data set.")
+run_type_g.add_arg("output_file", str, "", "The output file to save model output.")
+# yapf: enable
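+
+
+if __name__ == "__main__":
+    # Illustrative only: print the default argument values when this module is
+    # run directly, mirroring how finetune.py consumes the parser via parse_args().
+    print_arguments(parser.parse_args())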
diff --git a/ernie-vil/batching/__init__.py b/ernie-vil/batching/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/ernie-vil/batching/finetune_batching.py b/ernie-vil/batching/finetune_batching.py
new file mode 100644
index 0000000000000000000000000000000000000000..c9527bfd2d5570e567bd171b250db44202ba2b11
--- /dev/null
+++ b/ernie-vil/batching/finetune_batching.py
@@ -0,0 +1,97 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" prepare data format for finetuning tasks """
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from six.moves import xrange
+
+
+def prepare_batch_data(batch_records, num_choice, pad_id, task_index, task_num):
+ """
+ prepare batch data for finetuning tasks
+ """
+ batch_input_ids = []
+ batch_input_pos = []
+ batch_seg_ids = []
+ batch_input_masks = []
+ num_sample = len(batch_records)
+ batch_lens = [record["input_lens"] for record in batch_records]
+ batch_labels = [record["target"] for record in batch_records]
+ binary_labels = np.zeros([num_choice * num_sample, 1], dtype='float32')
+ for i, l in enumerate(batch_labels):
+ binary_labels[i * num_choice + l] = 1.0
+ labels = np.array(batch_labels).astype("int64").reshape([-1, 1])
+ image_features = [record["features"] for record in batch_records]
+ image_boxes = [record["boxes"] for record in batch_records]
+ batch_anno_ids = np.array([record["anno_id"] for record in batch_records]).astype("int64").reshape([-1, 1])
+ max_len = max([max(lens) for lens in batch_lens])
+ for i in range(len(batch_records)):
+ batch_input_ids.append([inst + list([pad_id] * (max_len - len(inst))) \
+ for inst in batch_records[i]["input_ids"]])
+ batch_input_pos.append([inst + list([pad_id] * (max_len - len(inst))) \
+ for inst in batch_records[i]["input_pos"]])
+ batch_seg_ids.append([inst + list([pad_id] * (max_len - len(inst))) \
+ for inst in batch_records[i]["segment_ids"]])
+ batch_input_masks.append([[1] * len(inst) + [0] * (max_len - len(inst)) \
+ for inst in batch_records[i]["input_ids"]])
+
+ image_embedding, image_mask = pad_feature_data(image_features, return_mask=True)
+ image_loc = pad_feature_data(image_boxes)
+ src_ids = np.array(batch_input_ids).astype("int64").reshape([num_choice * num_sample, max_len, 1])
+ src_pos = np.array(batch_input_pos).astype("int64").reshape([num_choice * num_sample, max_len, 1])
+ src_seg = np.array(batch_seg_ids).astype("int64").reshape([num_choice * num_sample, max_len, 1])
+ src_masks = np.array(batch_input_masks).astype("float32").reshape([num_choice * num_sample, max_len, 1])
+ src_task = np.zeros(src_ids.shape, dtype="int64")
+ batch, seq_len, fea_len = image_embedding.shape
+ image_embedding = np.tile(np.expand_dims(image_embedding, axis=1), \
+ (1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, fea_len])
+ image_mask = np.tile(np.expand_dims(image_mask, axis=1), \
+ (1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, 1])
+ image_loc = np.tile(np.expand_dims(image_loc, axis=1), \
+ (1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, 5])
+ return_list = [src_ids, src_pos, src_seg, src_task, src_masks, \
+ image_embedding, image_loc, image_mask, labels, batch_anno_ids]
+ return_list.append(np.array([task_index]).astype('int64'))
+ return_list.append(binary_labels)
+ for i in xrange(task_num):
+ if i == task_index:
+ return_list.append(np.array([1.0]).astype("float32"))
+ else:
+ return_list.append(np.array([0.0]).astype("float32"))
+ return return_list
+
+
+def pad_feature_data(data, pad_value=0.0, dtype="float32", return_mask=False):
+ """
+ pad visual features with given pad value
+ """
+    max_len = max([len(item) for item in data])
+    data_width = len(data[0][0])
+    out_data = np.ones((len(data), max_len, data_width), dtype=dtype) * pad_value
+    out_mask = np.zeros((len(data), max_len, 1), dtype=dtype)
+    for i in range(len(data)):
+        out_data[i, 0:len(data[i]), :] = data[i]
+        if return_mask:
+            out_mask[i, 0:len(data[i]), :] = 1.0
+ if return_mask:
+ return out_data, out_mask
+ else:
+ return out_data
+
+if __name__ == "__main__":
+ pass
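+
+    # Minimal smoke test (illustrative only): pad two fake RoI feature sequences
+    # of different lengths and check the padded shapes and the returned mask.
+    feats = [np.random.rand(3, 4), np.random.rand(5, 4)]
+    padded, mask = pad_feature_data(feats, return_mask=True)
+    print(padded.shape, mask.shape)  # (2, 5, 4) (2, 5, 1)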
diff --git a/ernie-vil/conf/vcr/model_conf_vcr b/ernie-vil/conf/vcr/model_conf_vcr
new file mode 100644
index 0000000000000000000000000000000000000000..d683cbff17d285ed369ed39a432cd1e8eb920885
--- /dev/null
+++ b/ernie-vil/conf/vcr/model_conf_vcr
@@ -0,0 +1,12 @@
+output_model_path="output_vcr"
+lr_scheduler="manual_warmup_decay"
+decay_steps="13308;19962"
+lr_decay_ratio=0.1
+num_train_steps=26640
+SAVE_STEPS=6660
+WARMUP_STEPS=6654
+BATCH_SIZE=64
+VALID_STEPS=20000
+LR_RATE=2e-5
+WEIGHT_DECAY=0.01
+MAX_LEN=80
diff --git a/ernie-vil/conf/vcr/task_vcr.json b/ernie-vil/conf/vcr/task_vcr.json
new file mode 100644
index 0000000000000000000000000000000000000000..9ac9d56d24f05591f29456b8ac50cf603faabcd4
--- /dev/null
+++ b/ernie-vil/conf/vcr/task_vcr.json
@@ -0,0 +1,42 @@
+[
+{
+"task": "VCR_Q-A",
+"num_choice": 4,
+"annotations_jsonpath_train": "./data/vcr/train.jsonl",
+"annotations_jsonpath_val": "./data/vcr/val.jsonl",
+"annotations_jsonpath_test": "./data/vcr/test.jsonl",
+"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"unisex_names_table" : "./data/vcr/unisex_names_table.csv",
+"Proprocessor": "PreprocessorBasic",
+"tokenizer_name" : "FullTokenizer",
+"fusion_method" : "mul",
+"dropout_rate" : 0.1,
+"max_seq_len" : 60,
+"use_gt_fea" : true,
+"shufflekeep_across_task": true,
+"shuffle_every_epoch": true,
+"task_weight": 1.0,
+"task_prefix": "vcr_qa"
+},
+{
+"task": "VCR_QA-R",
+"num_choice": 4,
+"annotations_jsonpath_train": "./data/vcr/train.jsonl",
+"annotations_jsonpath_val": "./data/vcr/val.jsonl",
+"annotations_jsonpath_test": "./data/vcr/test.jsonl",
+"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"unisex_names_table" : "./data/vcr/unisex_names_table.csv",
+"Proprocessor": "PreprocessorBasic",
+"tokenizer_name" : "FullTokenizer",
+"fusion_method" : "mul",
+"dropout_rate" : 0.1,
+"max_seq_len" : 80,
+"use_gt_fea" : true,
+"shufflekeep_across_task": true,
+"shuffle_every_epoch" : true,
+"task_weight": 1.0,
+"task_prefix": "vcr_qar"
+}
+]
diff --git a/ernie-vil/conf/vcr/task_vcr_qa.json b/ernie-vil/conf/vcr/task_vcr_qa.json
new file mode 100644
index 0000000000000000000000000000000000000000..c2b4afa714046a94e8f7720506721cc6edde5894
--- /dev/null
+++ b/ernie-vil/conf/vcr/task_vcr_qa.json
@@ -0,0 +1,21 @@
+[
+{
+"task": "VCR_Q-A",
+"num_choice": 4,
+"annotations_jsonpath_train": "./data/vcr/train.jsonl",
+"annotations_jsonpath_val": "./data/vcr/val.jsonl",
+"annotations_jsonpath_test": "./data/vcr/test.jsonl",
+"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"unisex_names_table" : "./data/vcr/unisex_names_table.csv",
+"Proprocessor": "PreprocessorBasic",
+"tokenizer_name" : "FullTokenizer",
+"tagger_path" : "./script/ntc.pickle",
+"nltk_data_path" : "./nltk_data",
+"fusion_method" : "mul",
+"dropout_rate" : 0.1,
+"max_seq_len" : 60,
+"use_gt_fea" : true,
+"task_prefix" : "vcr_qa"
+}
+]
diff --git a/ernie-vil/conf/vcr/task_vcr_qar.json b/ernie-vil/conf/vcr/task_vcr_qar.json
new file mode 100644
index 0000000000000000000000000000000000000000..8f4c88021f2666ce1779ea47ffd6014e67fb91ad
--- /dev/null
+++ b/ernie-vil/conf/vcr/task_vcr_qar.json
@@ -0,0 +1,22 @@
+[
+{
+"task": "VCR_QA-R",
+"num_choice": 4,
+"annotations_jsonpath_train": "./data/vcr/train.jsonl",
+"annotations_jsonpath_val": "./data/vcr/val.jsonl",
+"annotations_jsonpath_test": "./data/vcr/test.jsonl",
+"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"unisex_names_table" : "./data/vcr/unisex_names_table.csv",
+"Proprocessor": "PreprocessorBasic",
+"tokenizer_name" : "FullTokenizer",
+"vocab_path" : "./package/vocab.txt",
+"tagger_path" : "./script/ntc.pickle",
+"nltk_data_path" : "./nltk_data",
+"fusion_method" : "mul",
+"dropout_rate" : 0.1,
+"max_seq_len" : 80,
+"use_gt_fea" : true,
+"task_prefix" : "vcr_qa"
+}
+]
diff --git a/ernie-vil/finetune.py b/ernie-vil/finetune.py
new file mode 100755
index 0000000000000000000000000000000000000000..dbee99a0a5d6096a3ae9e902f3137f227470e2af
--- /dev/null
+++ b/ernie-vil/finetune.py
@@ -0,0 +1,465 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" finetuning vison-language task """
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import sys
+import time
+import datetime
+import argparse
+import numpy as np
+import multiprocessing
+import json
+
+from reader.vcr_finetuning import VCRDataJointReader
+from model.ernie_vil import ErnieVilModel, ErnieVilConfig
+from optim.optimization import optimization
+from utils.args import print_arguments
+from utils.init import init_checkpoint, init_pretraining_params
+from args.finetune_args import parser
+
+import paddle.fluid as fluid
+
+args = parser.parse_args()
+
+# yapf: enable.
+
+#READERS = {"vcr": VCRDataJointReader, "vqa": VQADataReader, "refcoco+": RefcocoReader, "flickr": FlickrReader}
+READERS = {"vcr": VCRDataJointReader}
+
+def format_result(res_arr, qids, pred, labels, scores):
+ """
+    format batch results into tab-separated lines: qid, prediction, label, softmax scores
+ """
+ for i in range(len(qids)):
+ res="\t".join([str(qids[i]), str(pred[i]), str(labels[i]), " ".join(["%.5f" % s for s in scores[i]])])
+ res_arr.append(res)
+ return res_arr
+
+
+def create_vcr_model(pyreader_name, ernie_config, task_group, is_prediction=False):
+ """
+ create model arc for vcr tasks
+ """
+ shapes = [[-1, args.max_seq_len, 1], #src_id
+ [-1, args.max_seq_len, 1], #pos_id
+ [-1, args.max_seq_len, 1], #sent_id
+ [-1, args.max_seq_len, 1], #task_id
+ [-1, args.max_seq_len, 1], #input_mask
+ [-1, args.max_img_len, args.feature_size], #image_embedding
+ [-1, args.max_img_len, 5], #image_loc
+ [-1, args.max_img_len, 1], #image_mask
+ [-1, 1], #labels
+ [-1, 1], #qids
+ [], #task_index
+ [-1, 1], #binary_labels
+ ]
+ dtypes = ['int64', 'int64', 'int64', 'int64', 'float32', 'float32', 'float32', 'float32',
+ 'int64', 'int64', 'int64', 'float32']
+ lod_levels = [0] * len(dtypes)
+
+ for _ in task_group:
+ shapes.append([])
+ dtypes.append('float')
+ lod_levels.append(0)
+
+ pyreader = fluid.layers.py_reader(
+ capacity=30,
+ shapes=shapes,
+ dtypes=dtypes,
+ lod_levels=lod_levels,
+ name=pyreader_name,
+ use_double_buffer=False)
+
+ inputs = fluid.layers.read_file(pyreader)
+ src_ids, pos_ids, sent_ids, task_ids, input_mask, image_embeddings, \
+ image_loc, image_mask, labels, q_ids, task_index, binary_labels = inputs[: 12]
+
+ ernie_vil = ErnieVilModel(
+ src_ids=src_ids,
+ position_ids=pos_ids,
+ sentence_ids=sent_ids,
+ task_ids=task_ids,
+ input_mask=input_mask,
+ image_embeddings=image_embeddings,
+ image_loc=image_loc,
+ input_image_mask=image_mask,
+ config=ernie_config
+ )
+
+ h_cls, h_img = ernie_vil.get_pooled_output()
+ task_conf = task_group[0]
+ fusion_method = task_conf["fusion_method"]
+ fusion_fea = ernie_vil.get_match_score(text=h_cls, image=h_img, \
+ dropout_rate=task_conf["dropout_rate"],
+ mode=fusion_method)
+ if is_prediction:
+ num_choice = int(task_conf['num_choice'])
+ task_name = task_conf.get('task_prefix', 'vcr')
+ score = fluid.layers.fc(fusion_fea, 1,
+ param_attr = fluid.ParamAttr(name = task_name + "_fc.w_0",
+ initializer = fluid.initializer.TruncatedNormal(scale = 0.02)),
+ bias_attr = task_name + "_fc.b_0")
+ score = fluid.layers.reshape(score, shape = [-1, num_choice])
+ _loss, _softmax = fluid.layers.softmax_with_cross_entropy(logits = score,
+ label = labels, return_softmax = True)
+ _acc = fluid.layers.accuracy(input = _softmax, label = labels)
+ pred = fluid.layers.argmax(score, axis = 1)
+ mean_loss = fluid.layers.mean(_loss)
+ task_vars = [mean_loss, _acc, pred, q_ids, labels, _softmax]
+ for var in task_vars:
+ var.persistable = True
+ return pyreader, task_vars
+ else:
+ start_ind = 12
+ mean_loss = fluid.layers.zeros(shape = [1], dtype = 'float32')
+ mean_acc = fluid.layers.zeros(shape = [1], dtype = 'float32')
+ for task_conf in task_group:
+ task_weight = inputs[start_ind]
+ start_ind += 1
+ num_choice = int(task_conf['num_choice'])
+ task_name = task_conf.get('task_prefix', 'vcr')
+ score = fluid.layers.fc(fusion_fea, 1,
+ param_attr = fluid.ParamAttr(name = task_name + "_fc.w_0",
+ initializer = fluid.initializer.TruncatedNormal(scale = 0.02)),
+ bias_attr = task_name + "_fc.b_0")
+
+ _loss = fluid.layers.sigmoid_cross_entropy_with_logits(score,
+ binary_labels, name = "cross_entropy_loss")
+ tmp_score = fluid.layers.reshape(score, shape = [-1, num_choice])
+ _softmax = fluid.layers.softmax(tmp_score)
+ _acc = fluid.layers.accuracy(input = _softmax, label = labels)
+ _mean_loss = fluid.layers.mean(_loss)
+ mean_loss += _mean_loss * task_weight
+ mean_acc += _acc * task_weight
+ task_vars = [fluid.layers.reduce_mean(mean_loss), mean_acc]
+ for var in task_vars:
+ var.persistable = True
+
+ return pyreader, task_vars
+
+
+#MODELS = {"vcr": create_vcr_model, "vqa": create_vqa_model, "refcoco+": create_refcoco_model}
+MODELS = {"vcr": create_vcr_model}
+
+def predict_wrapper(args,
+ exe,
+ ernie_config,
+ task_group,
+ test_prog=None,
+ pyreader=None,
+ graph_vars=None):
+ """Context to do validation.
+ """
+ reader_name = READERS[args.task_name]
+ data_reader = reader_name(
+ task_group,
+ split=args.test_split,
+ vocab_path=args.vocab_path,
+ is_test=True,
+ shuffle=False,
+ batch_size=args.batch_size,
+ epoch=args.epoch)
+ if args.do_test:
+ assert args.init_checkpoint is not None, "[FATAL] Please use --init_checkpoint '/path/to/checkpoints' \
+            to specify your pretrained model checkpoints"
+
+ init_pretraining_params(exe, args.init_checkpoint, test_prog)
+ print(("testing on %s %s split") % (args.task_name, args.test_split))
+
+ def predict(exe=exe, pyreader=pyreader):
+ """
+ inference for downstream tasks
+ """
+ pyreader.decorate_tensor_provider(data_reader.data_generator())
+ pyreader.start()
+
+ cost = 0
+ appear_step = 0
+ task_acc = {}
+ task_steps = {}
+ steps = 0
+ case_f1 = 0
+ appear_f1 = 0
+ time_begin = time.time()
+ task_name_list = [v.name for v in graph_vars]
+ fetch_list = task_name_list
+
+ print('task name list : ', task_name_list)
+ sum_acc = 0
+ res_arr = []
+ while True:
+ try:
+ outputs = exe.run(fetch_list=fetch_list, program=test_prog)
+ each_acc = outputs[1][0]
+ preds = np.reshape(outputs[2], [-1])
+ qids = np.reshape(outputs[3], [-1])
+ labels = np.reshape(outputs[4], [-1])
+ scores = np.reshape(outputs[5], [-1, 4])
+ sum_acc += each_acc
+ steps += 1
+ if steps % 10 == 0:
+ print('cur_step:', steps, 'cur_acc:', sum_acc / steps)
+ format_result(res_arr, qids.tolist(), preds.tolist(), labels.tolist(), scores.tolist())
+ except fluid.core.EOFException:
+ pyreader.reset()
+ break
+
+ used_time = time.time() - time_begin
+
+ with open(args.result_file, "w") as f:
+ for r in res_arr:
+ f.write(r + "\n")
+
+ print("average_acc:", sum_acc / steps)
+ ret = {}
+ ret["acc"] = "acc: %f" % (sum_acc / steps)
+ for item in ret:
+ try:
+ ret[item] = ret[item].split(':')[-1]
+ except:
+ pass
+ return ret
+ return predict
+
+
+def get_optimizer(total_loss, train_program, startup_prog, args):
+ """
+ optimization func
+ """
+ decay_steps_str=args.decay_steps
+ if decay_steps_str == "":
+ decay_steps = []
+ else:
+ decay_steps = [int(s) for s in decay_steps_str.split(";")]
+ scheduled_lr = optimization(
+ loss=total_loss,
+ warmup_steps=args.warmup_steps,
+ num_train_steps=args.num_train_steps,
+ learning_rate=args.learning_rate,
+ train_program=train_program,
+ startup_prog=startup_prog,
+ weight_decay=args.weight_decay,
+ scheduler=args.lr_scheduler,
+ decay_steps=decay_steps,
+ lr_decay_ratio=args.lr_decay_ratio)
+ return scheduled_lr
+
+
+def main(args):
+ """
+ Main func for downstream tasks
+ """
+ print("finetuning tasks start")
+ ernie_config = ErnieVilConfig(args.ernie_config_path)
+ ernie_config.print_config()
+
+ with open(args.task_group_json) as f:
+ task_group = json.load(f)
+ print('task: ', task_group)
+
+ startup_prog = fluid.Program()
+ if args.do_train and args.do_test:
+ print("can not set both do_train and do_test as True")
+ return
+
+ model_name = MODELS[args.task_name]
+ if args.do_train:
+ train_program = fluid.Program()
+ with fluid.program_guard(train_program, startup_prog):
+ with fluid.unique_name.guard():
+ train_pyreader, model_outputs = model_name(
+ pyreader_name='train_reader', ernie_config=ernie_config, task_group=task_group)
+
+ total_loss = model_outputs[0]
+ scheduled_lr = get_optimizer(total_loss, train_program, startup_prog, args)
+ if args.do_test:
+ test_prog = fluid.Program()
+ with fluid.program_guard(test_prog, startup_prog):
+ with fluid.unique_name.guard():
+ test_pyreader, model_outputs = model_name(
+ pyreader_name='test_reader', ernie_config=ernie_config, task_group=task_group, is_prediction=True)
+ total_loss = model_outputs[0]
+
+ test_prog = test_prog.clone(for_test=True)
+
+ if args.use_gpu:
+ gpu_id = 0
+ if os.getenv("FLAGS_selected_gpus"):
+ gpu_id = int(os.getenv("FLAGS_selected_gpus"))
+ place = fluid.CUDAPlace(gpu_id) if args.use_gpu else fluid.CPUPlace()
+
+ print("theoretical memory usage: ")
+ if args.do_train:
+ print(fluid.contrib.memory_usage(
+ program=train_program, batch_size=args.batch_size))
+ if args.do_test:
+ print(fluid.contrib.memory_usage(
+ program=test_prog, batch_size=args.batch_size))
+
+ nccl2_num_trainers = 1
+ nccl2_trainer_id = 0
+ print("args.is_distributed:", args.is_distributed)
+ trainer_id = 0
+ if args.is_distributed:
+ trainer_id = int(os.getenv("PADDLE_TRAINER_ID"))
+ worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
+ current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
+ worker_endpoints = worker_endpoints_env.split(",")
+ trainers_num = len(worker_endpoints)
+
+ print("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
+ trainer_id:{}".format(worker_endpoints, trainers_num,
+ current_endpoint, trainer_id))
+
+ # prepare nccl2 env.
+ config = fluid.DistributeTranspilerConfig()
+ config.mode = "nccl2"
+ if args.nccl_comm_num > 1:
+ config.nccl_comm_num = args.nccl_comm_num
+ if args.use_hierarchical_allreduce and trainers_num > args.hierarchical_allreduce_inter_nranks:
+ config.use_hierarchical_allreduce=args.use_hierarchical_allreduce
+ config.hierarchical_allreduce_inter_nranks=args.hierarchical_allreduce_inter_nranks
+
+ assert config.hierarchical_allreduce_inter_nranks > 1
+ assert trainers_num % config.hierarchical_allreduce_inter_nranks == 0
+
+ config.hierarchical_allreduce_exter_nranks = \
+ trainers_num / config.hierarchical_allreduce_inter_nranks
+
+ t = fluid.DistributeTranspiler(config=config)
+ t.transpile(
+ trainer_id,
+ trainers=worker_endpoints_env,
+ current_endpoint=current_endpoint,
+ program=train_program,
+ startup_program=startup_prog)
+
+ nccl2_num_trainers = trainers_num
+ nccl2_trainer_id = trainer_id
+
+ exe = fluid.Executor(place)
+ exe.run(startup_prog)
+
+ if args.do_train:
+ if args.init_checkpoint and args.init_checkpoint != "":
+ sys.stderr.write('############################WARNING############################')
+ sys.stderr.write('####### using init_pretraining_params, not init_checkpoint ####')
+ sys.stderr.write('## meaning hyper param e.g. lr won\'t inherit from checkpoint##')
+ sys.stderr.write('###############################################################')
+ init_pretraining_params(exe, args.init_checkpoint, train_program)
+
+ reader_name=READERS[args.task_name]
+ data_reader = reader_name(
+ task_group,
+ split="train",
+ vocab_path=args.vocab_path,
+ batch_size=args.batch_size,
+ epoch=args.epoch,)
+
+ exec_strategy = fluid.ExecutionStrategy()
+ if args.use_fast_executor:
+ exec_strategy.use_experimental_executor = True
+ exec_strategy.num_threads = 2
+
+ exec_strategy.num_iteration_per_drop_scope = min(10, args.skip_steps)
+
+ build_strategy = fluid.compiler.BuildStrategy()
+ build_strategy.fuse_all_reduce_ops = False
+
+ if args.use_fuse:
+ build_strategy.fuse_all_reduce_ops = True
+
+ if args.do_train:
+ train_exe = fluid.ParallelExecutor(
+ use_cuda=args.use_cuda,
+ loss_name=total_loss.name,
+ build_strategy=build_strategy,
+ exec_strategy=exec_strategy,
+ main_program=train_program,
+ num_trainers=nccl2_num_trainers,
+ trainer_id=nccl2_trainer_id)
+
+ if args.do_test:
+ predict = predict_wrapper(
+ args,
+ exe,
+ ernie_config,
+ task_group,
+ test_prog=test_prog,
+ pyreader=test_pyreader,
+ graph_vars=model_outputs)
+ result = predict()
+
+ if args.do_train:
+ train_pyreader.decorate_tensor_provider(data_reader.data_generator())
+ train_pyreader.start()
+ steps = 0
+ time_begin = time.time()
+ node_nums = 1 #int(os.getenv("PADDLE_NODES_NUM"))
+ used_time_all = 0
+ while steps < args.num_train_steps:
+ try:
+ steps += node_nums
+ skip_steps = args.skip_steps * node_nums
+ fetch_list = []
+ if nccl2_trainer_id == 0 and steps % skip_steps == 0:
+ task_name_list = [v.name for v in model_outputs]
+ fetch_list = task_name_list
+ fetch_list.append(scheduled_lr.name)
+
+ time_begin = time.time()
+ outputs = train_exe.run(fetch_list=fetch_list)
+ if outputs:
+ print("feed_queue size", train_pyreader.queue.size())
+ progress_file = data_reader.get_progress()
+ epoch = progress_file["current_epoch"]
+ current_file_index = progress_file["current_file_index"]
+ total_file = progress_file["total_file"]
+ current_file = progress_file["current_file"]
+ print(
+ "epoch: %d, progress: %d/%d, step: %d, loss: %f, "
+ "acc: %f"
+ % (epoch, current_file_index, total_file, steps,
+ outputs[0][0],
+ outputs[1][0]))
+ print("steps:", steps)
+ print("save_steps:", args.save_steps)
+
+ np_lr = outputs[-1:]
+
+ date_str = datetime.datetime.now().strftime("%Y%m%d %H:%M:%S")
+
+ np_lr = float(np.mean(np_lr[0]))
+ print("%s current learning_rate:%.8f" % (date_str, np_lr))
+
+ if steps % args.save_steps == 0:
+ save_path = os.path.join(args.checkpoints, "step_" + str(steps))
+ print("save_path:", save_path)
+ fluid.io.save_persistables(exe, save_path, train_program)
+ time_end = time.time()
+ used_time = time_end - time_begin
+ time_end = time_begin
+ print("used_time:", used_time)
+ except fluid.core.EOFException:
+ train_pyreader.reset()
+ break
+
+
+if __name__ == '__main__':
+ print_arguments(args)
+ main(args)
+
diff --git a/ernie-vil/model/__init__.py b/ernie-vil/model/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/ernie-vil/model/ernie_vil.py b/ernie-vil/model/ernie_vil.py
new file mode 100644
index 0000000000000000000000000000000000000000..13b53097898e4c01416f12105dd0421ed72bd5e1
--- /dev/null
+++ b/ernie-vil/model/ernie_vil.py
@@ -0,0 +1,288 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""ERNIE-ViL model"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import json
+
+import six
+import paddle.fluid as fluid
+
+from model.vl_transformer_encoder import encoder, pre_process_layer
+
+
+class ErnieVilConfig(object):
+ """
+ configuration for ernie-vil
+ """
+ def __init__(self, config_path):
+ self._config_dict = self._parse(config_path)
+
+ def _parse(self, config_path):
+ try:
+ with open(config_path) as json_file:
+ config_dict = json.load(json_file)
+ except Exception:
+ raise IOError("Error in parsing Ernie model config file '%s'" %
+ config_path)
+ else:
+ return config_dict
+
+ def __getitem__(self, key):
+ return self._config_dict[key]
+
+ def print_config(self):
+ """
+ print configuration value
+ """
+ for arg, value in sorted(six.iteritems(self._config_dict)):
+ print('%s: %s' % (arg, value))
+ print('------------------------------------------------')
+
+
+class ErnieVilModel(object):
+ """
+ main class for ERNIE-ViL model
+ """
+ def __init__(self,
+ src_ids,
+ position_ids,
+ sentence_ids,
+ task_ids,
+ input_mask,
+ image_embeddings,
+ image_loc,
+ input_image_mask,
+ config,
+ predict_feature=False,
+ predict_class=True,
+ use_attr=False,
+ use_soft_label=True):
+
+ self._emb_size = config['hidden_size']
+ self._n_layer = config['num_hidden_layers']
+ self._n_head = config['num_attention_heads']
+
+ self._v_head = config['v_num_attention_heads']
+ self._v_emb_size = config['v_hidden_size']
+ self._v_inter_hid = config['v_intermediate_size']
+
+ self._co_head = config['co_num_attention_heads']
+ self._co_emb_size = config['co_hidden_size']
+ self._co_inter_hid = config['co_intermediate_size']
+
+ self._voc_size = config['vocab_size']
+ self._class_size = config['class_size']
+ self._class_attr_size = config['class_attr_size']
+ self._max_position_seq_len = config['max_position_embeddings']
+ self._sent_types = config['sent_type_vocab_size']
+ self._task_types = config['task_type_vocab_size']
+ self._hidden_act = config['hidden_act']
+ self._prepostprocess_dropout = config['hidden_dropout_prob']
+ self._attention_dropout = config['attention_probs_dropout_prob']
+ self._v_biattention_id = config['v_biattention_id']
+ self._t_biattention_id = config['t_biattention_id']
+
+ self._predict_feature = predict_feature
+ self._predict_class = predict_class
+ self._use_attr = use_attr
+ self._use_soft_label = use_soft_label
+ self._word_emb_name = "word_embedding"
+ self._pos_emb_name = "pos_embedding"
+ self._sent_emb_name = "sent_embedding"
+ self._image_emb_name = "image_embedding"
+ self._loc_emb_name = "loc_embedding"
+ self._dtype = "float32"
+ self._emb_dtype = "float32"
+
+        # Initialize all weights by the truncated normal initializer, and all biases
+ # will be initialized by constant zero by default.
+ self._param_initializer = fluid.initializer.TruncatedNormal(
+ scale=config['initializer_range'])
+
+ self._build_model(src_ids, position_ids, sentence_ids, task_ids, input_mask, \
+ image_embeddings, image_loc, input_image_mask)
+
+ def _build_model(self, src_ids, position_ids, sentence_ids, task_ids, input_mask, \
+ image_embeddings, image_loc, input_image_mask):
+ # padding id in vocabulary must be set to 0
+ emb_out = fluid.layers.embedding(
+ input=src_ids,
+ size=[self._voc_size, self._emb_size],
+ dtype=self._emb_dtype,
+ param_attr=fluid.ParamAttr(
+ name=self._word_emb_name, initializer=self._param_initializer),
+ is_sparse=False)
+
+ position_emb_out = fluid.layers.embedding(
+ input=position_ids,
+ size=[self._max_position_seq_len, self._emb_size],
+ dtype=self._emb_dtype,
+ param_attr=fluid.ParamAttr(
+ name=self._pos_emb_name, initializer=self._param_initializer))
+
+ sent_emb_out = fluid.layers.embedding(
+ sentence_ids,
+ size=[self._sent_types, self._emb_size],
+ dtype=self._emb_dtype,
+ param_attr=fluid.ParamAttr(
+ name=self._sent_emb_name, initializer=self._param_initializer))
+
+ emb_out = emb_out + position_emb_out
+ emb_out = emb_out + sent_emb_out
+
+ emb_out = pre_process_layer(
+ emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
+
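+        # Build an additive attention bias from the padding mask: valid-valid
+        # positions get 0 and any position involving padding gets -10000, which
+        # is added to the attention logits before the softmax.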
+ self_attn_mask = fluid.layers.matmul(
+ x=input_mask, y=input_mask, transpose_y=True)
+
+ self_attn_mask = fluid.layers.scale(
+ x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
+ n_head_self_attn_mask = fluid.layers.stack(
+ x=[self_attn_mask] * self._n_head, axis=1)
+ n_head_self_attn_mask.stop_gradient = True
+
+ image_embeddings = fluid.layers.fc(image_embeddings,
+ self._v_emb_size,
+ param_attr=fluid.ParamAttr(
+ name="image_emb.w_0",
+ initializer=self._param_initializer),
+ bias_attr = "image_emb.b_0",
+ num_flatten_dims = 2)
+ loc_emb_out = fluid.layers.fc(image_loc,
+ self._v_emb_size,
+ param_attr=fluid.ParamAttr(
+ name="image_loc.w_0",
+ initializer=self._param_initializer),
+ bias_attr = "image_loc.b_0",
+ num_flatten_dims = 2)
+
+ emb_vl_out = image_embeddings + loc_emb_out
+ emb_vl_out = pre_process_layer(
+ emb_vl_out, 'nd', self._prepostprocess_dropout, name='vl_pre_encoder')
+
+ self_attn_image_mask = fluid.layers.matmul(
+ x=input_image_mask, y=input_image_mask, transpose_y=True)
+
+ self_attn_image_mask = fluid.layers.scale(
+ x=self_attn_image_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
+ n_head_self_attn_image_mask = fluid.layers.stack(
+ x=[self_attn_image_mask] * self._v_head, axis=1)
+ n_head_self_attn_image_mask.stop_gradient = True
+
+ self_attn_vl_mask = fluid.layers.matmul(
+ x=input_image_mask, y=input_mask, transpose_y=True)
+ self_attn_vl_mask = fluid.layers.scale(
+ x=self_attn_vl_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
+ n_head_self_attn_vl_mask = fluid.layers.stack(
+ x=[self_attn_vl_mask] * self._co_head, axis=1)
+ n_head_self_attn_vl_mask.stop_gradient = True
+
+ self._enc_out, self._enc_vl_out = encoder(
+ enc_input=emb_out,
+ enc_vl_input=emb_vl_out,
+ attn_bias=n_head_self_attn_mask,
+ attn_image_bias=n_head_self_attn_image_mask,
+ attn_vl_bias=n_head_self_attn_vl_mask,
+ n_layer=self._n_layer,
+ n_head=self._n_head,
+ d_key=self._emb_size // self._n_head,
+ d_value=self._emb_size // self._n_head,
+ d_model=self._emb_size,
+ d_inner_hid=self._emb_size * 4,
+ v_head=self._v_head,
+ v_key=self._v_emb_size // self._v_head,
+ v_value=self._v_emb_size // self._v_head,
+ v_model=self._v_emb_size,
+ v_inner_hid=self._v_inter_hid,
+ co_head=self._co_head,
+ co_key=self._co_emb_size // self._co_head,
+ co_value=self._co_emb_size // self._co_head,
+ co_model=self._co_emb_size,
+ co_inner_hid=self._co_inter_hid,
+ prepostprocess_dropout=self._prepostprocess_dropout,
+ attention_dropout=self._attention_dropout,
+ relu_dropout=0,
+ hidden_act=self._hidden_act,
+ preprocess_cmd="",
+ postprocess_cmd="dan",
+ param_initializer=self._param_initializer,
+ v_biattention_id = self._v_biattention_id,
+ t_biattention_id = self._t_biattention_id,
+ name='encoder')
+
+ def get_sequence_output(self):
+ """
+ Return sequence output of all text and img tokens
+ """
+ return self._enc_out, self._enc_vl_out
+
+ def get_pooled_output(self):
+ """
+ Get the first feature of each sequence for classification
+ """
+ text_cls_feat = fluid.layers.slice(
+ input=self._enc_out, axes=[1], starts=[0], ends=[1])
+
+ text_cls_feat = fluid.layers.cast(
+ x=text_cls_feat, dtype=self._emb_dtype)
+
+ text_cls_feat = fluid.layers.fc(
+ input=text_cls_feat,
+ size=self._co_emb_size,
+ act="relu",
+ param_attr=fluid.ParamAttr(
+ name="pooled_fc_text.w_0", initializer=self._param_initializer),
+ bias_attr="pooled_fc_text.b_0")
+
+ image_cls_feat = fluid.layers.slice(
+ input=self._enc_vl_out, axes=[1], starts=[0], ends=[1])
+
+ image_cls_feat = fluid.layers.cast(
+ x=image_cls_feat, dtype=self._emb_dtype)
+
+ image_cls_feat = fluid.layers.fc(
+ input=image_cls_feat,
+ size=self._co_emb_size,
+ act="relu",
+ param_attr=fluid.ParamAttr(
+ name="pooled_fc_image.w_0", initializer=self._param_initializer),
+ bias_attr="pooled_fc_image.b_0")
+ return text_cls_feat, image_cls_feat
+
+ def get_match_score(self, text, image, dropout_rate=0.0, mode="mul"):
+ """
+ match score for text [cls] and image [img] tokens
+ """
+ if mode == "sum":
+ emb_fuse = text + image
+ elif mode == "mul":
+ emb_fuse = text * image
+ else:
+ "current mode %s is not supported" % mode
+ return
+ if dropout_rate > 0.0:
+
+ emb_fuse = fluid.layers.dropout(emb_fuse,
+ self._attention_dropout,
+ dropout_implementation="upscale_in_train")
+ return emb_fuse
+
+
+
diff --git a/ernie-vil/model/vl_transformer_encoder.py b/ernie-vil/model/vl_transformer_encoder.py
new file mode 100644
index 0000000000000000000000000000000000000000..0a477541d5fb7edb9e5322e76c023fd6cd66197b
--- /dev/null
+++ b/ernie-vil/model/vl_transformer_encoder.py
@@ -0,0 +1,561 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""two-stream Transformer encoder."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from functools import partial
+
+import paddle.fluid as fluid
+import paddle.fluid.layers as layers
+
+
+def multi_head_attention(queries,
+ keys,
+ values,
+ attn_bias,
+ d_key,
+ d_value,
+ d_model,
+ n_head=1,
+ dropout_rate=0.,
+ cache=None,
+ param_initializer=None,
+ name='multi_head_att'):
+ """
+ Multi-Head Attention. Note that attn_bias is added to the logit before
+    computing the softmax activation to mask certain selected positions so that
+    they will not be considered in the attention weights.
+ """
+ keys = queries if keys is None else keys
+ values = keys if values is None else values
+
+ if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
+ raise ValueError(
+ "Inputs: quries, keys and values should all be 3-D tensors.")
+
+ def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
+ """
+ Add linear projection to queries, keys, and values.
+ """
+ q = layers.fc(input=queries,
+ size=d_key * n_head,
+ num_flatten_dims=2,
+ param_attr=fluid.ParamAttr(
+ name=name + '_query_fc.w_0',
+ initializer=param_initializer),
+ bias_attr=name + '_query_fc.b_0')
+ k = layers.fc(input=keys,
+ size=d_key * n_head,
+ num_flatten_dims=2,
+ param_attr=fluid.ParamAttr(
+ name=name + '_key_fc.w_0',
+ initializer=param_initializer),
+ bias_attr=name + '_key_fc.b_0')
+ v = layers.fc(input=values,
+ size=d_value * n_head,
+ num_flatten_dims=2,
+ param_attr=fluid.ParamAttr(
+ name=name + '_value_fc.w_0',
+ initializer=param_initializer),
+ bias_attr=name + '_value_fc.b_0')
+ return q, k, v
+
+ def __split_heads(x, n_head):
+ """
+        Reshape the last dimension of input tensor x so that it becomes two
+ dimensions and then transpose. Specifically, input a tensor with shape
+ [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
+ with shape [bs, n_head, max_sequence_length, hidden_dim].
+ """
+ hidden_size = x.shape[-1]
+ # The value 0 in shape attr means copying the corresponding dimension
+ # size of the input as the output dimension size.
+ reshaped = layers.reshape(
+ x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
+
+        # permute the dimensions into:
+ # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
+ return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
+
+ def __combine_heads(x):
+ """
+        Transpose and then reshape the last two dimensions of input tensor x
+ so that it becomes one dimension, which is reverse to __split_heads.
+ """
+ if len(x.shape) == 3: return x
+ if len(x.shape) != 4:
+ raise ValueError("Input(x) should be a 4-D Tensor.")
+
+ trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
+ # The value 0 in shape attr means copying the corresponding dimension
+ # size of the input as the output dimension size.
+ return layers.reshape(
+ x=trans_x,
+ shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
+ inplace=True)
+
+ def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
+ """
+ Scaled Dot-Product Attention
+ """
+ scaled_q = layers.scale(x=q, scale=d_key ** -0.5)
+ product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
+ if attn_bias:
+ product += attn_bias
+ weights = layers.softmax(product)
+ if dropout_rate:
+ weights = layers.dropout(
+ weights,
+ dropout_prob=dropout_rate,
+ dropout_implementation="upscale_in_train",
+ is_test=False)
+ out = layers.matmul(weights, v)
+ return out
+
+ q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
+
+ if cache is not None: # use cache and concat time steps
+ # Since the inplace reshape in __split_heads changes the shape of k and
+ # v, which is the cache input for next time step, reshape the cache
+ # input from the previous time step first.
+ k = cache["k"] = layers.concat(
+ [layers.reshape(
+ cache["k"], shape=[0, 0, d_model]), k], axis=1)
+ v = cache["v"] = layers.concat(
+ [layers.reshape(
+ cache["v"], shape=[0, 0, d_model]), v], axis=1)
+
+ q = __split_heads(q, n_head)
+ k = __split_heads(k, n_head)
+ v = __split_heads(v, n_head)
+
+ ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
+ dropout_rate)
+
+ out = __combine_heads(ctx_multiheads)
+
+ # Project back to the model size.
+ proj_out = layers.fc(input=out,
+ size=d_model,
+ num_flatten_dims=2,
+ param_attr=fluid.ParamAttr(
+ name=name + '_output_fc.w_0',
+ initializer=param_initializer),
+ bias_attr=name + '_output_fc.b_0')
+ return proj_out
+
+
+def positionwise_feed_forward(x,
+ d_inner_hid,
+ d_hid,
+ dropout_rate,
+ hidden_act,
+ param_initializer=None,
+ name='ffn'):
+ """
+ Position-wise Feed-Forward Networks.
+ This module consists of two linear transformations with a ReLU activation
+ in between, which is applied to each position separately and identically.
+ """
+ hidden = layers.fc(input=x,
+ size=d_inner_hid,
+ num_flatten_dims=2,
+ act=hidden_act,
+ param_attr=fluid.ParamAttr(
+ name=name + '_fc_0.w_0',
+ initializer=param_initializer),
+ bias_attr=name + '_fc_0.b_0')
+ if dropout_rate:
+ hidden = layers.dropout(
+ hidden,
+ dropout_prob=dropout_rate,
+ dropout_implementation="upscale_in_train",
+ is_test=False)
+ out = layers.fc(input=hidden,
+ size=d_hid,
+ num_flatten_dims=2,
+ param_attr=fluid.ParamAttr(
+ name=name + '_fc_1.w_0', initializer=param_initializer),
+ bias_attr=name + '_fc_1.b_0')
+ return out
+
+
+def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
+ name=''):
+ """
+ Add residual connection, layer normalization and dropout to the out
+ tensor, optionally, according to the value of process_cmd.
+ This will be used before or after multi-head attention and position-wise
+ feed-forward networks.
+ """
+ for cmd in process_cmd:
+ if cmd == "a": # add residual connection
+ out = out + prev_out if prev_out else out
+ elif cmd == "n": # add layer normalization
+ out = layers.layer_norm(
+ out,
+ begin_norm_axis=len(out.shape) - 1,
+ param_attr=fluid.ParamAttr(
+ name=name + '_layer_norm_scale',
+ initializer=fluid.initializer.Constant(1.)),
+ bias_attr=fluid.ParamAttr(
+ name=name + '_layer_norm_bias',
+ initializer=fluid.initializer.Constant(0.)))
+ elif cmd == "d": # add dropout
+ if dropout_rate:
+ out = layers.dropout(
+ out,
+ dropout_prob=dropout_rate,
+ dropout_implementation="upscale_in_train",
+ is_test=False)
+ return out
+
+
+pre_process_layer = partial(pre_post_process_layer, None)
+post_process_layer = pre_post_process_layer
+
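+# --- Editor's illustrative sketch ----------------------------------------------
+# process_cmd is interpreted one character at a time: "a" adds the residual,
+# "n" applies layer normalization, "d" applies dropout.  With the defaults used
+# by the encoder below (preprocess_cmd="n", postprocess_cmd="da"), each
+# sub-layer therefore computes  out = x + dropout(sublayer(layer_norm(x))),
+# i.e. a pre-norm residual block.  The helper just prints that ordering.
+def _demo_process_cmd(preprocess_cmd="n", postprocess_cmd="da"):
+    """Print the order of operations implied by the cmd strings."""
+    ops = {"a": "add residual", "n": "layer_norm", "d": "dropout"}
+    print("before sub-layer: %s" % [ops[c] for c in preprocess_cmd])
+    print("after sub-layer:  %s" % [ops[c] for c in postprocess_cmd])
+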
+
+def encoder_co_layer(enc_input,
+ enc_vl_input,
+ attn_vl_bias,
+ co_head,
+ co_key,
+ co_value,
+ co_model,
+ d_model,
+ d_inner_hid,
+ v_model,
+ v_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd="n",
+ postprocess_cmd="da",
+ param_initializer=None,
+ name=''):
+ """
+ Co-attention layer: the language stream attends over the visual stream and the visual stream attends over the language stream.
+ """
+ enc_input_pre = pre_process_layer(
+ enc_input,
+ preprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_pre_att')
+
+ enc_input_vl_pre = pre_process_layer(
+ enc_vl_input,
+ preprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_vl_pre_att')
+
+ attn_output = multi_head_attention(
+ enc_input_pre,
+ enc_input_vl_pre,
+ enc_input_vl_pre,
+ layers.transpose(attn_vl_bias, perm=[0, 1, 3, 2]),
+ co_key,
+ co_value,
+ d_model,
+ co_head,
+ attention_dropout,
+ param_initializer=param_initializer,
+ name=name + '_multi_head_att')
+
+ attn_vl_output = multi_head_attention(
+ enc_input_vl_pre,
+ enc_input_pre,
+ enc_input_pre,
+ attn_vl_bias,
+ co_key,
+ co_value,
+ v_model,
+ co_head,
+ attention_dropout,
+ param_initializer=param_initializer,
+ name=name + '_vl_multi_head_att')
+
+ attn_output = post_process_layer(
+ enc_input,
+ attn_output,
+ postprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_post_att')
+
+ attn_vl_output = post_process_layer(
+ enc_vl_input,
+ attn_vl_output,
+ postprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_vl_post_att')
+
+ ffd_output = positionwise_feed_forward(
+ pre_process_layer(
+ attn_output,
+ preprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_pre_ffn'),
+ d_inner_hid,
+ d_model,
+ relu_dropout,
+ hidden_act,
+ param_initializer=param_initializer,
+ name=name + '_ffn')
+
+ ffd_vl_output = positionwise_feed_forward(
+ pre_process_layer(
+ attn_vl_output,
+ preprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_pre_vl_ffn'),
+ v_inner_hid,
+ v_model,
+ relu_dropout,
+ hidden_act,
+ param_initializer=param_initializer,
+ name=name + '_vl_ffn')
+
+ enc_output = post_process_layer(
+ attn_output,
+ ffd_output,
+ postprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_post_ffn')
+
+ enc_vl_output = post_process_layer(
+ attn_vl_output,
+ ffd_vl_output,
+ postprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_vl_post_ffn')
+
+ return enc_output, enc_vl_output
+
+
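+# Editor's note: in encoder_co_layer above, the two multi_head_attention calls
+# swap the roles of the streams -- the language stream supplies the queries and
+# attends over visual keys/values (using the transposed attn_vl_bias), while
+# the visual stream supplies the queries and attends over language keys/values.
+# Each stream then goes through its own feed-forward block (the d_* sizes for
+# text, the v_* sizes for the visual stream).
+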
+def encoder_layer(enc_input,
+ attn_bias,
+ n_head,
+ d_key,
+ d_value,
+ d_model,
+ d_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd="n",
+ postprocess_cmd="da",
+ param_initializer=None,
+ name=''):
+ """The encoder layers that can be stacked to form a deep encoder.
+ This module consits of a multi-head (self) attention followed by
+ position-wise feed-forward networks and both the two components companied
+ with the post_process_layer to add residual connection, layer normalization
+ and droput.
+ """
+ attn_output = multi_head_attention(
+ pre_process_layer(
+ enc_input,
+ preprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_pre_att'),
+ None,
+ None,
+ attn_bias,
+ d_key,
+ d_value,
+ d_model,
+ n_head,
+ attention_dropout,
+ param_initializer=param_initializer,
+ name=name + '_multi_head_att')
+ attn_output = post_process_layer(
+ enc_input,
+ attn_output,
+ postprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_post_att')
+ ffd_output = positionwise_feed_forward(
+ pre_process_layer(
+ attn_output,
+ preprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_pre_ffn'),
+ d_inner_hid,
+ d_model,
+ relu_dropout,
+ hidden_act,
+ param_initializer=param_initializer,
+ name=name + '_ffn')
+ return post_process_layer(
+ attn_output,
+ ffd_output,
+ postprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_post_ffn')
+
+
+def encoder(enc_input,
+ enc_vl_input,
+ attn_bias,
+ attn_image_bias,
+ attn_vl_bias,
+ n_layer,
+ n_head,
+ d_key,
+ d_value,
+ d_model,
+ d_inner_hid,
+ v_head,
+ v_key,
+ v_value,
+ v_model,
+ v_inner_hid,
+ co_head,
+ co_key,
+ co_value,
+ co_model,
+ co_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd="n",
+ postprocess_cmd="da",
+ param_initializer=None,
+ v_biattention_id=[0, 1, 2, 3, 4, 5],
+ t_biattention_id=[18, 19, 20, 21, 22, 23],
+ name=''):
+ """
+ The encoder is composed of a stack of layers built by encoder_layer
+ (single-stream self-attention) and encoder_co_layer (cross-stream
+ co-attention), interleaved according to v_biattention_id and t_biattention_id.
+ """
+
+ v_start = 0
+ t_start = 0
+ block = 0
+
+ for v_layer_id, t_layer_id in zip(v_biattention_id, t_biattention_id):
+ v_end = v_layer_id
+ t_end = t_layer_id
+ for idx in range(t_start, t_end):
+ enc_output = encoder_layer(
+ enc_input,
+ attn_bias,
+ n_head,
+ d_key,
+ d_value,
+ d_model,
+ d_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd,
+ postprocess_cmd,
+ param_initializer=param_initializer,
+ name=name + '_layer_' + str(idx))
+ enc_input = enc_output
+
+ for idx in range(v_start, v_end):
+ enc_vl_output = encoder_layer(
+ enc_vl_input,
+ attn_image_bias,
+ v_head,
+ v_key,
+ v_value,
+ v_model,
+ v_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd,
+ postprocess_cmd,
+ param_initializer=param_initializer,
+ name=name + '_vlayer_' + str(idx))
+ enc_vl_input = enc_vl_output
+
+ enc_output, enc_vl_output = encoder_co_layer(
+ enc_input,
+ enc_vl_input,
+ attn_vl_bias,
+ co_head,
+ co_key,
+ co_value,
+ co_model,
+ d_model,
+ d_inner_hid,
+ v_model,
+ v_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd,
+ postprocess_cmd,
+ param_initializer=param_initializer,
+ name=name + '_colayer_' + str(block))
+
+ enc_input, enc_vl_input = enc_output, enc_vl_output
+
+ block += 1
+ v_start = v_end
+ t_start = t_end
+
+ enc_output = encoder_layer(
+ enc_output,
+ attn_bias,
+ n_head,
+ d_key,
+ d_value,
+ d_model,
+ d_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd,
+ postprocess_cmd,
+ param_initializer=param_initializer,
+ name=name + '_layer_' + str(t_end))
+
+ enc_vl_output = encoder_layer(
+ enc_vl_output,
+ attn_image_bias,
+ v_head,
+ v_key,
+ v_value,
+ v_model,
+ v_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd,
+ postprocess_cmd,
+ param_initializer=param_initializer,
+ name=name + '_vlayer_' + str(v_end))
+
+ enc_output = pre_process_layer(
+ enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
+
+ enc_vl_output = pre_process_layer(
+ enc_vl_output, preprocess_cmd, prepostprocess_dropout, name="vl_post_encoder")
+
+ return enc_output, enc_vl_output
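+
+# --- Editor's illustrative sketch ----------------------------------------------
+# The stacking order in encoder() is easier to see in plain Python.  The helper
+# below reproduces the same scheduling logic and prints, for the default
+# v_biattention_id / t_biattention_id, which text layers, visual layers and
+# co-attention blocks are built, in order.
+def _print_encoder_schedule(v_biattention_id=(0, 1, 2, 3, 4, 5),
+                            t_biattention_id=(18, 19, 20, 21, 22, 23)):
+    """Print the text / visual / co-attention layer schedule."""
+    v_start = t_start = block = 0
+    for v_end, t_end in zip(v_biattention_id, t_biattention_id):
+        print("text layers %s" % list(range(t_start, t_end)))
+        print("visual layers %s" % list(range(v_start, v_end)))
+        print("co-attention block %d" % block)
+        block += 1
+        v_start, t_start = v_end, t_end
+    print("final text layer %d, final visual layer %d" % (t_end, v_end))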
diff --git a/ernie-vil/optim/__init__.py b/ernie-vil/optim/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/ernie-vil/optim/optimization.py b/ernie-vil/optim/optimization.py
new file mode 100644
index 0000000000000000000000000000000000000000..fb27665f9cc6bff8fa8e2febda8c4058c082b18c
--- /dev/null
+++ b/ernie-vil/optim/optimization.py
@@ -0,0 +1,167 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" text preprocess """
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+import paddle.fluid as fluid
+
+def manual_warmup_decay(learning_rate, warmup_steps, num_train_steps, decay_steps=[], lr_decay_ratio=0.1):
+ """
+ Applies linear warmup of the learning rate from 0, then keeps it piecewise
+ constant, multiplying by lr_decay_ratio each time the global step passes a
+ boundary in decay_steps.
+ """
+ with fluid.default_main_program()._lr_schedule_guard():
+ lr = fluid.layers.tensor.create_global_var(
+ shape=[1],
+ value=0.0,
+ dtype='float32',
+ persistable=True,
+ name="scheduled_learning_rate")
+
+ global_step = fluid.layers.learning_rate_scheduler._decay_step_counter(
+ )
+ with fluid.layers.control_flow.Switch() as switch:
+ with switch.case(global_step < warmup_steps):
+ warmup_lr = learning_rate * (global_step / warmup_steps)
+ fluid.layers.tensor.assign(warmup_lr, lr)
+ for i, step in enumerate(decay_steps):
+ with switch.case(global_step < step):
+ # (global_step / global_step) is a tensor of constant value 1.0; multiplying
+ # by it turns the Python float into a Variable that tensor.assign can consume.
+ decayed_lr = learning_rate * (global_step / global_step) * pow(lr_decay_ratio, i)
+ fluid.layers.tensor.assign(decayed_lr, lr)
+ with switch.default():
+ constant_lr = learning_rate * (global_step / global_step) * pow(lr_decay_ratio, len(decay_steps))
+ fluid.layers.tensor.assign(constant_lr, lr)
+
+ return lr
+
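+# --- Editor's illustrative sketch ----------------------------------------------
+# The same schedule computed eagerly in plain Python (toy values), so the shape
+# of the curve is easy to verify without building a fluid program: linear
+# warmup to learning_rate, then a drop by lr_decay_ratio each time the step
+# passes a boundary in decay_steps.
+def _manual_warmup_decay_value(step, learning_rate, warmup_steps,
+                               decay_steps=(), lr_decay_ratio=0.1):
+    """Return the learning rate used at a given global step."""
+    if step < warmup_steps:
+        return learning_rate * float(step) / warmup_steps
+    for i, boundary in enumerate(decay_steps):
+        if step < boundary:
+            return learning_rate * lr_decay_ratio ** i
+    return learning_rate * lr_decay_ratio ** len(decay_steps)
+
+# e.g. _manual_warmup_decay_value(5000, 1e-4, 1000, decay_steps=(3000, 6000))
+# returns 1e-5: one decay has been applied, since 3000 <= 5000 < 6000.
+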
+
+def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
+ """
+ Applies linear warmup of the learning rate from 0, then decays it linearly to 0 over num_train_steps.
+ """
+ with fluid.default_main_program()._lr_schedule_guard():
+ lr = fluid.layers.tensor.create_global_var(
+ shape=[1],
+ value=0.0,
+ dtype='float32',
+ persistable=True,
+ name="scheduled_learning_rate")
+
+ global_step = fluid.layers.learning_rate_scheduler._decay_step_counter(
+ )
+
+ with fluid.layers.control_flow.Switch() as switch:
+ with switch.case(global_step < warmup_steps):
+ warmup_lr = learning_rate * (global_step / warmup_steps)
+ fluid.layers.tensor.assign(warmup_lr, lr)
+ with switch.default():
+ decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
+ learning_rate=learning_rate,
+ decay_steps=num_train_steps,
+ end_learning_rate=0.0,
+ power=1.0,
+ cycle=False)
+ fluid.layers.tensor.assign(decayed_lr, lr)
+
+ return lr
+
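+# --- Editor's illustrative sketch ----------------------------------------------
+# Closed form of the schedule above: the rate rises linearly from 0 to
+# learning_rate over warmup_steps, and after warmup follows polynomial decay
+# with power=1.0 and end_learning_rate=0.0, i.e.
+# learning_rate * (1 - step / num_train_steps).
+def _linear_warmup_decay_value(step, learning_rate, warmup_steps, num_train_steps):
+    """Return the learning rate used at a given global step."""
+    if step < warmup_steps:
+        return learning_rate * float(step) / warmup_steps
+    step = min(step, num_train_steps)
+    return learning_rate * (1.0 - float(step) / num_train_steps)
+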
+def optimization(loss,
+ warmup_steps,
+ num_train_steps,
+ learning_rate,
+ train_program,
+ startup_prog,
+ weight_decay,
+ scheduler='linear_warmup_decay',
+ decay_steps=[],
+ lr_decay_dict_file="",
+ lr_decay_ratio=0.1):
+ """
+ optimization implementation
+ """
+ if warmup_steps > 0:
+ if scheduler == 'noam_decay':
+ scheduled_lr = fluid.layers.learning_rate_scheduler \
+ .noam_decay(1 / (warmup_steps * (learning_rate ** 2)),
+ warmup_steps)
+ elif scheduler == 'linear_warmup_decay':
+ scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps,
+ num_train_steps)
+ elif scheduler == 'manual_warmup_decay':
+ scheduled_lr = manual_warmup_decay(learning_rate, warmup_steps,
+ num_train_steps, decay_steps, lr_decay_ratio)
+ else:
+ raise ValueError("Unkown learning rate scheduler, should be "
+ "'noam_decay' or 'linear_warmup_decay' or 'manual_warmup_decay'")
+ else:
+ scheduled_lr = fluid.layers.create_global_var(
+ name=fluid.unique_name.generate("learning_rate"),
+ shape=[1],
+ value=learning_rate,
+ dtype='float32',
+ persistable=True)
+
+ lr_decay_dict = {}
+ if lr_decay_dict_file != "":
+ with open(lr_decay_dict_file) as f:
+ for line in f:
+ param, decay_rate = line.strip().split('\t')
+ lr_decay_dict[param] = float(decay_rate)
+
+ for param in fluid.default_main_program().block(0).all_parameters():
+ if param.name in lr_decay_dict:
+ print (param.name, lr_decay_dict[param.name])
+ param.optimize_attr['learning_rate'] = lr_decay_dict[param.name]
+
+ optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
+ optimizer._learning_rate_map[fluid.default_main_program(
+ )] = scheduled_lr
+
+
+ fluid.clip.set_gradient_clip(
+ clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0))
+
+ def exclude_from_weight_decay(name):
+ """
+ Parameters whose name matches are excluded from weight decay
+ """
+ if name.find("layer_norm") > -1:
+ return True
+ bias_suffix = ["_bias", "_b", ".b_0"]
+ for suffix in bias_suffix:
+ if name.endswith(suffix):
+ return True
+ return False
+
+ param_list = dict()
+
+ for param in train_program.global_block().all_parameters():
+ param_list[param.name] = param * 1.0
+ param_list[param.name].stop_gradient = True
+
+ _, param_grads = optimizer.minimize(loss)
+
+ if weight_decay > 0:
+ for param, grad in param_grads:
+ if exclude_from_weight_decay(param.name):
+ continue
+ with param.block.program._optimized_guard(
+ [param, grad]), fluid.framework.name_scope("weight_decay"):
+ updated_param = param - param_list[
+ param.name] * weight_decay * scheduled_lr * param.optimize_attr['learning_rate']
+ fluid.layers.assign(output=param, input=updated_param)
+
+ return scheduled_lr
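+
+# --- Editor's illustrative sketch ----------------------------------------------
+# The weight-decay loop above implements decoupled weight decay: Adam updates
+# the parameter from the gradient only, and afterwards each non-excluded
+# parameter is shrunk by  weight_decay * scheduled_lr * (its per-parameter lr
+# multiplier), using the value snapshotted in param_list *before* the Adam
+# step.  A one-parameter sketch of that update rule:
+def _demo_decoupled_weight_decay(param, adam_update, weight_decay, lr, lr_mult=1.0):
+    """Return the parameter value after an Adam step plus decoupled decay."""
+    param_before = param                  # snapshot, cf. param_list above
+    param = param + adam_update           # whatever the Adam step produced
+    param = param - param_before * weight_decay * lr * lr_mult
+    return param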
diff --git a/ernie-vil/preprocess/__init__.py b/ernie-vil/preprocess/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/ernie-vil/preprocess/preprocessor.py b/ernie-vil/preprocess/preprocessor.py
new file mode 100755
index 0000000000000000000000000000000000000000..0cc0a80139d7bbaad98df8c99352cc95c6f5bec4
--- /dev/null
+++ b/ernie-vil/preprocess/preprocessor.py
@@ -0,0 +1,46 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" text preprocess """
+
+import random
+import sys
+import os
+import base64
+import numpy as np
+
+reload(sys)
+sys.setdefaultencoding("utf-8")
+
+from preprocess import tokenization
+
+class PreprocessorBasic(object):
+ """
+ Main class for text preprocess
+ """
+ def __init__(self,
+ tokenizer_name,
+ vocab_path,
+ tagger_path="",
+ nltk_data_path="",
+ do_lower_case=True):
+ self.do_lower_case = do_lower_case
+ self.tokenizer = getattr(tokenization, tokenizer_name)(vocab_file=vocab_path, do_lower_case=do_lower_case)
+ self.vocab = self.tokenizer.vocab
+
+ def convert_sentence_to_ids_without_cls(self, sentence):
+ """
+ Convert sentence to ids without cls
+ """
+ tokens = self.tokenizer.tokenize(sentence)
+ ids = self.tokenizer.convert_tokens_to_ids(tokens)
+ return ids
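+
+# --- Editor's illustrative usage ------------------------------------------------
+# A minimal sketch of how the readers use this class; the vocab path is a
+# placeholder, and "FullTokenizer" is one of the tokenizer classes defined in
+# preprocess/tokenization.py.  No [CLS]/[SEP] ids are added here -- the reader
+# inserts those special tokens itself.
+def _demo_convert_sentence(vocab_path):
+    """Tokenize one sentence with a FullTokenizer-backed preprocessor."""
+    processor = PreprocessorBasic(tokenizer_name="FullTokenizer",
+                                  vocab_path=vocab_path)
+    return processor.convert_sentence_to_ids_without_cls("a man riding a horse")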
diff --git a/ernie-vil/preprocess/tokenization.py b/ernie-vil/preprocess/tokenization.py
new file mode 100644
index 0000000000000000000000000000000000000000..a661203259b61b6db061158ed91580c0a18af2bd
--- /dev/null
+++ b/ernie-vil/preprocess/tokenization.py
@@ -0,0 +1,467 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" tokenization implemnet """
+
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import collections
+import unicodedata
+import six
+from functools import reduce
+
+def convert_to_unicode(text):
+ """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
+ if six.PY3:
+ if isinstance(text, str):
+ return text
+ elif isinstance(text, bytes):
+ return text.decode("utf-8", "ignore")
+ else:
+ raise ValueError("Unsupported string type: %s" % (type(text)))
+ elif six.PY2:
+ if isinstance(text, str):
+ return text.decode("utf-8", "ignore")
+ elif isinstance(text, unicode):
+ return text
+ else:
+ raise ValueError("Unsupported string type: %s" % (type(text)))
+ else:
+ raise ValueError("Not running on Python2 or Python 3?")
+
+
+def printable_text(text):
+ """Returns text encoded in a way suitable for print or `tf.logging`."""
+
+ # These functions want `str` for both Python2 and Python3, but in one case
+ # it's a Unicode string and in the other it's a byte string.
+ if six.PY3:
+ if isinstance(text, str):
+ return text
+ elif isinstance(text, bytes):
+ return text.decode("utf-8", "ignore")
+ else:
+ raise ValueError("Unsupported string type: %s" % (type(text)))
+ elif six.PY2:
+ if isinstance(text, str):
+ return text
+ elif isinstance(text, unicode):
+ return text.encode("utf-8")
+ else:
+ raise ValueError("Unsupported string type: %s" % (type(text)))
+ else:
+ raise ValueError("Not running on Python2 or Python 3?")
+
+
+def load_vocab(vocab_file):
+ """Loads a vocabulary file into a dictionary."""
+ vocab = collections.OrderedDict()
+ fin = open(vocab_file)
+ for num, line in enumerate(fin):
+ items = convert_to_unicode(line.strip()).split("\t")
+ if len(items) > 2:
+ break
+ token = items[0]
+ index = items[1] if len(items) == 2 else num
+ token = token.strip()
+ vocab[token] = int(index)
+ return vocab
+
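+# --- Editor's illustrative sketch ----------------------------------------------
+# load_vocab accepts either "token<TAB>id" lines or one bare token per line (in
+# which case the line number becomes the id).  A toy round-trip through a
+# temporary file:
+def _demo_load_vocab():
+    """Write a two-line vocab file and load it back."""
+    import os
+    import tempfile
+    fd, path = tempfile.mkstemp()
+    with os.fdopen(fd, "w") as f:
+        f.write("[PAD]\t0\n[UNK]\t1\n")
+    vocab = load_vocab(path)
+    os.remove(path)
+    print(vocab["[PAD]"], vocab["[UNK]"])   # 0 1
+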
+
+def convert_by_vocab(vocab, items):
+ """Converts a sequence of [tokens|ids] using the vocab."""
+ output = []
+ for item in items:
+ output.append(vocab[item])
+ return output
+
+
+def convert_tokens_to_ids(vocab, tokens):
+ """
+ Converts tokens to ids
+ """
+ return convert_by_vocab(vocab, tokens)
+
+
+def convert_ids_to_tokens(inv_vocab, ids):
+ """
+ Converts ids to tokens
+ """
+ return convert_by_vocab(inv_vocab, ids)
+
+
+def whitespace_tokenize(text):
+ """Runs basic whitespace cleaning and splitting on a peice of text."""
+ text = text.strip()
+ if not text:
+ return []
+ tokens = text.split()
+ return tokens
+
+
+class FullTokenizer(object):
+ """Runs end-to-end tokenziation."""
+
+ def __init__(self, vocab_file, do_lower_case=True):
+ self.vocab = load_vocab(vocab_file)
+ self.inv_vocab = {v: k for k, v in self.vocab.items()}
+ self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
+ self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
+
+ def tokenize(self, text):
+ """
+ turn text into tokens
+ """
+ split_tokens = []
+ for token in self.basic_tokenizer.tokenize(text):
+ for sub_token in self.wordpiece_tokenizer.tokenize(token):
+ split_tokens.append(sub_token)
+
+ return split_tokens
+
+ def tokenize_case(self, text):
+ """
+ Tokenize text and return, for each word piece, whether the original word was title-cased
+ """
+ split_tokens = []
+ case_indexs = []
+ basic_tokens, case_index = self.basic_tokenizer.tokenize_case(text)
+ case_indexs += case_index
+ case_indexs = [[i] for i in case_indexs]
+
+ for token_index, token in enumerate(basic_tokens):
+ wordpiece_tokens = self.wordpiece_tokenizer.tokenize(token)
+ if len(wordpiece_tokens) > 1:
+ case_indexs[token_index] = case_indexs[token_index]*(len(wordpiece_tokens))
+ for sub_token in wordpiece_tokens:
+ split_tokens.append(sub_token)
+
+ if case_indexs:
+ case_indexs = reduce(lambda x, y: x + y, case_indexs)
+ return split_tokens, case_indexs
+
+ def convert_tokens_to_ids(self, tokens):
+ """
+ Converts tokens to ids
+ """
+ return convert_by_vocab(self.vocab, tokens)
+
+ def convert_ids_to_tokens(self, ids):
+ """
+ Converts ids to tokens
+ """
+ return convert_by_vocab(self.inv_vocab, ids)
+
+
+class CharTokenizer(object):
+ """Runs end-to-end tokenziation."""
+
+ def __init__(self, vocab_file, do_lower_case=True):
+ self.vocab = load_vocab(vocab_file)
+ self.inv_vocab = {v: k for k, v in self.vocab.items()}
+ self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
+
+ def tokenize(self, text):
+ """
+ Convert text to tokens
+ """
+ split_tokens = []
+ for token in text.lower().split(" "):
+ for sub_token in self.wordpiece_tokenizer.tokenize(token):
+ split_tokens.append(sub_token)
+
+ return split_tokens
+
+ def convert_tokens_to_ids(self, tokens):
+ """
+ Convert tokens to ids
+ """
+ return convert_by_vocab(self.vocab, tokens)
+
+ def convert_ids_to_tokens(self, ids):
+ """
+ Convert ids to tokens
+ """
+ return convert_by_vocab(self.inv_vocab, ids)
+
+
+class BasicTokenizer(object):
+ """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
+
+ def __init__(self, do_lower_case=True):
+ """Constructs a BasicTokenizer.
+
+ Args:
+ do_lower_case: Whether to lower case the input.
+ """
+ self.do_lower_case = do_lower_case
+
+ def tokenize(self, text):
+ """Tokenizes a piece of text."""
+ text = convert_to_unicode(text)
+ text = self._clean_text(text)
+
+ # This was added on November 1st, 2018 for the multilingual and Chinese
+ # models. This is also applied to the English models now, but it doesn't
+ # matter since the English models were not trained on any Chinese data
+ # and generally don't have any Chinese data in them (there are Chinese
+ # characters in the vocabulary because Wikipedia does have some Chinese
+ # words in the English Wikipedia.).
+ text = self._tokenize_chinese_chars(text)
+
+ orig_tokens = whitespace_tokenize(text)
+ split_tokens = []
+ for token in orig_tokens:
+ if self.do_lower_case:
+ token = token.lower()
+ token = self._run_strip_accents(token)
+ split_tokens.extend(self._run_split_on_punc(token))
+
+ output_tokens = whitespace_tokenize(" ".join(split_tokens))
+ return output_tokens
+
+ def tokenize_case(self, text):
+ """
+ Tokenize text and record whether each output token's source word was title-cased
+ """
+ text = convert_to_unicode(text)
+ text = self._clean_text(text)
+ text = self._tokenize_chinese_chars(text)
+
+ orig_tokens = whitespace_tokenize(text)
+ split_tokens = []
+ case_index = []
+
+ for token in orig_tokens:
+ if self.do_lower_case:
+ if token.istitle():
+ case_index.append(1)
+ else:
+ case_index.append(0)
+ token = token.lower()
+ token = self._run_strip_accents(token)
+ if token == '':
+ case_index.pop()
+
+ tmpsplit_tokens, case_index = self._run_split_on_punc_case(token, case_index)
+ split_tokens.extend(tmpsplit_tokens)
+
+ output_tokens = whitespace_tokenize(" ".join(split_tokens))
+ return output_tokens, case_index
+
+ def _run_strip_accents(self, text):
+ """Strips accents from a piece of text."""
+ text = unicodedata.normalize("NFD", text)
+ output = []
+ for char in text:
+ cat = unicodedata.category(char)
+ if cat == "Mn":
+ continue
+ output.append(char)
+ return "".join(output)
+
+ def _run_split_on_punc(self, text):
+ """Splits punctuation on a piece of text."""
+ chars = list(text)
+ i = 0
+ start_new_word = True
+ output = []
+ while i < len(chars):
+ char = chars[i]
+ if _is_punctuation(char):
+ output.append([char])
+ start_new_word = True
+ else:
+ if start_new_word:
+ output.append([])
+ start_new_word = False
+ output[-1].append(char)
+ i += 1
+
+ return ["".join(x) for x in output]
+
+ def _run_split_on_punc_case(self, text, case_index):
+ """Splits punctuation on a piece of text."""
+ chars = list(text)
+ i = 0
+ start_new_word = True
+ output = []
+
+ while i < len(chars):
+ char = chars[i]
+ if _is_punctuation(char):
+ output.append([char])
+ start_new_word = True
+ else:
+ if start_new_word:
+ output.append([])
+ start_new_word = False
+ output[-1].append(char)
+ i += 1
+
+ if len(output) > 1:
+ case_index.extend([case_index[-1]]*(len(output)-1))
+
+ return ["".join(x) for x in output], case_index
+
+ def _tokenize_chinese_chars(self, text):
+ """Adds whitespace around any CJK character."""
+ output = []
+ for char in text:
+ cp = ord(char)
+ if self._is_chinese_char(cp):
+ output.append(" ")
+ output.append(char)
+ output.append(" ")
+ else:
+ output.append(char)
+ return "".join(output)
+
+ def _is_chinese_char(self, cp):
+ """Checks whether CP is the codepoint of a CJK character."""
+ # This defines a "chinese character" as anything in the CJK Unicode block:
+ # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
+ #
+ # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
+ # despite its name. The modern Korean Hangul alphabet is a different block,
+ # as is Japanese Hiragana and Katakana. Those alphabets are used to write
+ # space-separated words, so they are not treated specially and handled
+ # like all of the other languages.
+ if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
+ (cp >= 0x3400 and cp <= 0x4DBF) or #
+ (cp >= 0x20000 and cp <= 0x2A6DF) or #
+ (cp >= 0x2A700 and cp <= 0x2B73F) or #
+ (cp >= 0x2B740 and cp <= 0x2B81F) or #
+ (cp >= 0x2B820 and cp <= 0x2CEAF) or
+ (cp >= 0xF900 and cp <= 0xFAFF) or #
+ (cp >= 0x2F800 and cp <= 0x2FA1F)): #
+ return True
+
+ return False
+
+ def _clean_text(self, text):
+ """Performs invalid character removal and whitespace cleanup on text."""
+ output = []
+ for char in text:
+ cp = ord(char)
+ if cp == 0 or cp == 0xfffd or _is_control(char):
+ continue
+ if _is_whitespace(char):
+ output.append(" ")
+ else:
+ output.append(char)
+ return "".join(output)
+
+
+class WordpieceTokenizer(object):
+ """Runs WordPiece tokenziation."""
+
+ def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
+ self.vocab = vocab
+ self.unk_token = unk_token
+ self.max_input_chars_per_word = max_input_chars_per_word
+
+ def tokenize(self, text):
+ """Tokenizes a piece of text into its word pieces.
+
+ This uses a greedy longest-match-first algorithm to perform tokenization
+ using the given vocabulary.
+
+ For example:
+ input = "unaffable"
+ output = ["un", "##aff", "##able"]
+
+ Args:
+ text: A single token or whitespace separated tokens. This should have
+ already been passed through `BasicTokenizer`.
+
+ Returns:
+ A list of wordpiece tokens.
+ """
+
+ text = convert_to_unicode(text)
+
+ output_tokens = []
+ for token in whitespace_tokenize(text):
+ chars = list(token)
+ if len(chars) > self.max_input_chars_per_word:
+ output_tokens.append(self.unk_token)
+ continue
+
+ is_bad = False
+ start = 0
+ sub_tokens = []
+ while start < len(chars):
+ end = len(chars)
+ cur_substr = None
+ while start < end:
+ substr = "".join(chars[start:end])
+ if start > 0:
+ substr = "##" + substr
+ if substr in self.vocab:
+ cur_substr = substr
+ break
+ end -= 1
+ if cur_substr is None:
+ is_bad = True
+ break
+ sub_tokens.append(cur_substr)
+ start = end
+
+ if is_bad:
+ output_tokens.append(self.unk_token)
+ else:
+ output_tokens.extend(sub_tokens)
+ return output_tokens
+
+
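+# --- Editor's illustrative sketch ----------------------------------------------
+# The greedy longest-match-first behaviour documented above, demonstrated with
+# a toy in-memory vocabulary (no vocab file required).
+def _demo_wordpiece():
+    """Show how "unaffable" is split into word pieces."""
+    toy_vocab = {"un": 0, "##aff": 1, "##able": 2, "[UNK]": 3}
+    tokenizer = WordpieceTokenizer(vocab=toy_vocab)
+    print(tokenizer.tokenize("unaffable"))   # ['un', '##aff', '##able']
+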
+def _is_whitespace(char):
+ """Checks whether `chars` is a whitespace character."""
+ # \t, \n, and \r are technically control characters but we treat them
+ # as whitespace since they are generally considered as such.
+ if char == " " or char == "\t" or char == "\n" or char == "\r":
+ return True
+ cat = unicodedata.category(char)
+ if cat == "Zs":
+ return True
+ return False
+
+
+def _is_control(char):
+ """Checks whether `chars` is a control character."""
+ # These are technically control characters but we count them as whitespace
+ # characters.
+ if char == "\t" or char == "\n" or char == "\r":
+ return False
+ cat = unicodedata.category(char)
+ if cat.startswith("C"):
+ return True
+ return False
+
+
+def _is_punctuation(char):
+ """Checks whether `chars` is a punctuation character."""
+ cp = ord(char)
+ # We treat all non-letter/number ASCII as punctuation.
+ # Characters such as "^", "$", and "`" are not in the Unicode
+ # Punctuation class but we treat them as punctuation anyways, for
+ # consistency.
+ if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
+ (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
+ return True
+ cat = unicodedata.category(char)
+ if cat.startswith("P"):
+ return True
+ return False
diff --git a/ernie-vil/reader/__init__.py b/ernie-vil/reader/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/ernie-vil/reader/_image_features_reader.py b/ernie-vil/reader/_image_features_reader.py
new file mode 100644
index 0000000000000000000000000000000000000000..2866bef90e806d14066faf9b2a17faa72834df7a
--- /dev/null
+++ b/ernie-vil/reader/_image_features_reader.py
@@ -0,0 +1,79 @@
+"""
+Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+"""
+import numpy as np
+import copy
+import pickle
+import lmdb # install lmdb by "pip install lmdb"
+import base64
+
+class ImageFeaturesH5Reader(object):
+ """
+ Reader class
+ """
+ def __init__(self, features_path):
+ self.features_path = features_path
+ self.env = lmdb.open(self.features_path, max_readers=1, readonly=True,
+ lock=False, readahead=False, meminit=False)
+
+ with self.env.begin(write=False) as txn:
+ self._image_ids = pickle.loads(txn.get('keys'.encode()))
+
+ self.features = [None] * len(self._image_ids)
+ self.num_boxes = [None] * len(self._image_ids)
+ self.boxes = [None] * len(self._image_ids)
+ self.boxes_ori = [None] * len(self._image_ids)
+
+ def __len__(self):
+ return len(self._image_ids)
+
+ def __getitem__(self, image_id):
+ image_id = str(image_id).encode()
+ index = self._image_ids.index(image_id)
+ # Read the record from LMDB every time (features are not cached in memory).
+ with self.env.begin(write=False) as txn:
+ item = pickle.loads(txn.get(image_id))
+ image_id = item['image_id']
+ image_h = int(item['image_h'])
+ image_w = int(item['image_w'])
+ num_boxes = int(item['num_boxes'])
+
+ features = np.frombuffer(base64.b64decode(item["features"]), dtype=np.float32).reshape(num_boxes, 2048)
+ boxes = np.frombuffer(base64.b64decode(item['boxes']), dtype=np.float32).reshape(num_boxes, 4)
+ g_feat = np.sum(features, axis=0) / num_boxes
+ num_boxes = num_boxes + 1
+ features = np.concatenate([np.expand_dims(g_feat, axis=0), features], axis=0)
+ image_location = np.zeros((boxes.shape[0], 5), dtype=np.float32)
+ image_location[:, :4] = boxes
+ image_location[:, 4] = (image_location[:, 3] - image_location[:, 1]) * \
+ (image_location[:, 2] - image_location[:, 0]) / (float(image_w) * float(image_h))
+
+ image_location_ori = copy.deepcopy(image_location)
+ image_location[:, 0] = image_location[:, 0] / float(image_w)
+ image_location[:, 1] = image_location[:, 1] / float(image_h)
+ image_location[:, 2] = image_location[:, 2] / float(image_w)
+ image_location[:, 3] = image_location[:, 3] / float(image_h)
+
+ g_location = np.array([0, 0, 1, 1, 1])
+ image_location = np.concatenate([np.expand_dims(g_location, axis=0), image_location], axis=0)
+
+ g_location_ori = np.array([0, 0, image_w, image_h, image_w * image_h])
+ image_location_ori = np.concatenate([np.expand_dims(g_location_ori, axis=0), image_location_ori], axis=0)
+
+ data_json = {"features": features,
+ "num_boxes": num_boxes,
+ "image_location": image_location,
+ "image_location_ori": image_location_ori
+ }
+ return data_json
+
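+# --- Editor's illustrative sketch ----------------------------------------------
+# What __getitem__ does to each detected box, in isolation: a pixel box
+# (x1, y1, x2, y2) becomes the 5-d location
+# (x1/W, y1/H, x2/W, y2/H, box_area / image_area), and a whole-image "global"
+# box (0, 0, 1, 1, 1) is prepended, matching the mean-pooled global feature.
+def _demo_box_to_location():
+    """Convert one toy box for a 100x50 image and print the 5-d location."""
+    image_w, image_h = 100.0, 50.0
+    x1, y1, x2, y2 = 10.0, 5.0, 60.0, 25.0
+    area_frac = (x2 - x1) * (y2 - y1) / (image_w * image_h)
+    location = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h, area_frac]
+    print(location)   # [0.1, 0.1, 0.6, 0.5, 0.2]
+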
diff --git a/ernie-vil/reader/vcr_finetuning.py b/ernie-vil/reader/vcr_finetuning.py
new file mode 100644
index 0000000000000000000000000000000000000000..78345572f6390864590ae9b989c0c25dc50eccb8
--- /dev/null
+++ b/ernie-vil/reader/vcr_finetuning.py
@@ -0,0 +1,473 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" VCR Data Reader implementation """
+
+from __future__ import print_function
+from __future__ import division
+
+import os
+import base64
+import numpy as np
+import re
+import random
+import json
+import json_lines
+import csv
+import sys
+import itertools
+
+from reader._image_features_reader import ImageFeaturesH5Reader
+from preprocess import preprocessor
+from batching.finetune_batching import prepare_batch_data
+
+import paddle.fluid as fluid
+
+def _converId(img_id):
+ """
+ Convert an image id string such as "train-123" into a unique integer id
+ """
+ img_id = img_id.split('-')
+ if 'train' in img_id[0]:
+ new_id = int(img_id[1])
+ elif 'val' in img_id[0]:
+ new_id = int(img_id[1]) + 1000000
+ elif 'test' in img_id[0]:
+ new_id = int(img_id[1]) + 2000000
+ else:
+ raise ValueError("unknown split in image id: %s" % "-".join(img_id))
+ return new_id
+
+
+def _load_annotationsQ_A(annotations_jsonpath, split):
+ """
+ Build an index of VCR Q->A annotations, mapping each image ID to its question, answer choices and label.
+ """
+ entries = []
+ with open(annotations_jsonpath) as f:
+ for annotation in json_lines.reader(f):
+ det_names = ""
+ question = annotation["question"]
+ if split == 'test':
+ ans_label = 0
+ else:
+ ans_label = annotation["answer_label"]
+ img_id = _converId(annotation["img_id"])
+ anno_id = int(annotation["annot_id"].split('-')[1])
+ entries.append(
+ {"question": question,
+ "answers": annotation["answer_choices"],
+ "metadata_fn": annotation["metadata_fn"],
+ "target": ans_label,
+ "img_id": img_id,
+ "anno_id": anno_id,
+ "det_names": annotation['objects']
+ })
+ return entries
+
+
+def _load_annotationsQA_R(annotations_jsonpath, split):
+ """
+ Build an index of VCR QA->R annotations, mapping each image ID to its question plus answer, rationale choices and label.
+ """
+ entries = []
+ with open(annotations_jsonpath, 'rb') as f:
+ for annotation in json_lines.reader(f):
+ if split == 'test':
+ for answer in annotation["answer_choices"]:
+ question = annotation["question"] + ["[MARK]"] + answer
+ img_id = _converId(annotation["img_id"])
+ ans_label = 0
+ anno_id = int(annotation["annot_id"].split('-')[1])
+ entries.append(
+ {"question": question,
+ "answers": annotation["rationale_choices"],
+ "metadata_fn": annotation["metadata_fn"],
+ "target": ans_label,
+ "img_id": img_id,
+ "anno_id": anno_id,
+ "det_names": annotation['objects']
+ })
+ else:
+ det_names = ""
+ question = annotation["question"] + ["[MARK]"] + \
+ annotation["answer_choices"][annotation['answer_label']]
+ ans_label = annotation["rationale_label"]
+ img_id = _converId(annotation["img_id"])
+ anno_id = int(annotation["annot_id"].split('-')[1])
+ entries.append(
+ {"question": question,
+ "answers": annotation["rationale_choices"],
+ "metadata_fn": annotation["metadata_fn"],
+ "target": ans_label,
+ "img_id": img_id,
+ "anno_id": anno_id,
+ "det_names": annotation['objects']})
+ return entries
+
+
+class VCRDataReader(object):
+ """
+ Data reader for sub VCR task
+ """
+ def __init__(self,
+ task_conf,
+ split,
+ vocab_path=None,
+ batch_size=4096,
+ shuffle=True,
+ epoch=100,
+ is_test=False,
+ feature_reader_dict={},
+ random_seed=None,
+ task_index=0,
+ task_num=1):
+
+ self.task_conf = task_conf
+ self.processor = getattr(preprocessor,
+ task_conf["Proprocessor"])(tokenizer_name=self.task_conf["tokenizer_name"],
+ vocab_path=vocab_path)
+ self.vocab = self.processor.vocab
+ self.batch_size = batch_size
+ self.shuffle = shuffle
+ self.epoch = epoch
+ self.current_epoch = 0
+ self.current_file_index = 0
+ self.total_file = 0
+ self.current_file = None
+ self.random_seed = random_seed
+ self.max_seq_len = self.task_conf['max_seq_len']
+ self.pad_id = self.vocab["[PAD]"]
+ self.cls_id = self.vocab["[CLS]"]
+ self.sep_id = self.vocab["[SEP]"]
+ self.mask_id = self.vocab["[MASK]"]
+ self.is_test = is_test
+ self.task_index = task_index
+ self.task_num = task_num
+
+ if self.is_test:
+ self.epoch = 1
+ self.shuffle_files = False
+ if self.shuffle:
+ shufflekeep_across_task = self.task_conf.get('shufflekeep_across_task', True)
+ if shufflekeep_across_task:
+ self.global_rng = np.random.RandomState(random_seed)
+ else:
+ self.global_rng = np.random.RandomState()
+ self.shuffle_every_epoch = self.task_conf.get('shuffle_every_epoch', False)
+ task=self.task_conf['task']
+ annotations_jsonpath=self.task_conf['annotations_jsonpath_' + split]
+ self.num_choice = int(self.task_conf['num_choice'])
+ if task == 'VCR_Q-A':
+ self._entries = _load_annotationsQ_A(annotations_jsonpath, split)
+ elif task == "VCR_QA-R":
+ self._entries = _load_annotationsQA_R(annotations_jsonpath, split)
+ else:
+ assert False
+ self._split = split
+ self._names = []
+ with open(self.task_conf['unisex_names_table']) as csv_file:
+ csv_reader = csv.reader(csv_file, delimiter=',')
+ for row in csv_reader:
+ if row[1] != 'name':
+ self._names.append(row[1])
+ self._feature_reader = feature_reader_dict[self.task_conf['feature_lmdb_path']]
+ self.use_gt_fea = task_conf.get('use_gt_fea', False)
+ if self.use_gt_fea:
+ self._gt_feature_reader = feature_reader_dict[self.task_conf['gt_feature_lmdb_path']]
+ self._max_region_num = self.task_conf.get('max_region_num', 100)
+ print("use gt featurre")
+ else:
+ self._max_region_num = self.task_conf.get('max_region_num', 37)
+ print("only butd feature")
+ self.tokenize()
+
+ def generate_random_name(self, det_names):
+ """
+ Replace "person" with a random name
+ """
+ random_name = []
+ for name in det_names:
+ if name == 'person':
+ word = random.choice(self._names)
+ else:
+ word = name
+ random_name.append(word)
+
+ return random_name
+
+ def replace_det_with_name(self, inputs, random_names):
+ """
+ Replace detection-tag index lists with the corresponding (randomized) object names
+ """
+ tokens = []
+ mask = []
+ for w in inputs:
+ if isinstance(w, list):
+ for idx in w:
+ word = random_names[idx]
+ tokens.append(word)
+ else:
+ word = w.encode('utf-8')
+ tokens.append(word)
+
+ return tokens, mask
+
+ def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
+ """
+ Truncates a sequence pair in place to the maximum length.
+ """
+ while True:
+ total_length = len(tokens_a) + len(tokens_b)
+ if total_length <= max_length:
+ break
+ if len(tokens_a) > len(tokens_b):
+ tokens_a.pop()
+ else:
+ tokens_b.pop()
+
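+    # Editor's note: e.g. with max_length=5, tokens_a=[1, 2, 3, 4] and
+    # tokens_b=[5, 6, 7] are trimmed to [1, 2, 3] and [5, 6]; the currently
+    # longer sequence loses a token from its tail at each step.
+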
+ def get_progress(self):
+ """
+ Return current progress of the training data
+ """
+ progress_dict = {"current_epoch": self.current_epoch,
+ "current_file_index": self.current_file_index,
+ "total_file": self.total_file,
+ "current_file": self.current_file
+ }
+ return progress_dict
+
+ def tokenize(self):
+ """
+ Tokenizes the captions.
+ """
+ # This will add caption_tokens in each entry of the dataset.
+ # -1 represents nil, and should be treated as padding_idx in embedding.
+ count = 0
+ for entry in self._entries:
+ det_names = entry["det_names"]
+ random_names = self.generate_random_name(det_names)
+ # replace with name
+ tokens_a, mask_a = self.replace_det_with_name(entry["question"], random_names)
+ q_str = " ".join(tokens_a)
+ ids_a = []
+ for i, q in enumerate(q_str.split(" [MARK] ")):
+ if i == 1:
+ ids_a.append(self.vocab["[SEP]"])
+ ids_a = ids_a + self.processor.convert_sentence_to_ids_without_cls(q)
+
+ input_ids_all = []
+ segment_ids_all = []
+ input_poss_all = []
+ input_len_all = []
+
+ for answer in entry["answers"]:
+ tokens_b, mask_b = self.replace_det_with_name(answer, random_names)
+ ids_b = self.processor.convert_sentence_to_ids_without_cls(" ".join(tokens_b))
+
+ self._truncate_seq_pair(ids_a, ids_b, self.max_seq_len - 3)
+
+ input_ids = []
+ segment_ids = []
+ input_ids.append(self.vocab["[CLS]"])
+ segment_ids.append(0)
+
+ for id in ids_a:
+ input_ids.append(id)
+ segment_ids.append(0)
+
+ input_ids.append(self.vocab["[SEP]"])
+ segment_ids.append(0)
+
+ assert len(ids_b) > 0
+ for id in ids_b:
+ input_ids.append(id)
+ segment_ids.append(1)
+ input_ids.append(self.vocab["[SEP]"])
+ segment_ids.append(1)
+
+ input_ids_all.append(input_ids)
+ segment_ids_all.append(segment_ids)
+ input_poss = [str(pos) for pos in range(len(input_ids))]
+ input_poss_all.append(input_poss)
+ input_len_all.append(len(input_ids))
+
+ entry["input_ids"] = input_ids_all
+ entry["input_poss"] = input_poss_all
+ entry["segment_ids"] = segment_ids_all
+ entry["input_lens"] = input_len_all
+
+ sys.stdout.write('%d/%d\r' % (count, len(self._entries)))
+ sys.stdout.flush()
+ count += 1
+
+ def parse_line(self, s_index):
+ """
+ Form slot info with the line information
+ """
+ entry = self._entries[s_index]
+ image_id = entry["img_id"]
+ image_fea_json = self._feature_reader[image_id]
+ features = image_fea_json["features"]
+ num_boxes = image_fea_json["num_boxes"]
+ boxes = image_fea_json["image_location"]
+ if not self.use_gt_fea:
+ num_boxes = min(num_boxes, self._max_region_num)
+ boxes = boxes[:num_boxes]
+ features = features[:num_boxes]
+ else:
+ boxes = boxes[:num_boxes]
+ features = features[:num_boxes]
+ image_fea_json = self._gt_feature_reader[image_id]
+ gt_features = image_fea_json["features"]
+ gt_num_boxes = image_fea_json["num_boxes"]
+ gt_boxes = image_fea_json["image_location"]
+ features[0] = (features[0] * num_boxes + gt_features[0] * gt_num_boxes) / (num_boxes + gt_num_boxes)
+
+ gt_boxes = gt_boxes[1: gt_num_boxes]
+ gt_features = gt_features[1: gt_num_boxes]
+ gt_num_boxes = gt_num_boxes - 1
+
+ gt_box_preserve = min(self._max_region_num - 1, gt_num_boxes)
+ gt_boxes = gt_boxes[:gt_box_preserve]
+ gt_features = gt_features[:gt_box_preserve]
+ gt_num_boxes = gt_box_preserve
+
+ num_box_preserve = min(self._max_region_num - int(gt_num_boxes), int(num_boxes))
+ boxes = boxes[:num_box_preserve]
+ features = features[:num_box_preserve]
+
+ # concatenate the boxes
+ mix_boxes = np.concatenate((boxes, gt_boxes), axis=0)
+ mix_features = np.concatenate((features, gt_features), axis=0)
+ mix_num_boxes = num_box_preserve + int(gt_num_boxes)
+
+ num_boxes = min(mix_num_boxes, self._max_region_num)
+ boxes = mix_boxes[:num_boxes]
+ features = mix_features[:num_boxes]
+ record = {
+ "input_ids": entry["input_ids"],
+ "input_pos": entry["input_poss"],
+ "segment_ids": entry["segment_ids"],
+ "input_lens": entry["input_lens"],
+ "target": int(entry["target"]),
+ "features": features,
+ "boxes": boxes,
+ "anno_id": entry["anno_id"]
+ }
+ return record
+
+ def data_generator(self):
+ """
+ Data_generator
+ """
+ sample_indice = list(range(len(self._entries)))
+ def wrapper():
+ """
+ Wrapper
+ """
+ for epoch_index in range(self.epoch):
+ if self._split == "train":
+ self.current_example = 0
+ self.current_epoch = epoch_index
+ if self.shuffle:
+ if epoch_index == 0:
+ self.global_rng.shuffle(sample_indice)
+ print("shuffle epoch %d" % epoch_index)
+ elif self.shuffle_every_epoch:
+ self.global_rng.shuffle(sample_indice)
+ print("shuffle epoch %d" % epoch_index)
+ batch_records = []
+ for index in sample_indice:
+ batch_records.append(self.parse_line(index))
+ if len(batch_records) == self.batch_size:
+ yield prepare_batch_data(
+ batch_records, self.num_choice, self.pad_id, \
+ self.task_index, self.task_num), self.task_conf['task']
+ batch_records = []
+ if len(batch_records) > 0:
+ yield prepare_batch_data(
+ batch_records, self.num_choice, self.pad_id, \
+ self.task_index, self.task_num), self.task_conf['task']
+ return wrapper
+
+
+class VCRDataJointReader(object):
+ """
+ Joint data reader for Q2A task and QA2R task
+ """
+ def __init__(self,
+ task_conf_group,
+ split,
+ batch_size=4096,
+ shuffle=True,
+ epoch=100,
+ vocab_path=None,
+ is_test=False):
+
+ self.task_readers = []
+ feature_reader_dict = {}
+ self.task_dup_cnt = []
+ for task_conf in task_conf_group:
+ if 'feature_lmdb_path' in task_conf:
+ if task_conf['feature_lmdb_path'] not in feature_reader_dict:
+ feature_reader_dict[task_conf['feature_lmdb_path']] = \
+ ImageFeaturesH5Reader(task_conf['feature_lmdb_path'])
+ if 'gt_feature_lmdb_path' in task_conf and task_conf.get('use_gt_fea', False):
+ if task_conf['gt_feature_lmdb_path'] not in feature_reader_dict:
+ feature_reader_dict[task_conf['gt_feature_lmdb_path']] = \
+ ImageFeaturesH5Reader(task_conf['gt_feature_lmdb_path'])
+ task_batch_size = task_conf.get('batch_size', 64)
+ self.task_dup_cnt.append(max(int(task_batch_size / batch_size), 1))
+ random_seed=np.random.randint(1000)
+ for task_index, task_conf in enumerate(task_conf_group):
+ self.task_readers.append(VCRDataReader(task_conf, split, vocab_path, batch_size, shuffle,
+ epoch, is_test, feature_reader_dict, random_seed, task_index, len(task_conf_group)))
+ self.task_generators = [reader.data_generator() for reader in self.task_readers]
+
+ def get_progress(self):
+ """
+ Return current progress of the training data
+ """
+ current_epoch = max([reader.current_epoch for reader in self.task_readers])
+ current_file_index = max([reader.current_file_index for reader in self.task_readers])
+ total_file = max([reader.total_file for reader in self.task_readers])
+ current_file = ""
+ self.progress_dict = {"current_epoch": current_epoch,
+ "current_file_index": current_file_index,
+ "total_file": total_file,
+ "current_file": current_file
+ }
+ return self.progress_dict
+
+ def data_generator(self):
+ """
+ Data_generator
+ """
+ def wrapper():
+ """
+ wrapper
+ """
+ task_buffer = [[] for i in range(len(self.task_dup_cnt))]
+ for data in itertools.izip(*[generator() for generator in self.task_generators]):
+ for i, d in enumerate(data):
+ task_buffer[i].append(d)
+ if len(task_buffer[i]) >= self.task_dup_cnt[i]:
+ for t in task_buffer[i]:
+ yield t[0]
+ task_buffer[i] = []
+
+ return wrapper
+
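+# --- Editor's illustrative sketch ----------------------------------------------
+# task_dup_cnt controls how many consecutive batches each sub-task contributes
+# before the joint reader moves on to the next one; it is derived above as
+# max(task_batch_size // batch_size, 1) from each task's config.  A toy
+# reproduction of that round-robin with plain lists:
+def _demo_joint_schedule():
+    """Interleave two toy task streams the way wrapper() above does."""
+    task_generators = [iter(["qa-%d" % i for i in range(3)]),
+                       iter(["qar-%d" % i for i in range(3)])]
+    task_dup_cnt = [1, 1]
+    task_buffer = [[] for _ in task_dup_cnt]
+    for data in zip(*task_generators):
+        for i, d in enumerate(data):
+            task_buffer[i].append(d)
+            if len(task_buffer[i]) >= task_dup_cnt[i]:
+                for t in task_buffer[i]:
+                    print(t)   # qa-0, qar-0, qa-1, qar-1, ...
+                task_buffer[i] = []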
+
+if __name__ == "__main__":
+ pass
diff --git a/ernie-vil/requirements.txt b/ernie-vil/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..525c143d4ad74c9a883b6f7767cd73207006de7b
--- /dev/null
+++ b/ernie-vil/requirements.txt
@@ -0,0 +1,8 @@
+nltk==3.2.4
+numpy==1.14.3
+scipy==1.2.1
+six==1.11.0
+json_lines==0.5.0
+lmdb==0.97
+opencv-python==3.2.0.8
+paddlepaddle-gpu==1.8.3.post97
diff --git a/ernie-vil/run_finetuning.sh b/ernie-vil/run_finetuning.sh
new file mode 100644
index 0000000000000000000000000000000000000000..7807240fd41f66c9e203c7384ddd7c34eb845b4f
--- /dev/null
+++ b/ernie-vil/run_finetuning.sh
@@ -0,0 +1,59 @@
+set -eu
+set -x
+
+#bash -x ./env.sh
+
+TASK_NAME=$1
+CONF_FILE=$2
+VOCAB_PATH=$3
+ERNIE_VIL_CONFIG=$4
+PRETRAIN_MODELS=$5
+
+source $CONF_FILE
+
+#configure your cuda and cudnn
+#configure nccl
+
+export FLAGS_fast_eager_deletion_mode=1
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_fraction_of_gpu_memory_to_use=0.98
+
+e_executor=$(echo ${use_experimental_executor-'True'} | tr '[A-Z]' '[a-z]')
+
+use_fuse=$(echo ${use_fuse-'False'} | tr '[A-Z]' '[a-z]')
+if [[ ${use_fuse} == "true" ]]; then
+ export FLAGS_fuse_parameter_memory_size=131072
+ export FLAGS_fuse_parameter_groups_size=10
+fi
+
+
+TASK_GROUP_JSON=./conf/$TASK_NAME/task_${TASK_NAME}.json
+
+gpu_cnt=`echo $CUDA_VISIBLE_DEVICES | awk -F"\t" '{len=split($0,vec,",");print len}'`
+echo "gpu_cnt", $gpu_cnt
+python finetune.py --use_cuda "True" \
+ --is_distributed "False" \
+ --use_fast_executor ${e_executor-"True"} \
+ --nccl_comm_num ${nccl_comm_num:-"1"} \
+ --batch_size $((BATCH_SIZE/gpu_cnt)) \
+ --do_train "True" \
+ --do_test "False" \
+ --task_name ${TASK_NAME} \
+ --vocab_path ${VOCAB_PATH} \
+ --task_group_json ${TASK_GROUP_JSON} \
+ --lr_scheduler ${lr_scheduler} \
+ --decay_steps ${decay_steps-""} \
+ --lr_decay_ratio ${lr_decay_ratio-0.1} \
+ --num_train_steps ${num_train_steps} \
+ --checkpoints $output_model_path \
+ --save_steps ${SAVE_STEPS} \
+ --init_checkpoint ${PRETRAIN_MODELS} \
+ --ernie_config_path ${ERNIE_VIL_CONFIG} \
+ --learning_rate ${LR_RATE} \
+ --warmup_steps ${WARMUP_STEPS} \
+ --weight_decay ${WEIGHT_DECAY:-0} \
+ --max_seq_len ${MAX_LEN} \
+ --validation_steps ${VALID_STEPS} \
+ --skip_steps 10
+
+
diff --git a/ernie-vil/run_inference.sh b/ernie-vil/run_inference.sh
new file mode 100644
index 0000000000000000000000000000000000000000..63893286fec7c88d44f13b00db0787052da20037
--- /dev/null
+++ b/ernie-vil/run_inference.sh
@@ -0,0 +1,48 @@
+set -eu
+
+#bash -x ./env.sh
+
+TASK_NAME=$1
+SUB_TASK_NAME=$2
+TEST_SPLIT=$3
+CONF_FILE=$4
+VOCAB_PATH=$5
+ERNIE_VIL_CONFIG=$6
+MODEL_PATH=$7
+RES_FILE=$8
+
+source $CONF_FILE
+
+#configure your cuda and cudnn
+#configure nccl
+
+export FLAGS_eager_delete_tensor_gb=2.0
+export FLAGS_fraction_of_gpu_memory_to_use=0.01
+export FLAGS_sync_nccl_allreduce=1
+
+e_executor=$(echo ${use_experimental_executor-'True'} | tr '[A-Z]' '[a-z]')
+
+use_fuse=$(echo ${use_fuse-'False'} | tr '[A-Z]' '[a-z]')
+if [[ ${use_fuse} == "true" ]]; then
+ export FLAGS_fuse_parameter_memory_size=131072
+ export FLAGS_fuse_parameter_groups_size=10
+fi
+
+TASK_GROUP_JSON=./conf/$TASK_NAME/task_${TASK_NAME}_${SUB_TASK_NAME}.json
+
+python finetune.py --use_cuda "True" \
+ --use_fast_executor ${e_executor-"True"} \
+ --batch_size ${BATCH_SIZE} \
+ --do_train "False" \
+ --do_test "True" \
+ --test_split ${TEST_SPLIT} \
+ --task_name $TASK_NAME \
+ --vocab_path ${VOCAB_PATH} \
+ --task_group_json ${TASK_GROUP_JSON} \
+ --result_file "$RES_FILE" \
+ --init_checkpoint "$MODEL_PATH" \
+ --ernie_config_path ${ERNIE_VIL_CONFIG} \
+ --max_seq_len ${MAX_LEN} \
+ --skip_steps 10
+
+
diff --git a/ernie-vil/utils/__init__.py b/ernie-vil/utils/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/ernie-vil/utils/args.py b/ernie-vil/utils/args.py
new file mode 100644
index 0000000000000000000000000000000000000000..a88528a8ae3ff42df932f62502e649d62e82e1b2
--- /dev/null
+++ b/ernie-vil/utils/args.py
@@ -0,0 +1,61 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Arguments for configuration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import six
+import argparse
+
+
+def str2bool(v):
+ """
+ because argparse does not support parsing strings such as "True"/"False"
+ into Python booleans directly
+ """
+ return v.lower() in ("true", "t", "1")
+
+
+class ArgumentGroup(object):
+ """
+ group of arguments
+ """
+ def __init__(self, parser, title, des):
+ self._group = parser.add_argument_group(title=title, description=des)
+
+ def add_arg(self, name, type, default, help, positional_arg=False, **kwargs):
+ """
+ add arg
+ """
+ prefix = "" if positional_arg else "--"
+ type = str2bool if type == bool else type
+ self._group.add_argument(
+ prefix + name,
+ default=default,
+ type=type,
+ help=help + ' Default: %(default)s.',
+ **kwargs)
+
+
+def print_arguments(args):
+ """
+ Arguments print function
+ """
+ print('----------- Configuration Arguments -----------')
+ for arg, value in sorted(six.iteritems(vars(args))):
+ print('%s: %s' % (arg, value))
+ print('------------------------------------------------')
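+
+# --- Editor's illustrative usage ------------------------------------------------
+# A minimal sketch of how these helpers are combined; the group and flag names
+# here are placeholders, not the actual arguments of finetune.py.
+def _demo_args():
+    """Build a parser with one argument group and print the parsed defaults."""
+    parser = argparse.ArgumentParser()
+    demo_g = ArgumentGroup(parser, "demo", "demo options.")
+    demo_g.add_arg("use_cuda", bool, True, "Run on GPU or not.")
+    demo_g.add_arg("batch_size", int, 64, "Examples per batch.")
+    args = parser.parse_args([])   # parse defaults only
+    print_arguments(args)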
diff --git a/ernie-vil/utils/init.py b/ernie-vil/utils/init.py
new file mode 100644
index 0000000000000000000000000000000000000000..faadca1b15a38b04122754ee41be35fd2430848c
--- /dev/null
+++ b/ernie-vil/utils/init.py
@@ -0,0 +1,71 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""parameters init function implementations"""
+
+
+from __future__ import print_function
+
+import os
+import six
+
+import numpy as np
+import paddle.fluid as fluid
+
+
+def init_checkpoint(exe, init_checkpoint_path, main_program):
+ """
+ init checkpoint params with lr and step info
+ """
+ assert os.path.exists(
+ init_checkpoint_path), "[%s] cann't be found." % init_checkpoint_path
+ def existed_persitables(var):
+ """
+ Check whether the variable is persistable and exists in the checkpoint dir
+ """
+ if not fluid.io.is_persistable(var):
+ return False
+ return os.path.exists(os.path.join(init_checkpoint_path, var.name))
+
+ fluid.io.load_vars(
+ exe,
+ init_checkpoint_path,
+ main_program=main_program,
+ predicate=existed_persitables)
+ print("Load model from {}".format(init_checkpoint_path))
+
+
+def init_pretraining_params(exe, pretraining_params_path, main_program):
+ """
+ init pretraining params without lr and step info
+ """
+ assert os.path.exists(pretraining_params_path
+ ), "[%s] cann't be found." % pretraining_params_path
+
+ def existed_params(var):
+ """
+ Check existed params
+ """
+ if not isinstance(var, fluid.framework.Parameter):
+ return False
+ return os.path.exists(os.path.join(pretraining_params_path, var.name))
+
+ fluid.io.load_vars(
+ exe,
+ pretraining_params_path,
+ main_program=main_program,
+ predicate=existed_params)
+ print("Load pretraining parameters from {}.".format(
+ pretraining_params_path))
+
diff --git a/requirements.txt b/requirements.txt
index e267a7738bb8f9067254bd7fe11fd992b8018504..9c9d2bc707935b6b0cad95f511d7130ce31d9a5c 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -6,4 +6,5 @@ scipy==1.2.1
six==1.11.0
sklearn==0.0
sentencepiece==0.1.8
+opencv-python==3.4.2.17
paddlepaddle-gpu==1.6.3.post107