diff --git a/ernie-vil/.meta/ernie_vil_struct.png b/ernie-vil/.meta/ernie_vil_struct.png
new file mode 100644
index 0000000000000000000000000000000000000000..cfa72e6116a2d2393f2bf12f25c98a66545c0698
Binary files /dev/null and b/ernie-vil/.meta/ernie_vil_struct.png differ
diff --git a/ernie-vil/README.md b/ernie-vil/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..c14475630ca7596dd638a839a6b7f42d27ff5bfa
--- /dev/null
+++ b/ernie-vil/README.md
@@ -0,0 +1,136 @@
+English | [简体中文](./README_zh.md)
+
+## _ERNIE-ViL_: Knowledge Enhanced Vision-Language Representations Through Scene Graph
+- [Framework](#framework)
+- [Pre-trained models](#pre-trained-models)
+- [Downstream tasks](#downstream-tasks)
+ * [VCR](#VCR)
+- [Usage](#usage)
+ * [Install PaddlePaddle](#install-paddlepaddle)
+ * [Fine-tuning on ERNIE-ViL](#fine-tuning-on-ernie-vil)
+ * [Inference](#inference)
+- [Citation](#citation)
+
+For a technical description of the algorithm, please see our paper:
+
+>[_**ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph**_](https://arxiv.org/abs/2006.16934)
+>
+>Fei Yu\*, Jiji Tang\*, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
+>
+>Preprint June 2020
+>
+
+
+
+ 
+
+
+**[ERNIE-ViL](https://arxiv.org/abs/2006.16934) is a knowledge-enhanced joint representation model for vision-language tasks**, and the first work to **introduce structured knowledge into vision-language pre-training**. Utilizing structured knowledge obtained
+from scene graphs, ERNIE-ViL constructs three **Scene Graph Prediction tasks**, i.e., the **Object Prediction**, **Attribute Prediction** and **Relationship Prediction** tasks.
+Thus, ERNIE-ViL learns better joint vision-language representations that characterize the alignment of detailed semantics across vision and language.
+
+
+
+## Framework
+
+Based on the scene graph parsed from the text with a Scene Graph Parser, we construct the Object Prediction, Attribute Prediction and Relationship Prediction tasks (a toy masking sketch follows the list):
+- **Object Prediction:** We randomly select a subset of the objects in the scene graph, then mask and predict the corresponding words in the sentence.
+- **Attribute Prediction:** For the object-attribute pairs in the scene graph, we randomly select a part of them, then mask and predict the words corresponding to the attribute nodes in the sentence.
+- **Relationship Prediction:** For the object-relationship-object triplets in the scene graph, we randomly select a part of the relationship nodes, then mask and predict them.
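+
+The snippet below is only a toy illustration of this masking idea, not the released pre-training code; the example caption, the node-to-token index and the `[MASK]` convention are assumptions made for the sketch.
+
+```python
+import random
+
+# Hypothetical caption and scene-graph nodes (token positions are 0-based).
+caption = "a brown dog is chasing a white cat".split()
+objects = {"dog": [2], "cat": [7]}            # object node -> token positions
+attributes = {"brown": [1], "white": [6]}     # attribute node -> token positions
+relationships = {"chasing": [4]}              # relationship node -> token positions
+
+def mask_nodes(tokens, node_index, ratio=0.3):
+    """Mask the tokens of randomly selected scene-graph nodes."""
+    masked = list(tokens)
+    for positions in node_index.values():
+        if random.random() < ratio:
+            for p in positions:
+                masked[p] = "[MASK]"
+    return masked
+
+print(mask_nodes(caption, objects))        # Object Prediction targets
+print(mask_nodes(caption, attributes))     # Attribute Prediction targets
+print(mask_nodes(caption, relationships))  # Relationship Prediction targets
+```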
+
+
+Model Architecture of ERNIE-ViL
+
+
+## Pre-trained Models
+ERNIE-ViL is pre-trained on large-scale image-text aligned datasets. We provide models of two scale settings, pre-trained on [**Conceptual Captions**](https://www.aclweb.org/anthology/P18-1238.pdf) and [**SBU Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio) (a small download sketch follows the list below).
+
+- [**ERNIE-ViL _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-en.1.tar.gz) (_lowercased | 12-text-stream-layer, 6-visual-stream-layer_)
+- [**ERNIE-ViL _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)
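+
+If you prefer to script the download, a minimal sketch is shown below; any HTTP client or `wget` works equally well, and the target directory name is only an example.
+
+```python
+import tarfile
+
+try:                                  # Python 3
+    from urllib.request import urlretrieve
+except ImportError:                   # Python 2.7, the version this repo targets
+    from urllib import urlretrieve
+
+url = "https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-en.1.tar.gz"
+archive = "model-ernie-vil-base-en.1.tar.gz"
+urlretrieve(url, archive)                        # download ERNIE-ViL base
+with tarfile.open(archive, "r:gz") as tar:
+    tar.extractall("pretrained-ernie-vil-base")  # extraction path is illustrative
+```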
+
+## Downstream tasks
+We fine-tune ERNIE-ViL on five vision-language downstream tasks, i.e., Visual Commonsense Reasoning ([**VCR**](https://openaccess.thecvf.com/content_CVPR_2019/papers/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.pdf)),
+Visual Question Answering ([**VQA**](https://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf)),
+Cross-modal Image Retrieval ([**IR**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166)),
+Cross-modal Text Retrieval ([**TR**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166)) and
+Region-to-Phrase Grounding ([**RefCOCO+**](https://www.aclweb.org/anthology/D14-1086.pdf)).
+
+_Code and pre-trained models for the VCR task are publicly available now; those for more downstream tasks are planned for release._
+
+### VCR
+ * Datasets
+    * The training, validation and testing data of the VCR task are provided by the [**VCR website**](https://visualcommonsense.com/download/).
+    * The organization of visual features follows [**ViLBERT**](https://github.com/jiasenlu/vilbert_beta); we directly use its data, which can be downloaded [here](https://github.com/jiasenlu/vilbert_beta/tree/master/data).
+    * Put all downloaded files under the directory `data/vcr` (a quick layout check is sketched at the end of this section).
+
+
+ * Task pre-training: We perform task pre-training (also known as task-specific pre-training) on the VCR task. The resulting models are:
+ * [**ERNIE-ViL-VCR-task-pretrain _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-VCR-task-pre-en.1.tar.gz)
+ * [**ERNIE-ViL-VCR-task-pretrain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-VCR-task-pre-en.1.tar.gz)
+ * Performance: Results of ERNIE-ViL on the VCR task, compared with the previous state-of-the-art pre-trained model [**VILLA**](https://arxiv.org/pdf/2006.06195.pdf):
+
+ | Models | Q->A | QA->R | Q->AR |
+ | :--------------------------------------| :---------------------------: | :----------------------------: | :-----------------------------: |
+ | VILLA (task-pretrain) _base_ | 75.54(76.4) | 78.78(79.1) | 59.75(60.6) |
+ | ERNIE-ViL (task-pretrain) _base_ | 76.37(77.0) | 79.65(80.3) | 61.24(62.1) |
+ | VILLA (task-pretrain) _large_ | 78.45(78.9) | 82.57(82.8) | 65.18(65.7) |
+ | ERNIE-ViL (task-pretrain) _large_ | 78.52(79.2) | 83.37(83.5) | 65.81(66.3) |
+
+ _Numerical results outside and inside parentheses represent the dev and test performance of the VCR task, respectively.
+ Test results are obtained from the [**VCR leaderboard**](https://visualcommonsense.com/leaderboard/)._
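+
+ Before fine-tuning, you can quickly check that the VCR data is in place. Below is a minimal sketch; the file names are taken from `conf/vcr/task_vcr.json`, so adjust the paths if you store the data elsewhere.
+
+ ```python
+ import os
+
+ expected = [
+     "train.jsonl", "val.jsonl", "test.jsonl",
+     "VCR_resnet101_faster_rcnn_genome_pickle2.lmdb",
+     "VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb",
+     "unisex_names_table.csv",
+ ]
+ for name in expected:
+     path = os.path.join("data/vcr", name)
+     print("%-55s %s" % (name, "OK" if os.path.exists(path) else "MISSING"))
+ ```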
+
+
+
+## Usage
+
+### Install PaddlePaddle
+
+This code has been tested with Paddle Fluid 1.8 and Python 2.7. The other dependencies of ERNIE-ViL are listed in `requirements.txt`; you can install them with
+ ```script
+ pip install -r requirements.txt
+ ```
+
+### Fine-tuning on ERNIE-ViL
+Please add the CUDA, cuDNN and NCCL2 library paths to LD_LIBRARY_PATH before fine-tuning. You can easily run fine-tuning through
+configuration files. For example, you can fine-tune the ERNIE-ViL model on the VCR task by
+```script
+ sh run_finetuning.sh vcr conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $pretrain_models
+```
+The files needed for fine-tuning, including the vocabulary, the model configuration file and the pre-trained parameters, are included in the download links above.
+Note that our fine-tuning experiments on VCR are carried out on 4 NVIDIA V100 (32GB) GPUs.
+If your GPU memory is not enough, you can reduce the batch size in the corresponding configuration file, e.g., `conf/vcr/model_conf_vcr`.
+
+
+
+### Inference
+
+ You can use the following commands to run inference with fine-tuned models. For example, for the different VCR sub-tasks (a sketch for post-processing the result file follows the commands):
+
+ **Task Q->A**
+
+ ```script
+ sh run_inference.sh vcr qa $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
+ ```
+ **Task QA->R**
+
+ ```script
+ sh run_inference.sh vcr qar $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
+ ```
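+
+ The result file written by `finetune.py` contains one line per example: the question id, the prediction, the label and the softmax scores (see `format_result` in `finetune.py`). A minimal sketch for recomputing accuracy from such a file is given below; the file name is illustrative, use whatever you passed as `$res_file`.
+
+ ```python
+ def accuracy_from_result_file(path):
+     """Recompute accuracy from a result file with lines: qid, pred, label, scores."""
+     correct, total = 0, 0
+     with open(path) as f:
+         for line in f:
+             qid, pred, label, scores = line.rstrip("\n").split("\t")
+             correct += int(pred == label)
+             total += 1
+     return float(correct) / total
+
+ print(accuracy_from_result_file("res_vcr_qa_val"))  # illustrative file name
+ ```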
+
+
+
+
+## Citation
+
+You can cite the paper as below:
+
+```
+@article{yu2020ernie,
+ title={ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph},
+ author={Yu, Fei and Tang, Jiji and Yin, Weichong and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
+ journal={arXiv preprint arXiv:2006.16934},
+ year={2020}
+}
+
+```
+
diff --git a/ernie-vil/README_zh.md b/ernie-vil/README_zh.md
new file mode 100644
index 0000000000000000000000000000000000000000..149fc6a93d71d078c62eef141091ce986420c97e
--- /dev/null
+++ b/ernie-vil/README_zh.md
@@ -0,0 +1,132 @@
+
+[English](./README.md) | 简体中文
+
+## _ERNIE-ViL_: Knowledge Enhanced Vision-Language Representations Through Scene Graph
+- [模型框架](#模型框架)
+- [预训练模型](#预训练模型)
+- [下游任务](#下游任务)
+ * [视觉推理](#视觉推理)
+- [使用说明](#使用说明)
+ * [安装飞桨](#安装飞桨)
+ * [运行微调](#运行微调)
+ * [预测](#预测)
+- [引用](#引用)
+
+关于算法的详细描述,请参见我们的论文
+
+>[_**ERNIE-ViL:Knowledge Enhanced Vision-Language Representations Through Scene Graph**_](https://arxiv.org/abs/2006.16934)
+>
+>Fei Yu\*, Jiji Tang\*, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
+>
+>Preprint June 2020
+>
+   
+
+
+---
+**ERNIE-ViL
+是面向视觉-语言任务的知识增强预训练框架**,首次在视觉-语言预训练中引入了结构化的知识。ERNIE-ViL利用场景图中的结构化知识,构建了**物体预测,属性预测,关系预测**三种预训练任务,精细地刻画了视觉-语言模态之间细粒度语义的对齐,从而获得了更好的视觉-语言联合表示。
+
+## 模型框架
+
+基于文本中解析出的场景图,ERNIE-ViL提出了三个多模态场景图预测任务:
+- **物体预测**:随机选取图中的一部分物体,然后对其在句子中对应的词进行掩码和预测;
+- **属性预测**:对于场景图中的属性-物体组合,随机选取一部分词对其中属性词进行掩码和预测;
+- **关系预测**:对于场景图中的物体-关系-物体三元组,对其中的关系词进行掩码和预测。
+
+
+
+ERNIE-ViL 场景图预训练任务结构
+
+## 预训练模型
+
+
+ERNIE-ViL使用大规模图文对齐数据集作为预训练数据,基于[**Conceptual
+Captions**](https://www.aclweb.org/anthology/P18-1238.pdf)和[**SBU
+Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio)数据集,训练和发布了两种参数规模的模型:
+
+- [**ERNIE-ViL _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-en.1.tar.gz) (_lowercased | 12-text-stream-layer, 6-visual-stream-layer_)
+- [**ERNIE-ViL _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)
+
+## 下游任务
+
+ERNIE-ViL在五个视觉语言下游任务进行了实验,包括[**视觉常识推理**](https://openaccess.thecvf.com/content_CVPR_2019/papers/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.pdf),
+[**视觉问答**](https://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf),
+[**跨模态图片检索**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166),
+[**跨模态文本检索**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166),
+[**引用式理解**](https://www.aclweb.org/anthology/D14-1086.pdf)。
+
+_当前仅开源视觉常识推理任务相关模型和代码,后续计划开源更多下游任务的模型和代码。_
+
+
+### **视觉常识推理**
+ * 数据集合
+ * 训练、验证和测试集合相关数据由[**视觉常识推理官网**](http://visualcommonsense.com/download/)提供;
+ * 视觉端特征的组织方式借鉴[**ViLBERT**](https://github.com/jiasenlu/vilbert_beta), 因此项目直接使用**ViLBERT**中的数据,数据[下载地址](https://github.com/jiasenlu/vilbert_beta/tree/master/data);
+ * 将所有获取的文件放在 data/vcr 目录下;
+
+
+ * 任务预训练: 在视觉推理任务中进行了任务预训练,预训练获得模型如下
+ * [**ERNIE-ViL-VCR-task-pretrain _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-VCR-task-pre-en.1.tar.gz)
+ * [**ERNIE-ViL-VCR-task-pretrain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-VCR-task-pre-en.1.tar.gz)
+ * 效果: ERNIE-ViL与之前最优预训练模型[**VILLA**](https://arxiv.org/pdf/2006.06195.pdf)在视觉常识推理任务上的效果对比如下:
+
+ | 模型 | Q->A | QA->R | Q->AR |
+ | :---------------------------------- | :---------------------------: | :----------------------------: | :---------------------------: |
+ | VILLA (task-pretrain) _base_ | 75.54(76.4) | 78.78(79.1) | 59.75(60.6) |
+ | ERNIE-ViL (task-pretrain) _base_ | 76.37(77.0) | 79.65(80.3) | 61.24(62.1) |
+ | VILLA (task-pretrain) _large_ | 78.45(78.9) | 82.57(82.8) | 65.18(65.7) |
+ | ERNIE-ViL (task-pretrain) _large_ | 78.52(79.2) | 83.37(83.5) | 65.81(66.3) |
+
+ _注:括号外表示验证集效果,括号内表示测试集效果,测试集效果由[VCR榜单](https://visualcommonsense.com/leaderboard/)提供。_
+
+
+## 使用说明
+
+### 安装飞桨
+
+ERNIE-ViL代码基于Paddle Fluid 1.8 和 Python 2.7, 依赖的其他模块也列举在 requirements.txt,可以通过下面的指令安装:
+ ```script
+ pip install -r requirements.txt
+ ```
+### 运行微调
+在运行 ERNIE-ViL 前,需要将 CUDA 、cuDNN 、NCCL2 的动态库路径添加到 LD_LIBRARY_PATH 。 我们把下游任务的参数配置文件放到了 conf/ ,可以简单地通过配置文件运行。 例如,您可以通过下面的指令在VCR上任务上进行微调:
+```script
+ sh run_finetuning.sh vcr conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $pretrain_models_params
+```
+前面提供的模型链接中包含了所有需要的文件, 包含词表文件,配置文件和预训练参数。VCR任务的微调实验是在 4 张32 GB 的英伟达V100 GPU上运行,如果您的GPU显存不够,可以考虑八张卡运行或者减小配置中的batch_size。
+_我们目前开放了预训练模型和VCR的任务代码,其他的下游任务可以参考任务自主尝试。_
+
+### 预测
+基于已经训练的模型,您可以通过下面的命令测试VCR的效果:
+
+ **Task Q->A**
+
+ ```script
+ sh run_inference.sh vcr qa $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
+ ```
+ **Task QA->R**
+
+ ```script
+ sh run_inference.sh vcr qar $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
+ ```
+
+
+ VCR的测试可以在一张32GB的英伟达V100 GPU上运行,测试的结果包含Q->A 任务、QA->R任务和Q->AR任务,其中Q->AR任务由前两个任务结果合并所得。
+
+
+
+## 引用
+
+可以按下面的格式引用我们的论文:
+
+```
+@article{yu2020ernie,
+ title={ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph},
+ author={Yu, Fei and Tang, Jiji and Yin, Weichong and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
+ journal={arXiv preprint arXiv:2006.16934},
+ year={2020}
+}
+
+```
+
diff --git a/ernie-vil/args/__init__.py b/ernie-vil/args/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/ernie-vil/args/finetune_args.py b/ernie-vil/args/finetune_args.py
new file mode 100644
index 0000000000000000000000000000000000000000..dd034c673bfe0bceee053f293eff9fc8fba36c15
--- /dev/null
+++ b/ernie-vil/args/finetune_args.py
@@ -0,0 +1,79 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" args defination and default value """
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import time
+import argparse
+
+from utils.args import ArgumentGroup, print_arguments
+
+# yapf: disable
+parser = argparse.ArgumentParser(__doc__)
+model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
+model_g.add_arg("ernie_config_path", str, "./config/ernie_config.json", "json file path for ernie model config.")
+model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
+model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
+model_g.add_arg("task_name", str, "vcr", "Task to finetune on ERNIE-ViL")
+
+train_g = ArgumentGroup(parser, "training", "training options.")
+train_g.add_arg("epoch", int, 100, "Number of epoches for training.")
+train_g.add_arg("learning_rate", float, 0.0001, "Learning rate used to train with warmup.")
+train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
+ "scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay', 'manual_warmup_decay'])
+train_g.add_arg("decay_steps", str, "", "learning rate decay steps, list with ;")
+train_g.add_arg("lr_decay_ratio", float, 0.1, "learning rate decay ratio, used with manual_warmup_decay")
+train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
+train_g.add_arg("num_train_steps", int, 1000000, "Total steps to perform pretraining.")
+train_g.add_arg("warmup_steps", int, 0, "Total steps to perform warmup when pretraining.")
+train_g.add_arg("save_steps", int, 100, "The steps interval to save checkpoints.")
+train_g.add_arg("validation_steps", int, 6000, "The steps interval to evaluate model performance.")
+train_g.add_arg("use_fuse", bool, False, "Whether to use fuse_allreduce_ops.")
+train_g.add_arg("nccl_comm_num", int, 1, "NCCL comm num.")
+train_g.add_arg("hierarchical_allreduce_inter_nranks", int, 8, "Hierarchical allreduce inter ranks.")
+train_g.add_arg("use_hierarchical_allreduce", bool, False, "Use hierarchical allreduce or not.")
+train_g.add_arg("use_gpu", bool, True, "Whether to gpu.")
+
+log_g = ArgumentGroup(parser, "logging", "logging related.")
+log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
+log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
+
+data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
+data_g.add_arg("result_file", str, "./res_tmp", "file to storage results")
+data_g.add_arg("lr_decay_dict_file", str, "", "learning rate decay files.")
+data_g.add_arg("train_filelist", str, "", "Path to training filelist.")
+data_g.add_arg("valid_filelist", str, "", "Path to valid filelist.")
+data_g.add_arg("test_filelist", str, "", "Path to test filelist.")
+data_g.add_arg("vocab_path", str, "./config/vocab.txt", "Vocabulary path.")
+data_g.add_arg("test_split", str, "val", "test of sub tasks, val or test")
+data_g.add_arg("max_seq_len", int, 128, "Number of words of the longest seqence.")
+data_g.add_arg("max_img_len", int, 100, "Number of image rois of the longest seqence.")
+data_g.add_arg("feature_size", int, 2048, "Number of roi feature size of image.")
+data_g.add_arg("fusion_method", str, "sum", "Number of roi feature size of image.")
+data_g.add_arg("batch_size", int, 16, "Total examples' number in batch for training. see also --in_tokens.")
+data_g.add_arg("task_group_json", str, "", "Path to task json")
+
+run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
+run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.")
+run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
+run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
+run_type_g.add_arg("do_train", bool, False, "Whether to perform evaluation on test data set.")
+run_type_g.add_arg("do_test", bool, False, "Whether to perform evaluation on test data set.")
+run_type_g.add_arg("output_file", str, "", "The output file to save model output.")
+# yapf: enable
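+
+
+if __name__ == "__main__":
+    # Illustrative only: print the default argument values when this module is
+    # run directly, mirroring how finetune.py consumes the parser via parse_args().
+    print_arguments(parser.parse_args())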
diff --git a/ernie-vil/batching/__init__.py b/ernie-vil/batching/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/ernie-vil/batching/finetune_batching.py b/ernie-vil/batching/finetune_batching.py
new file mode 100644
index 0000000000000000000000000000000000000000..c9527bfd2d5570e567bd171b250db44202ba2b11
--- /dev/null
+++ b/ernie-vil/batching/finetune_batching.py
@@ -0,0 +1,97 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" prepare data format for finetuning tasks """
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from six.moves import xrange
+
+
+def prepare_batch_data(batch_records, num_choice, pad_id, task_index, task_num):
+ """
+ prepare batch data for finetuning tasks
+ """
+ batch_input_ids = []
+ batch_input_pos = []
+ batch_seg_ids = []
+ batch_input_masks = []
+ num_sample = len(batch_records)
+ batch_lens = [record["input_lens"] for record in batch_records]
+ batch_labels = [record["target"] for record in batch_records]
+ binary_labels = np.zeros([num_choice * num_sample, 1], dtype='float32')
+ for i, l in enumerate(batch_labels):
+ binary_labels[i * num_choice + l] = 1.0
+ labels = np.array(batch_labels).astype("int64").reshape([-1, 1])
+ image_features = [record["features"] for record in batch_records]
+ image_boxes = [record["boxes"] for record in batch_records]
+ batch_anno_ids = np.array([record["anno_id"] for record in batch_records]).astype("int64").reshape([-1, 1])
+ max_len = max([max(lens) for lens in batch_lens])
+ for i in range(len(batch_records)):
+ batch_input_ids.append([inst + list([pad_id] * (max_len - len(inst))) \
+ for inst in batch_records[i]["input_ids"]])
+ batch_input_pos.append([inst + list([pad_id] * (max_len - len(inst))) \
+ for inst in batch_records[i]["input_pos"]])
+ batch_seg_ids.append([inst + list([pad_id] * (max_len - len(inst))) \
+ for inst in batch_records[i]["segment_ids"]])
+ batch_input_masks.append([[1] * len(inst) + [0] * (max_len - len(inst)) \
+ for inst in batch_records[i]["input_ids"]])
+
+ image_embedding, image_mask = pad_feature_data(image_features, return_mask=True)
+ image_loc = pad_feature_data(image_boxes)
+ src_ids = np.array(batch_input_ids).astype("int64").reshape([num_choice * num_sample, max_len, 1])
+ src_pos = np.array(batch_input_pos).astype("int64").reshape([num_choice * num_sample, max_len, 1])
+ src_seg = np.array(batch_seg_ids).astype("int64").reshape([num_choice * num_sample, max_len, 1])
+ src_masks = np.array(batch_input_masks).astype("float32").reshape([num_choice * num_sample, max_len, 1])
+ src_task = np.zeros(src_ids.shape, dtype="int64")
+ batch, seq_len, fea_len = image_embedding.shape
+ image_embedding = np.tile(np.expand_dims(image_embedding, axis=1), \
+ (1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, fea_len])
+ image_mask = np.tile(np.expand_dims(image_mask, axis=1), \
+ (1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, 1])
+ image_loc = np.tile(np.expand_dims(image_loc, axis=1), \
+ (1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, 5])
+ return_list = [src_ids, src_pos, src_seg, src_task, src_masks, \
+ image_embedding, image_loc, image_mask, labels, batch_anno_ids]
+ return_list.append(np.array([task_index]).astype('int64'))
+ return_list.append(binary_labels)
+ for i in xrange(task_num):
+ if i == task_index:
+ return_list.append(np.array([1.0]).astype("float32"))
+ else:
+ return_list.append(np.array([0.0]).astype("float32"))
+ return return_list
+
+
+def pad_feature_data(data, pad_value=0.0, dtype="float32", return_mask=False):
+ """
+ pad visual features with given pad value
+ """
+    max_len = max([len(item) for item in data])
+    data_width = len(data[0][0])
+    out_data = np.ones((len(data), max_len, data_width), dtype=dtype) * pad_value
+    out_mask = np.zeros((len(data), max_len, 1), dtype=dtype)
+    for i in range(len(data)):
+        out_data[i, 0:len(data[i]), :] = data[i]
+        if return_mask:
+            out_mask[i, 0:len(data[i]), :] = 1.0
+ if return_mask:
+ return out_data, out_mask
+ else:
+ return out_data
+
+if __name__ == "__main__":
+ pass
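+
+    # Minimal smoke test (illustrative only): pad two fake RoI feature sequences
+    # of different lengths and check the padded shapes and the returned mask.
+    feats = [np.random.rand(3, 4), np.random.rand(5, 4)]
+    padded, mask = pad_feature_data(feats, return_mask=True)
+    print(padded.shape, mask.shape)  # (2, 5, 4) (2, 5, 1)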
diff --git a/ernie-vil/conf/vcr/model_conf_vcr b/ernie-vil/conf/vcr/model_conf_vcr
new file mode 100644
index 0000000000000000000000000000000000000000..d683cbff17d285ed369ed39a432cd1e8eb920885
--- /dev/null
+++ b/ernie-vil/conf/vcr/model_conf_vcr
@@ -0,0 +1,12 @@
+output_model_path="output_vcr"
+lr_scheduler="manual_warmup_decay"
+decay_steps="13308;19962"
+lr_decay_ratio=0.1
+num_train_steps=26640
+SAVE_STEPS=6660
+WARMUP_STEPS=6654
+BATCH_SIZE=64
+VALID_STEPS=20000
+LR_RATE=2e-5
+WEIGHT_DECAY=0.01
+MAX_LEN=80
diff --git a/ernie-vil/conf/vcr/task_vcr.json b/ernie-vil/conf/vcr/task_vcr.json
new file mode 100644
index 0000000000000000000000000000000000000000..9ac9d56d24f05591f29456b8ac50cf603faabcd4
--- /dev/null
+++ b/ernie-vil/conf/vcr/task_vcr.json
@@ -0,0 +1,42 @@
+[
+{
+"task": "VCR_Q-A",
+"num_choice": 4,
+"annotations_jsonpath_train": "./data/vcr/train.jsonl",
+"annotations_jsonpath_val": "./data/vcr/val.jsonl",
+"annotations_jsonpath_test": "./data/vcr/test.jsonl",
+"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"unisex_names_table" : "./data/vcr/unisex_names_table.csv",
+"Proprocessor": "PreprocessorBasic",
+"tokenizer_name" : "FullTokenizer",
+"fusion_method" : "mul",
+"dropout_rate" : 0.1,
+"max_seq_len" : 60,
+"use_gt_fea" : true,
+"shufflekeep_across_task": true,
+"shuffle_every_epoch": true,
+"task_weight": 1.0,
+"task_prefix": "vcr_qa"
+},
+{
+"task": "VCR_QA-R",
+"num_choice": 4,
+"annotations_jsonpath_train": "./data/vcr/train.jsonl",
+"annotations_jsonpath_val": "./data/vcr/val.jsonl",
+"annotations_jsonpath_test": "./data/vcr/test.jsonl",
+"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"unisex_names_table" : "./data/vcr/unisex_names_table.csv",
+"Proprocessor": "PreprocessorBasic",
+"tokenizer_name" : "FullTokenizer",
+"fusion_method" : "mul",
+"dropout_rate" : 0.1,
+"max_seq_len" : 80,
+"use_gt_fea" : true,
+"shufflekeep_across_task": true,
+"shuffle_every_epoch" : true,
+"task_weight": 1.0,
+"task_prefix": "vcr_qar"
+}
+]
diff --git a/ernie-vil/conf/vcr/task_vcr_qa.json b/ernie-vil/conf/vcr/task_vcr_qa.json
new file mode 100644
index 0000000000000000000000000000000000000000..c2b4afa714046a94e8f7720506721cc6edde5894
--- /dev/null
+++ b/ernie-vil/conf/vcr/task_vcr_qa.json
@@ -0,0 +1,21 @@
+[
+{
+"task": "VCR_Q-A",
+"num_choice": 4,
+"annotations_jsonpath_train": "./data/vcr/train.jsonl",
+"annotations_jsonpath_val": "./data/vcr/val.jsonl",
+"annotations_jsonpath_test": "./data/vcr/test.jsonl",
+"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"unisex_names_table" : "./data/vcr/unisex_names_table.csv",
+"Proprocessor": "PreprocessorBasic",
+"tokenizer_name" : "FullTokenizer",
+"tagger_path" : "./script/ntc.pickle",
+"nltk_data_path" : "./nltk_data",
+"fusion_method" : "mul",
+"dropout_rate" : 0.1,
+"max_seq_len" : 60,
+"use_gt_fea" : true,
+"task_prefix" : "vcr_qa"
+}
+]
diff --git a/ernie-vil/conf/vcr/task_vcr_qar.json b/ernie-vil/conf/vcr/task_vcr_qar.json
new file mode 100644
index 0000000000000000000000000000000000000000..8f4c88021f2666ce1779ea47ffd6014e67fb91ad
--- /dev/null
+++ b/ernie-vil/conf/vcr/task_vcr_qar.json
@@ -0,0 +1,22 @@
+[
+{
+"task": "VCR_QA-R",
+"num_choice": 4,
+"annotations_jsonpath_train": "./data/vcr/train.jsonl",
+"annotations_jsonpath_val": "./data/vcr/val.jsonl",
+"annotations_jsonpath_test": "./data/vcr/test.jsonl",
+"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb",
+"unisex_names_table" : "./data/vcr/unisex_names_table.csv",
+"Proprocessor": "PreprocessorBasic",
+"tokenizer_name" : "FullTokenizer",
+"vocab_path" : "./package/vocab.txt",
+"tagger_path" : "./script/ntc.pickle",
+"nltk_data_path" : "./nltk_data",
+"fusion_method" : "mul",
+"dropout_rate" : 0.1,
+"max_seq_len" : 80,
+"use_gt_fea" : true,
+"task_prefix" : "vcr_qa"
+}
+]
diff --git a/ernie-vil/finetune.py b/ernie-vil/finetune.py
new file mode 100755
index 0000000000000000000000000000000000000000..dbee99a0a5d6096a3ae9e902f3137f227470e2af
--- /dev/null
+++ b/ernie-vil/finetune.py
@@ -0,0 +1,465 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" finetuning vison-language task """
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import sys
+import time
+import datetime
+import argparse
+import numpy as np
+import multiprocessing
+import json
+
+from reader.vcr_finetuning import VCRDataJointReader
+from model.ernie_vil import ErnieVilModel, ErnieVilConfig
+from optim.optimization import optimization
+from utils.args import print_arguments
+from utils.init import init_checkpoint, init_pretraining_params
+from args.finetune_args import parser
+
+import paddle.fluid as fluid
+
+args = parser.parse_args()
+
+# yapf: enable.
+
+#READERS = {"vcr": VCRDataJointReader, "vqa": VQADataReader, "refcoco+": RefcocoReader, "flickr": FlickrReader}
+READERS = {"vcr": VCRDataJointReader}
+
+def format_result(res_arr, qids, pred, labels, scores):
+ """
+    format batch results into tab-separated lines: qid, prediction, label, softmax scores
+ """
+ for i in range(len(qids)):
+ res="\t".join([str(qids[i]), str(pred[i]), str(labels[i]), " ".join(["%.5f" % s for s in scores[i]])])
+ res_arr.append(res)
+ return res_arr
+
+
+def create_vcr_model(pyreader_name, ernie_config, task_group, is_prediction=False):
+ """
+ create model arc for vcr tasks
+ """
+ shapes = [[-1, args.max_seq_len, 1], #src_id
+ [-1, args.max_seq_len, 1], #pos_id
+ [-1, args.max_seq_len, 1], #sent_id
+ [-1, args.max_seq_len, 1], #task_id
+ [-1, args.max_seq_len, 1], #input_mask
+ [-1, args.max_img_len, args.feature_size], #image_embedding
+ [-1, args.max_img_len, 5], #image_loc
+ [-1, args.max_img_len, 1], #image_mask
+ [-1, 1], #labels
+ [-1, 1], #qids
+ [], #task_index
+ [-1, 1], #binary_labels
+ ]
+ dtypes = ['int64', 'int64', 'int64', 'int64', 'float32', 'float32', 'float32', 'float32',
+ 'int64', 'int64', 'int64', 'float32']
+ lod_levels = [0] * len(dtypes)
+
+ for _ in task_group:
+ shapes.append([])
+ dtypes.append('float')
+ lod_levels.append(0)
+
+ pyreader = fluid.layers.py_reader(
+ capacity=30,
+ shapes=shapes,
+ dtypes=dtypes,
+ lod_levels=lod_levels,
+ name=pyreader_name,
+ use_double_buffer=False)
+
+ inputs = fluid.layers.read_file(pyreader)
+ src_ids, pos_ids, sent_ids, task_ids, input_mask, image_embeddings, \
+ image_loc, image_mask, labels, q_ids, task_index, binary_labels = inputs[: 12]
+
+ ernie_vil = ErnieVilModel(
+ src_ids=src_ids,
+ position_ids=pos_ids,
+ sentence_ids=sent_ids,
+ task_ids=task_ids,
+ input_mask=input_mask,
+ image_embeddings=image_embeddings,
+ image_loc=image_loc,
+ input_image_mask=image_mask,
+ config=ernie_config
+ )
+
+ h_cls, h_img = ernie_vil.get_pooled_output()
+ task_conf = task_group[0]
+ fusion_method = task_conf["fusion_method"]
+ fusion_fea = ernie_vil.get_match_score(text=h_cls, image=h_img, \
+ dropout_rate=task_conf["dropout_rate"],
+ mode=fusion_method)
+ if is_prediction:
+ num_choice = int(task_conf['num_choice'])
+ task_name = task_conf.get('task_prefix', 'vcr')
+ score = fluid.layers.fc(fusion_fea, 1,
+ param_attr = fluid.ParamAttr(name = task_name + "_fc.w_0",
+ initializer = fluid.initializer.TruncatedNormal(scale = 0.02)),
+ bias_attr = task_name + "_fc.b_0")
+ score = fluid.layers.reshape(score, shape = [-1, num_choice])
+ _loss, _softmax = fluid.layers.softmax_with_cross_entropy(logits = score,
+ label = labels, return_softmax = True)
+ _acc = fluid.layers.accuracy(input = _softmax, label = labels)
+ pred = fluid.layers.argmax(score, axis = 1)
+ mean_loss = fluid.layers.mean(_loss)
+ task_vars = [mean_loss, _acc, pred, q_ids, labels, _softmax]
+ for var in task_vars:
+ var.persistable = True
+ return pyreader, task_vars
+ else:
+ start_ind = 12
+ mean_loss = fluid.layers.zeros(shape = [1], dtype = 'float32')
+ mean_acc = fluid.layers.zeros(shape = [1], dtype = 'float32')
+ for task_conf in task_group:
+ task_weight = inputs[start_ind]
+ start_ind += 1
+ num_choice = int(task_conf['num_choice'])
+ task_name = task_conf.get('task_prefix', 'vcr')
+ score = fluid.layers.fc(fusion_fea, 1,
+ param_attr = fluid.ParamAttr(name = task_name + "_fc.w_0",
+ initializer = fluid.initializer.TruncatedNormal(scale = 0.02)),
+ bias_attr = task_name + "_fc.b_0")
+
+ _loss = fluid.layers.sigmoid_cross_entropy_with_logits(score,
+ binary_labels, name = "cross_entropy_loss")
+ tmp_score = fluid.layers.reshape(score, shape = [-1, num_choice])
+ _softmax = fluid.layers.softmax(tmp_score)
+ _acc = fluid.layers.accuracy(input = _softmax, label = labels)
+ _mean_loss = fluid.layers.mean(_loss)
+ mean_loss += _mean_loss * task_weight
+ mean_acc += _acc * task_weight
+ task_vars = [fluid.layers.reduce_mean(mean_loss), mean_acc]
+ for var in task_vars:
+ var.persistable = True
+
+ return pyreader, task_vars
+
+
+#MODELS = {"vcr": create_vcr_model, "vqa": create_vqa_model, "refcoco+": create_refcoco_model}
+MODELS = {"vcr": create_vcr_model}
+
+def predict_wrapper(args,
+ exe,
+ ernie_config,
+ task_group,
+ test_prog=None,
+ pyreader=None,
+ graph_vars=None):
+ """Context to do validation.
+ """
+ reader_name = READERS[args.task_name]
+ data_reader = reader_name(
+ task_group,
+ split=args.test_split,
+ vocab_path=args.vocab_path,
+ is_test=True,
+ shuffle=False,
+ batch_size=args.batch_size,
+ epoch=args.epoch)
+ if args.do_test:
+ assert args.init_checkpoint is not None, "[FATAL] Please use --init_checkpoint '/path/to/checkpoints' \
+            to specify your pretrained model checkpoints"
+
+ init_pretraining_params(exe, args.init_checkpoint, test_prog)
+ print(("testing on %s %s split") % (args.task_name, args.test_split))
+
+ def predict(exe=exe, pyreader=pyreader):
+ """
+ inference for downstream tasks
+ """
+ pyreader.decorate_tensor_provider(data_reader.data_generator())
+ pyreader.start()
+
+ cost = 0
+ appear_step = 0
+ task_acc = {}
+ task_steps = {}
+ steps = 0
+ case_f1 = 0
+ appear_f1 = 0
+ time_begin = time.time()
+ task_name_list = [v.name for v in graph_vars]
+ fetch_list = task_name_list
+
+ print('task name list : ', task_name_list)
+ sum_acc = 0
+ res_arr = []
+ while True:
+ try:
+ outputs = exe.run(fetch_list=fetch_list, program=test_prog)
+ each_acc = outputs[1][0]
+ preds = np.reshape(outputs[2], [-1])
+ qids = np.reshape(outputs[3], [-1])
+ labels = np.reshape(outputs[4], [-1])
+ scores = np.reshape(outputs[5], [-1, 4])
+ sum_acc += each_acc
+ steps += 1
+ if steps % 10 == 0:
+ print('cur_step:', steps, 'cur_acc:', sum_acc / steps)
+ format_result(res_arr, qids.tolist(), preds.tolist(), labels.tolist(), scores.tolist())
+ except fluid.core.EOFException:
+ pyreader.reset()
+ break
+
+ used_time = time.time() - time_begin
+
+ with open(args.result_file, "w") as f:
+ for r in res_arr:
+ f.write(r + "\n")
+
+ print("average_acc:", sum_acc / steps)
+ ret = {}
+ ret["acc"] = "acc: %f" % (sum_acc / steps)
+ for item in ret:
+ try:
+ ret[item] = ret[item].split(':')[-1]
+ except:
+ pass
+ return ret
+ return predict
+
+
+def get_optimizer(total_loss, train_program, startup_prog, args):
+ """
+ optimization func
+ """
+ decay_steps_str=args.decay_steps
+ if decay_steps_str == "":
+ decay_steps = []
+ else:
+ decay_steps = [int(s) for s in decay_steps_str.split(";")]
+ scheduled_lr = optimization(
+ loss=total_loss,
+ warmup_steps=args.warmup_steps,
+ num_train_steps=args.num_train_steps,
+ learning_rate=args.learning_rate,
+ train_program=train_program,
+ startup_prog=startup_prog,
+ weight_decay=args.weight_decay,
+ scheduler=args.lr_scheduler,
+ decay_steps=decay_steps,
+ lr_decay_ratio=args.lr_decay_ratio)
+ return scheduled_lr
+
+
+def main(args):
+ """
+ Main func for downstream tasks
+ """
+ print("finetuning tasks start")
+ ernie_config = ErnieVilConfig(args.ernie_config_path)
+ ernie_config.print_config()
+
+ with open(args.task_group_json) as f:
+ task_group = json.load(f)
+ print('task: ', task_group)
+
+ startup_prog = fluid.Program()
+ if args.do_train and args.do_test:
+ print("can not set both do_train and do_test as True")
+ return
+
+ model_name = MODELS[args.task_name]
+ if args.do_train:
+ train_program = fluid.Program()
+ with fluid.program_guard(train_program, startup_prog):
+ with fluid.unique_name.guard():
+ train_pyreader, model_outputs = model_name(
+ pyreader_name='train_reader', ernie_config=ernie_config, task_group=task_group)
+
+ total_loss = model_outputs[0]
+ scheduled_lr = get_optimizer(total_loss, train_program, startup_prog, args)
+ if args.do_test:
+ test_prog = fluid.Program()
+ with fluid.program_guard(test_prog, startup_prog):
+ with fluid.unique_name.guard():
+ test_pyreader, model_outputs = model_name(
+ pyreader_name='test_reader', ernie_config=ernie_config, task_group=task_group, is_prediction=True)
+ total_loss = model_outputs[0]
+
+ test_prog = test_prog.clone(for_test=True)
+
+ if args.use_gpu:
+ gpu_id = 0
+ if os.getenv("FLAGS_selected_gpus"):
+ gpu_id = int(os.getenv("FLAGS_selected_gpus"))
+ place = fluid.CUDAPlace(gpu_id) if args.use_gpu else fluid.CPUPlace()
+
+ print("theoretical memory usage: ")
+ if args.do_train:
+ print(fluid.contrib.memory_usage(
+ program=train_program, batch_size=args.batch_size))
+ if args.do_test:
+ print(fluid.contrib.memory_usage(
+ program=test_prog, batch_size=args.batch_size))
+
+ nccl2_num_trainers = 1
+ nccl2_trainer_id = 0
+ print("args.is_distributed:", args.is_distributed)
+ trainer_id = 0
+ if args.is_distributed:
+ trainer_id = int(os.getenv("PADDLE_TRAINER_ID"))
+ worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS")
+ current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
+ worker_endpoints = worker_endpoints_env.split(",")
+ trainers_num = len(worker_endpoints)
+
+ print("worker_endpoints:{} trainers_num:{} current_endpoint:{} \
+ trainer_id:{}".format(worker_endpoints, trainers_num,
+ current_endpoint, trainer_id))
+
+ # prepare nccl2 env.
+ config = fluid.DistributeTranspilerConfig()
+ config.mode = "nccl2"
+ if args.nccl_comm_num > 1:
+ config.nccl_comm_num = args.nccl_comm_num
+ if args.use_hierarchical_allreduce and trainers_num > args.hierarchical_allreduce_inter_nranks:
+ config.use_hierarchical_allreduce=args.use_hierarchical_allreduce
+ config.hierarchical_allreduce_inter_nranks=args.hierarchical_allreduce_inter_nranks
+
+ assert config.hierarchical_allreduce_inter_nranks > 1
+ assert trainers_num % config.hierarchical_allreduce_inter_nranks == 0
+
+ config.hierarchical_allreduce_exter_nranks = \
+ trainers_num / config.hierarchical_allreduce_inter_nranks
+
+ t = fluid.DistributeTranspiler(config=config)
+ t.transpile(
+ trainer_id,
+ trainers=worker_endpoints_env,
+ current_endpoint=current_endpoint,
+ program=train_program,
+ startup_program=startup_prog)
+
+ nccl2_num_trainers = trainers_num
+ nccl2_trainer_id = trainer_id
+
+ exe = fluid.Executor(place)
+ exe.run(startup_prog)
+
+ if args.do_train:
+ if args.init_checkpoint and args.init_checkpoint != "":
+ sys.stderr.write('############################WARNING############################')
+ sys.stderr.write('####### using init_pretraining_params, not init_checkpoint ####')
+ sys.stderr.write('## meaning hyper param e.g. lr won\'t inherit from checkpoint##')
+ sys.stderr.write('###############################################################')
+ init_pretraining_params(exe, args.init_checkpoint, train_program)
+
+ reader_name=READERS[args.task_name]
+ data_reader = reader_name(
+ task_group,
+ split="train",
+ vocab_path=args.vocab_path,
+ batch_size=args.batch_size,
+ epoch=args.epoch,)
+
+ exec_strategy = fluid.ExecutionStrategy()
+ if args.use_fast_executor:
+ exec_strategy.use_experimental_executor = True
+ exec_strategy.num_threads = 2
+
+ exec_strategy.num_iteration_per_drop_scope = min(10, args.skip_steps)
+
+ build_strategy = fluid.compiler.BuildStrategy()
+ build_strategy.fuse_all_reduce_ops = False
+
+ if args.use_fuse:
+ build_strategy.fuse_all_reduce_ops = True
+
+ if args.do_train:
+ train_exe = fluid.ParallelExecutor(
+ use_cuda=args.use_cuda,
+ loss_name=total_loss.name,
+ build_strategy=build_strategy,
+ exec_strategy=exec_strategy,
+ main_program=train_program,
+ num_trainers=nccl2_num_trainers,
+ trainer_id=nccl2_trainer_id)
+
+ if args.do_test:
+ predict = predict_wrapper(
+ args,
+ exe,
+ ernie_config,
+ task_group,
+ test_prog=test_prog,
+ pyreader=test_pyreader,
+ graph_vars=model_outputs)
+ result = predict()
+
+ if args.do_train:
+ train_pyreader.decorate_tensor_provider(data_reader.data_generator())
+ train_pyreader.start()
+ steps = 0
+ time_begin = time.time()
+ node_nums = 1 #int(os.getenv("PADDLE_NODES_NUM"))
+ used_time_all = 0
+ while steps < args.num_train_steps:
+ try:
+ steps += node_nums
+ skip_steps = args.skip_steps * node_nums
+ fetch_list = []
+ if nccl2_trainer_id == 0 and steps % skip_steps == 0:
+ task_name_list = [v.name for v in model_outputs]
+ fetch_list = task_name_list
+ fetch_list.append(scheduled_lr.name)
+
+ time_begin = time.time()
+ outputs = train_exe.run(fetch_list=fetch_list)
+ if outputs:
+ print("feed_queue size", train_pyreader.queue.size())
+ progress_file = data_reader.get_progress()
+ epoch = progress_file["current_epoch"]
+ current_file_index = progress_file["current_file_index"]
+ total_file = progress_file["total_file"]
+ current_file = progress_file["current_file"]
+ print(
+ "epoch: %d, progress: %d/%d, step: %d, loss: %f, "
+ "acc: %f"
+ % (epoch, current_file_index, total_file, steps,
+ outputs[0][0],
+ outputs[1][0]))
+ print("steps:", steps)
+ print("save_steps:", args.save_steps)
+
+ np_lr = outputs[-1:]
+
+ date_str = datetime.datetime.now().strftime("%Y%m%d %H:%M:%S")
+
+ np_lr = float(np.mean(np_lr[0]))
+ print("%s current learning_rate:%.8f" % (date_str, np_lr))
+
+ if steps % args.save_steps == 0:
+ save_path = os.path.join(args.checkpoints, "step_" + str(steps))
+ print("save_path:", save_path)
+ fluid.io.save_persistables(exe, save_path, train_program)
+ time_end = time.time()
+ used_time = time_end - time_begin
+ time_end = time_begin
+ print("used_time:", used_time)
+ except fluid.core.EOFException:
+ train_pyreader.reset()
+ break
+
+
+if __name__ == '__main__':
+ print_arguments(args)
+ main(args)
+
diff --git a/ernie-vil/model/__init__.py b/ernie-vil/model/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/ernie-vil/model/ernie_vil.py b/ernie-vil/model/ernie_vil.py
new file mode 100644
index 0000000000000000000000000000000000000000..13b53097898e4c01416f12105dd0421ed72bd5e1
--- /dev/null
+++ b/ernie-vil/model/ernie_vil.py
@@ -0,0 +1,288 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""ERNIE-ViL model"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import json
+
+import six
+import paddle.fluid as fluid
+
+from model.vl_transformer_encoder import encoder, pre_process_layer
+
+
+class ErnieVilConfig(object):
+ """
+ configuration for ernie-vil
+ """
+ def __init__(self, config_path):
+ self._config_dict = self._parse(config_path)
+
+ def _parse(self, config_path):
+ try:
+ with open(config_path) as json_file:
+ config_dict = json.load(json_file)
+ except Exception:
+ raise IOError("Error in parsing Ernie model config file '%s'" %
+ config_path)
+ else:
+ return config_dict
+
+ def __getitem__(self, key):
+ return self._config_dict[key]
+
+ def print_config(self):
+ """
+ print configuration value
+ """
+ for arg, value in sorted(six.iteritems(self._config_dict)):
+ print('%s: %s' % (arg, value))
+ print('------------------------------------------------')
+
+
+class ErnieVilModel(object):
+ """
+ main class for ERNIE-ViL model
+ """
+ def __init__(self,
+ src_ids,
+ position_ids,
+ sentence_ids,
+ task_ids,
+ input_mask,
+ image_embeddings,
+ image_loc,
+ input_image_mask,
+ config,
+ predict_feature=False,
+ predict_class=True,
+ use_attr=False,
+ use_soft_label=True):
+
+ self._emb_size = config['hidden_size']
+ self._n_layer = config['num_hidden_layers']
+ self._n_head = config['num_attention_heads']
+
+ self._v_head = config['v_num_attention_heads']
+ self._v_emb_size = config['v_hidden_size']
+ self._v_inter_hid = config['v_intermediate_size']
+
+ self._co_head = config['co_num_attention_heads']
+ self._co_emb_size = config['co_hidden_size']
+ self._co_inter_hid = config['co_intermediate_size']
+
+ self._voc_size = config['vocab_size']
+ self._class_size = config['class_size']
+ self._class_attr_size = config['class_attr_size']
+ self._max_position_seq_len = config['max_position_embeddings']
+ self._sent_types = config['sent_type_vocab_size']
+ self._task_types = config['task_type_vocab_size']
+ self._hidden_act = config['hidden_act']
+ self._prepostprocess_dropout = config['hidden_dropout_prob']
+ self._attention_dropout = config['attention_probs_dropout_prob']
+ self._v_biattention_id = config['v_biattention_id']
+ self._t_biattention_id = config['t_biattention_id']
+
+ self._predict_feature = predict_feature
+ self._predict_class = predict_class
+ self._use_attr = use_attr
+ self._use_soft_label = use_soft_label
+ self._word_emb_name = "word_embedding"
+ self._pos_emb_name = "pos_embedding"
+ self._sent_emb_name = "sent_embedding"
+ self._image_emb_name = "image_embedding"
+ self._loc_emb_name = "loc_embedding"
+ self._dtype = "float32"
+ self._emb_dtype = "float32"
+
+        # Initialize all weights by the truncated normal initializer, and all biases
+ # will be initialized by constant zero by default.
+ self._param_initializer = fluid.initializer.TruncatedNormal(
+ scale=config['initializer_range'])
+
+ self._build_model(src_ids, position_ids, sentence_ids, task_ids, input_mask, \
+ image_embeddings, image_loc, input_image_mask)
+
+ def _build_model(self, src_ids, position_ids, sentence_ids, task_ids, input_mask, \
+ image_embeddings, image_loc, input_image_mask):
+ # padding id in vocabulary must be set to 0
+ emb_out = fluid.layers.embedding(
+ input=src_ids,
+ size=[self._voc_size, self._emb_size],
+ dtype=self._emb_dtype,
+ param_attr=fluid.ParamAttr(
+ name=self._word_emb_name, initializer=self._param_initializer),
+ is_sparse=False)
+
+ position_emb_out = fluid.layers.embedding(
+ input=position_ids,
+ size=[self._max_position_seq_len, self._emb_size],
+ dtype=self._emb_dtype,
+ param_attr=fluid.ParamAttr(
+ name=self._pos_emb_name, initializer=self._param_initializer))
+
+ sent_emb_out = fluid.layers.embedding(
+ sentence_ids,
+ size=[self._sent_types, self._emb_size],
+ dtype=self._emb_dtype,
+ param_attr=fluid.ParamAttr(
+ name=self._sent_emb_name, initializer=self._param_initializer))
+
+ emb_out = emb_out + position_emb_out
+ emb_out = emb_out + sent_emb_out
+
+ emb_out = pre_process_layer(
+ emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
+
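+        # Build an additive attention bias from the padding mask: valid-valid
+        # positions get 0 and any position involving padding gets -10000, which
+        # is added to the attention logits before the softmax.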
+ self_attn_mask = fluid.layers.matmul(
+ x=input_mask, y=input_mask, transpose_y=True)
+
+ self_attn_mask = fluid.layers.scale(
+ x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
+ n_head_self_attn_mask = fluid.layers.stack(
+ x=[self_attn_mask] * self._n_head, axis=1)
+ n_head_self_attn_mask.stop_gradient = True
+
+ image_embeddings = fluid.layers.fc(image_embeddings,
+ self._v_emb_size,
+ param_attr=fluid.ParamAttr(
+ name="image_emb.w_0",
+ initializer=self._param_initializer),
+ bias_attr = "image_emb.b_0",
+ num_flatten_dims = 2)
+ loc_emb_out = fluid.layers.fc(image_loc,
+ self._v_emb_size,
+ param_attr=fluid.ParamAttr(
+ name="image_loc.w_0",
+ initializer=self._param_initializer),
+ bias_attr = "image_loc.b_0",
+ num_flatten_dims = 2)
+
+ emb_vl_out = image_embeddings + loc_emb_out
+ emb_vl_out = pre_process_layer(
+ emb_vl_out, 'nd', self._prepostprocess_dropout, name='vl_pre_encoder')
+
+ self_attn_image_mask = fluid.layers.matmul(
+ x=input_image_mask, y=input_image_mask, transpose_y=True)
+
+ self_attn_image_mask = fluid.layers.scale(
+ x=self_attn_image_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
+ n_head_self_attn_image_mask = fluid.layers.stack(
+ x=[self_attn_image_mask] * self._v_head, axis=1)
+ n_head_self_attn_image_mask.stop_gradient = True
+
+ self_attn_vl_mask = fluid.layers.matmul(
+ x=input_image_mask, y=input_mask, transpose_y=True)
+ self_attn_vl_mask = fluid.layers.scale(
+ x=self_attn_vl_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
+ n_head_self_attn_vl_mask = fluid.layers.stack(
+ x=[self_attn_vl_mask] * self._co_head, axis=1)
+ n_head_self_attn_vl_mask.stop_gradient = True
+
+ self._enc_out, self._enc_vl_out = encoder(
+ enc_input=emb_out,
+ enc_vl_input=emb_vl_out,
+ attn_bias=n_head_self_attn_mask,
+ attn_image_bias=n_head_self_attn_image_mask,
+ attn_vl_bias=n_head_self_attn_vl_mask,
+ n_layer=self._n_layer,
+ n_head=self._n_head,
+ d_key=self._emb_size // self._n_head,
+ d_value=self._emb_size // self._n_head,
+ d_model=self._emb_size,
+ d_inner_hid=self._emb_size * 4,
+ v_head=self._v_head,
+ v_key=self._v_emb_size // self._v_head,
+ v_value=self._v_emb_size // self._v_head,
+ v_model=self._v_emb_size,
+ v_inner_hid=self._v_inter_hid,
+ co_head=self._co_head,
+ co_key=self._co_emb_size // self._co_head,
+ co_value=self._co_emb_size // self._co_head,
+ co_model=self._co_emb_size,
+ co_inner_hid=self._co_inter_hid,
+ prepostprocess_dropout=self._prepostprocess_dropout,
+ attention_dropout=self._attention_dropout,
+ relu_dropout=0,
+ hidden_act=self._hidden_act,
+ preprocess_cmd="",
+ postprocess_cmd="dan",
+ param_initializer=self._param_initializer,
+ v_biattention_id = self._v_biattention_id,
+ t_biattention_id = self._t_biattention_id,
+ name='encoder')
+
+ def get_sequence_output(self):
+ """
+ Return sequence output of all text and img tokens
+ """
+ return self._enc_out, self._enc_vl_out
+
+ def get_pooled_output(self):
+ """
+ Get the first feature of each sequence for classification
+ """
+ text_cls_feat = fluid.layers.slice(
+ input=self._enc_out, axes=[1], starts=[0], ends=[1])
+
+ text_cls_feat = fluid.layers.cast(
+ x=text_cls_feat, dtype=self._emb_dtype)
+
+ text_cls_feat = fluid.layers.fc(
+ input=text_cls_feat,
+ size=self._co_emb_size,
+ act="relu",
+ param_attr=fluid.ParamAttr(
+ name="pooled_fc_text.w_0", initializer=self._param_initializer),
+ bias_attr="pooled_fc_text.b_0")
+
+ image_cls_feat = fluid.layers.slice(
+ input=self._enc_vl_out, axes=[1], starts=[0], ends=[1])
+
+ image_cls_feat = fluid.layers.cast(
+ x=image_cls_feat, dtype=self._emb_dtype)
+
+ image_cls_feat = fluid.layers.fc(
+ input=image_cls_feat,
+ size=self._co_emb_size,
+ act="relu",
+ param_attr=fluid.ParamAttr(
+ name="pooled_fc_image.w_0", initializer=self._param_initializer),
+ bias_attr="pooled_fc_image.b_0")
+ return text_cls_feat, image_cls_feat
+
+ def get_match_score(self, text, image, dropout_rate=0.0, mode="mul"):
+ """
+ match score for text [cls] and image [img] tokens
+ """
+ if mode == "sum":
+ emb_fuse = text + image
+ elif mode == "mul":
+ emb_fuse = text * image
+ else:
+ "current mode %s is not supported" % mode
+ return
+ if dropout_rate > 0.0:
+
+ emb_fuse = fluid.layers.dropout(emb_fuse,
+ self._attention_dropout,
+ dropout_implementation="upscale_in_train")
+ return emb_fuse
+
+
+
diff --git a/ernie-vil/model/vl_transformer_encoder.py b/ernie-vil/model/vl_transformer_encoder.py
new file mode 100644
index 0000000000000000000000000000000000000000..0a477541d5fb7edb9e5322e76c023fd6cd66197b
--- /dev/null
+++ b/ernie-vil/model/vl_transformer_encoder.py
@@ -0,0 +1,561 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""two-stream Transformer encoder."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from functools import partial
+
+import paddle.fluid as fluid
+import paddle.fluid.layers as layers
+
+
+def multi_head_attention(queries,
+ keys,
+ values,
+ attn_bias,
+ d_key,
+ d_value,
+ d_model,
+ n_head=1,
+ dropout_rate=0.,
+ cache=None,
+ param_initializer=None,
+ name='multi_head_att'):
+ """
+ Multi-Head Attention. Note that attn_bias is added to the logit before
+    computing the softmax activation to mask certain selected positions so that
+    they will not be considered in the attention weights.
+ """
+ keys = queries if keys is None else keys
+ values = keys if values is None else values
+
+ if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
+ raise ValueError(
+ "Inputs: quries, keys and values should all be 3-D tensors.")
+
+ def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
+ """
+ Add linear projection to queries, keys, and values.
+ """
+ q = layers.fc(input=queries,
+ size=d_key * n_head,
+ num_flatten_dims=2,
+ param_attr=fluid.ParamAttr(
+ name=name + '_query_fc.w_0',
+ initializer=param_initializer),
+ bias_attr=name + '_query_fc.b_0')
+ k = layers.fc(input=keys,
+ size=d_key * n_head,
+ num_flatten_dims=2,
+ param_attr=fluid.ParamAttr(
+ name=name + '_key_fc.w_0',
+ initializer=param_initializer),
+ bias_attr=name + '_key_fc.b_0')
+ v = layers.fc(input=values,
+ size=d_value * n_head,
+ num_flatten_dims=2,
+ param_attr=fluid.ParamAttr(
+ name=name + '_value_fc.w_0',
+ initializer=param_initializer),
+ bias_attr=name + '_value_fc.b_0')
+ return q, k, v
+
+ def __split_heads(x, n_head):
+ """
+        Reshape the last dimension of input tensor x so that it becomes two
+ dimensions and then transpose. Specifically, input a tensor with shape
+ [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
+ with shape [bs, n_head, max_sequence_length, hidden_dim].
+ """
+ hidden_size = x.shape[-1]
+ # The value 0 in shape attr means copying the corresponding dimension
+ # size of the input as the output dimension size.
+ reshaped = layers.reshape(
+ x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
+
+        # permute the dimensions into:
+ # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
+ return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
+
+ def __combine_heads(x):
+ """
+        Transpose and then reshape the last two dimensions of input tensor x
+ so that it becomes one dimension, which is reverse to __split_heads.
+ """
+ if len(x.shape) == 3: return x
+ if len(x.shape) != 4:
+ raise ValueError("Input(x) should be a 4-D Tensor.")
+
+ trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
+ # The value 0 in shape attr means copying the corresponding dimension
+ # size of the input as the output dimension size.
+ return layers.reshape(
+ x=trans_x,
+ shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
+ inplace=True)
+
+ def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
+ """
+ Scaled Dot-Product Attention
+ """
+ scaled_q = layers.scale(x=q, scale=d_key ** -0.5)
+ product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
+ if attn_bias:
+ product += attn_bias
+ weights = layers.softmax(product)
+ if dropout_rate:
+ weights = layers.dropout(
+ weights,
+ dropout_prob=dropout_rate,
+ dropout_implementation="upscale_in_train",
+ is_test=False)
+ out = layers.matmul(weights, v)
+ return out
+
+ q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
+
+ if cache is not None: # use cache and concat time steps
+ # Since the inplace reshape in __split_heads changes the shape of k and
+ # v, which is the cache input for next time step, reshape the cache
+ # input from the previous time step first.
+ k = cache["k"] = layers.concat(
+ [layers.reshape(
+ cache["k"], shape=[0, 0, d_model]), k], axis=1)
+ v = cache["v"] = layers.concat(
+ [layers.reshape(
+ cache["v"], shape=[0, 0, d_model]), v], axis=1)
+
+ q = __split_heads(q, n_head)
+ k = __split_heads(k, n_head)
+ v = __split_heads(v, n_head)
+
+ ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
+ dropout_rate)
+
+ out = __combine_heads(ctx_multiheads)
+
+ # Project back to the model size.
+ proj_out = layers.fc(input=out,
+ size=d_model,
+ num_flatten_dims=2,
+ param_attr=fluid.ParamAttr(
+ name=name + '_output_fc.w_0',
+ initializer=param_initializer),
+ bias_attr=name + '_output_fc.b_0')
+ return proj_out
+
+
+def positionwise_feed_forward(x,
+ d_inner_hid,
+ d_hid,
+ dropout_rate,
+ hidden_act,
+ param_initializer=None,
+ name='ffn'):
+ """
+ Position-wise Feed-Forward Networks.
+ This module consists of two linear transformations with a ReLU activation
+ in between, which is applied to each position separately and identically.
+ """
+ hidden = layers.fc(input=x,
+ size=d_inner_hid,
+ num_flatten_dims=2,
+ act=hidden_act,
+ param_attr=fluid.ParamAttr(
+ name=name + '_fc_0.w_0',
+ initializer=param_initializer),
+ bias_attr=name + '_fc_0.b_0')
+ if dropout_rate:
+ hidden = layers.dropout(
+ hidden,
+ dropout_prob=dropout_rate,
+ dropout_implementation="upscale_in_train",
+ is_test=False)
+ out = layers.fc(input=hidden,
+ size=d_hid,
+ num_flatten_dims=2,
+ param_attr=fluid.ParamAttr(
+ name=name + '_fc_1.w_0', initializer=param_initializer),
+ bias_attr=name + '_fc_1.b_0')
+ return out
+
+
+def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
+ name=''):
+ """
+ Add residual connection, layer normalization and dropout to the out
+ tensor, optionally, according to the value of process_cmd.
+ This will be used before or after multi-head attention and position-wise
+ feed-forward networks.
+ """
+ for cmd in process_cmd:
+ if cmd == "a": # add residual connection
+ out = out + prev_out if prev_out else out
+ elif cmd == "n": # add layer normalization
+ out = layers.layer_norm(
+ out,
+ begin_norm_axis=len(out.shape) - 1,
+ param_attr=fluid.ParamAttr(
+ name=name + '_layer_norm_scale',
+ initializer=fluid.initializer.Constant(1.)),
+ bias_attr=fluid.ParamAttr(
+ name=name + '_layer_norm_bias',
+ initializer=fluid.initializer.Constant(0.)))
+ elif cmd == "d": # add dropout
+ if dropout_rate:
+ out = layers.dropout(
+ out,
+ dropout_prob=dropout_rate,
+ dropout_implementation="upscale_in_train",
+ is_test=False)
+ return out
+
+
+pre_process_layer = partial(pre_post_process_layer, None)
+post_process_layer = pre_post_process_layer
+
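+# --- Editor's illustrative sketch ----------------------------------------------
+# process_cmd is interpreted one character at a time: "a" adds the residual,
+# "n" applies layer normalization, "d" applies dropout.  With the defaults used
+# by the encoder below (preprocess_cmd="n", postprocess_cmd="da"), each
+# sub-layer therefore computes  out = x + dropout(sublayer(layer_norm(x))),
+# i.e. a pre-norm residual block.  The helper just prints that ordering.
+def _demo_process_cmd(preprocess_cmd="n", postprocess_cmd="da"):
+    """Print the order of operations implied by the cmd strings."""
+    ops = {"a": "add residual", "n": "layer_norm", "d": "dropout"}
+    print("before sub-layer: %s" % [ops[c] for c in preprocess_cmd])
+    print("after sub-layer:  %s" % [ops[c] for c in postprocess_cmd])
+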
+
+def encoder_co_layer(enc_input,
+ enc_vl_input,
+ attn_vl_bias,
+ co_head,
+ co_key,
+ co_value,
+ co_model,
+ d_model,
+ d_inner_hid,
+ v_model,
+ v_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd="n",
+ postprocess_cmd="da",
+ param_initializer=None,
+ name=''):
+ """
+ Co-attention layer: the language stream attends over the visual stream and the visual stream attends over the language stream.
+ """
+ enc_input_pre = pre_process_layer(
+ enc_input,
+ preprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_pre_att')
+
+ enc_input_vl_pre = pre_process_layer(
+ enc_vl_input,
+ preprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_vl_pre_att')
+
+ attn_output = multi_head_attention(
+ enc_input_pre,
+ enc_input_vl_pre,
+ enc_input_vl_pre,
+ layers.transpose(attn_vl_bias, perm=[0, 1, 3, 2]),
+ co_key,
+ co_value,
+ d_model,
+ co_head,
+ attention_dropout,
+ param_initializer=param_initializer,
+ name=name + '_multi_head_att')
+
+ attn_vl_output = multi_head_attention(
+ enc_input_vl_pre,
+ enc_input_pre,
+ enc_input_pre,
+ attn_vl_bias,
+ co_key,
+ co_value,
+ v_model,
+ co_head,
+ attention_dropout,
+ param_initializer=param_initializer,
+ name=name + '_vl_multi_head_att')
+
+ attn_output = post_process_layer(
+ enc_input,
+ attn_output,
+ postprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_post_att')
+
+ attn_vl_output = post_process_layer(
+ enc_vl_input,
+ attn_vl_output,
+ postprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_vl_post_att')
+
+ ffd_output = positionwise_feed_forward(
+ pre_process_layer(
+ attn_output,
+ preprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_pre_ffn'),
+ d_inner_hid,
+ d_model,
+ relu_dropout,
+ hidden_act,
+ param_initializer=param_initializer,
+ name=name + '_ffn')
+
+ ffd_vl_output = positionwise_feed_forward(
+ pre_process_layer(
+ attn_vl_output,
+ preprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_pre_vl_ffn'),
+ v_inner_hid,
+ v_model,
+ relu_dropout,
+ hidden_act,
+ param_initializer=param_initializer,
+ name=name + '_vl_ffn')
+
+ enc_output = post_process_layer(
+ attn_output,
+ ffd_output,
+ postprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_post_ffn')
+
+ enc_vl_output = post_process_layer(
+ attn_vl_output,
+ ffd_vl_output,
+ postprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_vl_post_ffn')
+
+ return enc_output, enc_vl_output
+
+
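+# Editor's note: in encoder_co_layer above, the two multi_head_attention calls
+# swap the roles of the streams -- the language stream supplies the queries and
+# attends over visual keys/values (using the transposed attn_vl_bias), while
+# the visual stream supplies the queries and attends over language keys/values.
+# Each stream then goes through its own feed-forward block (the d_* sizes for
+# text, the v_* sizes for the visual stream).
+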
+def encoder_layer(enc_input,
+ attn_bias,
+ n_head,
+ d_key,
+ d_value,
+ d_model,
+ d_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd="n",
+ postprocess_cmd="da",
+ param_initializer=None,
+ name=''):
+ """The encoder layers that can be stacked to form a deep encoder.
+ This module consits of a multi-head (self) attention followed by
+ position-wise feed-forward networks and both the two components companied
+ with the post_process_layer to add residual connection, layer normalization
+ and droput.
+ """
+ attn_output = multi_head_attention(
+ pre_process_layer(
+ enc_input,
+ preprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_pre_att'),
+ None,
+ None,
+ attn_bias,
+ d_key,
+ d_value,
+ d_model,
+ n_head,
+ attention_dropout,
+ param_initializer=param_initializer,
+ name=name + '_multi_head_att')
+ attn_output = post_process_layer(
+ enc_input,
+ attn_output,
+ postprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_post_att')
+ ffd_output = positionwise_feed_forward(
+ pre_process_layer(
+ attn_output,
+ preprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_pre_ffn'),
+ d_inner_hid,
+ d_model,
+ relu_dropout,
+ hidden_act,
+ param_initializer=param_initializer,
+ name=name + '_ffn')
+ return post_process_layer(
+ attn_output,
+ ffd_output,
+ postprocess_cmd,
+ prepostprocess_dropout,
+ name=name + '_post_ffn')
+
+
+def encoder(enc_input,
+ enc_vl_input,
+ attn_bias,
+ attn_image_bias,
+ attn_vl_bias,
+ n_layer,
+ n_head,
+ d_key,
+ d_value,
+ d_model,
+ d_inner_hid,
+ v_head,
+ v_key,
+ v_value,
+ v_model,
+ v_inner_hid,
+ co_head,
+ co_key,
+ co_value,
+ co_model,
+ co_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd="n",
+ postprocess_cmd="da",
+ param_initializer=None,
+ v_biattention_id=[0, 1, 2, 3, 4, 5],
+ t_biattention_id=[18, 19, 20, 21, 22, 23],
+ name=''):
+ """
+ The encoder is composed of a stack of layers built by encoder_layer
+ (single-stream self-attention) and encoder_co_layer (cross-stream
+ co-attention), interleaved according to v_biattention_id and t_biattention_id.
+ """
+
+ v_start = 0
+ t_start = 0
+ block = 0
+
+ for v_layer_id, t_layer_id in zip(v_biattention_id, t_biattention_id):
+ v_end = v_layer_id
+ t_end = t_layer_id
+ for idx in range(t_start, t_end):
+ enc_output = encoder_layer(
+ enc_input,
+ attn_bias,
+ n_head,
+ d_key,
+ d_value,
+ d_model,
+ d_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd,
+ postprocess_cmd,
+ param_initializer=param_initializer,
+ name=name + '_layer_' + str(idx))
+ enc_input = enc_output
+
+ for idx in range(v_start, v_end):
+ enc_vl_output = encoder_layer(
+ enc_vl_input,
+ attn_image_bias,
+ v_head,
+ v_key,
+ v_value,
+ v_model,
+ v_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd,
+ postprocess_cmd,
+ param_initializer=param_initializer,
+ name=name + '_vlayer_' + str(idx))
+ enc_vl_input = enc_vl_output
+
+ enc_output, enc_vl_output = encoder_co_layer(
+ enc_input,
+ enc_vl_input,
+ attn_vl_bias,
+ co_head,
+ co_key,
+ co_value,
+ co_model,
+ d_model,
+ d_inner_hid,
+ v_model,
+ v_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd,
+ postprocess_cmd,
+ param_initializer=param_initializer,
+ name=name + '_colayer_' + str(block))
+
+ enc_input, enc_vl_input = enc_output, enc_vl_output
+
+ block += 1
+ v_start = v_end
+ t_start = t_end
+
+ enc_output = encoder_layer(
+ enc_output,
+ attn_bias,
+ n_head,
+ d_key,
+ d_value,
+ d_model,
+ d_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd,
+ postprocess_cmd,
+ param_initializer=param_initializer,
+ name=name + '_layer_' + str(t_end))
+
+ enc_vl_output = encoder_layer(
+ enc_vl_output,
+ attn_image_bias,
+ v_head,
+ v_key,
+ v_value,
+ v_model,
+ v_inner_hid,
+ prepostprocess_dropout,
+ attention_dropout,
+ relu_dropout,
+ hidden_act,
+ preprocess_cmd,
+ postprocess_cmd,
+ param_initializer=param_initializer,
+ name=name + '_vlayer_' + str(v_end))
+
+ enc_output = pre_process_layer(
+ enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
+
+ enc_vl_output = pre_process_layer(
+ enc_vl_output, preprocess_cmd, prepostprocess_dropout, name="vl_post_encoder")
+
+ return enc_output, enc_vl_output
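+
+# --- Editor's illustrative sketch ----------------------------------------------
+# The stacking order in encoder() is easier to see in plain Python.  The helper
+# below reproduces the same scheduling logic and prints, for the default
+# v_biattention_id / t_biattention_id, which text layers, visual layers and
+# co-attention blocks are built, in order.
+def _print_encoder_schedule(v_biattention_id=(0, 1, 2, 3, 4, 5),
+                            t_biattention_id=(18, 19, 20, 21, 22, 23)):
+    """Print the text / visual / co-attention layer schedule."""
+    v_start = t_start = block = 0
+    for v_end, t_end in zip(v_biattention_id, t_biattention_id):
+        print("text layers %s" % list(range(t_start, t_end)))
+        print("visual layers %s" % list(range(v_start, v_end)))
+        print("co-attention block %d" % block)
+        block += 1
+        v_start, t_start = v_end, t_end
+    print("final text layer %d, final visual layer %d" % (t_end, v_end))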
diff --git a/ernie-vil/optim/__init__.py b/ernie-vil/optim/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/ernie-vil/optim/optimization.py b/ernie-vil/optim/optimization.py
new file mode 100644
index 0000000000000000000000000000000000000000..fb27665f9cc6bff8fa8e2febda8c4058c082b18c
--- /dev/null
+++ b/ernie-vil/optim/optimization.py
@@ -0,0 +1,167 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" text preprocess """
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+import paddle.fluid as fluid
+
+def manual_warmup_decay(learning_rate, warmup_steps, num_train_steps, decay_steps=[], lr_decay_ratio=0.1):
+ """
+ Applies linear warmup of the learning rate from 0, then keeps it piecewise
+ constant, multiplying by lr_decay_ratio each time the global step passes a
+ boundary in decay_steps.
+ """
+ with fluid.default_main_program()._lr_schedule_guard():
+ lr = fluid.layers.tensor.create_global_var(
+ shape=[1],
+ value=0.0,
+ dtype='float32',
+ persistable=True,
+ name="scheduled_learning_rate")
+
+ global_step = fluid.layers.learning_rate_scheduler._decay_step_counter(
+ )
+ with fluid.layers.control_flow.Switch() as switch:
+ with switch.case(global_step < warmup_steps):
+ warmup_lr = learning_rate * (global_step / warmup_steps)
+ fluid.layers.tensor.assign(warmup_lr, lr)
+ for i, step in enumerate(decay_steps):
+ with switch.case(global_step < step):
+ # (global_step / global_step) is a tensor of constant value 1.0; multiplying
+ # by it turns the Python float into a Variable that tensor.assign can consume.
+ decayed_lr = learning_rate * (global_step / global_step) * pow(lr_decay_ratio, i)
+ fluid.layers.tensor.assign(decayed_lr, lr)
+ with switch.default():
+ constant_lr = learning_rate * (global_step / global_step) * pow(lr_decay_ratio, len(decay_steps))
+ fluid.layers.tensor.assign(constant_lr, lr)
+
+ return lr
+
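+# --- Editor's illustrative sketch ----------------------------------------------
+# The same schedule computed eagerly in plain Python (toy values), so the shape
+# of the curve is easy to verify without building a fluid program: linear
+# warmup to learning_rate, then a drop by lr_decay_ratio each time the step
+# passes a boundary in decay_steps.
+def _manual_warmup_decay_value(step, learning_rate, warmup_steps,
+                               decay_steps=(), lr_decay_ratio=0.1):
+    """Return the learning rate used at a given global step."""
+    if step < warmup_steps:
+        return learning_rate * float(step) / warmup_steps
+    for i, boundary in enumerate(decay_steps):
+        if step < boundary:
+            return learning_rate * lr_decay_ratio ** i
+    return learning_rate * lr_decay_ratio ** len(decay_steps)
+
+# e.g. _manual_warmup_decay_value(5000, 1e-4, 1000, decay_steps=(3000, 6000))
+# returns 1e-5: one decay has been applied, since 3000 <= 5000 < 6000.
+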
+
+def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
+ """
+ Applies linear warmup of the learning rate from 0, then decays it linearly to 0 over num_train_steps.
+ """
+ with fluid.default_main_program()._lr_schedule_guard():
+ lr = fluid.layers.tensor.create_global_var(
+ shape=[1],
+ value=0.0,
+ dtype='float32',
+ persistable=True,
+ name="scheduled_learning_rate")
+
+ global_step = fluid.layers.learning_rate_scheduler._decay_step_counter(
+ )
+
+ with fluid.layers.control_flow.Switch() as switch:
+ with switch.case(global_step < warmup_steps):
+ warmup_lr = learning_rate * (global_step / warmup_steps)
+ fluid.layers.tensor.assign(warmup_lr, lr)
+ with switch.default():
+ decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
+ learning_rate=learning_rate,
+ decay_steps=num_train_steps,
+ end_learning_rate=0.0,
+ power=1.0,
+ cycle=False)
+ fluid.layers.tensor.assign(decayed_lr, lr)
+
+ return lr
+
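+# --- Editor's illustrative sketch ----------------------------------------------
+# Closed form of the schedule above: the rate rises linearly from 0 to
+# learning_rate over warmup_steps, and after warmup follows polynomial decay
+# with power=1.0 and end_learning_rate=0.0, i.e.
+# learning_rate * (1 - step / num_train_steps).
+def _linear_warmup_decay_value(step, learning_rate, warmup_steps, num_train_steps):
+    """Return the learning rate used at a given global step."""
+    if step < warmup_steps:
+        return learning_rate * float(step) / warmup_steps
+    step = min(step, num_train_steps)
+    return learning_rate * (1.0 - float(step) / num_train_steps)
+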
+def optimization(loss,
+ warmup_steps,
+ num_train_steps,
+ learning_rate,
+ train_program,
+ startup_prog,
+ weight_decay,
+ scheduler='linear_warmup_decay',
+ decay_steps=[],
+ lr_decay_dict_file="",
+ lr_decay_ratio=0.1):
+ """
+ optimization implementation
+ """
+ if warmup_steps > 0:
+ if scheduler == 'noam_decay':
+ scheduled_lr = fluid.layers.learning_rate_scheduler \
+ .noam_decay(1 / (warmup_steps * (learning_rate ** 2)),
+ warmup_steps)
+ elif scheduler == 'linear_warmup_decay':
+ scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps,
+ num_train_steps)
+ elif scheduler == 'manual_warmup_decay':
+ scheduled_lr = manual_warmup_decay(learning_rate, warmup_steps,
+ num_train_steps, decay_steps, lr_decay_ratio)
+ else:
+ raise ValueError("Unkown learning rate scheduler, should be "
+ "'noam_decay' or 'linear_warmup_decay' or 'manual_warmup_decay'")
+ else:
+ scheduled_lr = fluid.layers.create_global_var(
+ name=fluid.unique_name.generate("learning_rate"),
+ shape=[1],
+ value=learning_rate,
+ dtype='float32',
+ persistable=True)
+
+ lr_decay_dict = {}
+ if lr_decay_dict_file != "":
+ with open(lr_decay_dict_file) as f:
+ for line in f:
+ param, decay_rate = line.strip().split('\t')
+ lr_decay_dict[param] = float(decay_rate)
+
+ for param in fluid.default_main_program().block(0).all_parameters():
+ if param.name in lr_decay_dict:
+ print (param.name, lr_decay_dict[param.name])
+ param.optimize_attr['learning_rate'] = lr_decay_dict[param.name]
+
+ optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
+ optimizer._learning_rate_map[fluid.default_main_program(
+ )] = scheduled_lr
+
+
+ fluid.clip.set_gradient_clip(
+ clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0))
+
+ def exclude_from_weight_decay(name):
+ """
+ Parameters whose name matches are excluded from weight decay
+ """
+ if name.find("layer_norm") > -1:
+ return True
+ bias_suffix = ["_bias", "_b", ".b_0"]
+ for suffix in bias_suffix:
+ if name.endswith(suffix):
+ return True
+ return False
+
+ param_list = dict()
+
+ for param in train_program.global_block().all_parameters():
+ param_list[param.name] = param * 1.0
+ param_list[param.name].stop_gradient = True
+
+ _, param_grads = optimizer.minimize(loss)
+
+ if weight_decay > 0:
+ for param, grad in param_grads:
+ if exclude_from_weight_decay(param.name):
+ continue
+ with param.block.program._optimized_guard(
+ [param, grad]), fluid.framework.name_scope("weight_decay"):
+ updated_param = param - param_list[
+ param.name] * weight_decay * scheduled_lr * param.optimize_attr['learning_rate']
+ fluid.layers.assign(output=param, input=updated_param)
+
+ return scheduled_lr
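+
+# --- Editor's illustrative sketch ----------------------------------------------
+# The weight-decay loop above implements decoupled weight decay: Adam updates
+# the parameter from the gradient only, and afterwards each non-excluded
+# parameter is shrunk by  weight_decay * scheduled_lr * (its per-parameter lr
+# multiplier), using the value snapshotted in param_list *before* the Adam
+# step.  A one-parameter sketch of that update rule:
+def _demo_decoupled_weight_decay(param, adam_update, weight_decay, lr, lr_mult=1.0):
+    """Return the parameter value after an Adam step plus decoupled decay."""
+    param_before = param                  # snapshot, cf. param_list above
+    param = param + adam_update           # whatever the Adam step produced
+    param = param - param_before * weight_decay * lr * lr_mult
+    return param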
diff --git a/ernie-vil/preprocess/__init__.py b/ernie-vil/preprocess/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/ernie-vil/preprocess/preprocessor.py b/ernie-vil/preprocess/preprocessor.py
new file mode 100755
index 0000000000000000000000000000000000000000..0cc0a80139d7bbaad98df8c99352cc95c6f5bec4
--- /dev/null
+++ b/ernie-vil/preprocess/preprocessor.py
@@ -0,0 +1,46 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" text preprocess """
+
+import random
+import sys
+import os
+import base64
+import numpy as np
+
+reload(sys)
+sys.setdefaultencoding("utf-8")
+
+from preprocess import tokenization
+
+class PreprocessorBasic(object):
+ """
+ Main class for text preprocess
+ """
+ def __init__(self,
+ tokenizer_name,
+ vocab_path,
+ tagger_path="",
+ nltk_data_path="",
+ do_lower_case=True):
+ self.do_lower_case = do_lower_case
+ self.tokenizer = getattr(tokenization, tokenizer_name)(vocab_file=vocab_path, do_lower_case=do_lower_case)
+ self.vocab = self.tokenizer.vocab
+
+ def convert_sentence_to_ids_without_cls(self, sentence):
+ """
+ Convert sentence to ids without cls
+ """
+ tokens = self.tokenizer.tokenize(sentence)
+ ids = self.tokenizer.convert_tokens_to_ids(tokens)
+ return ids
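+
+# --- Editor's illustrative usage ------------------------------------------------
+# A minimal sketch of how the readers use this class; the vocab path is a
+# placeholder, and "FullTokenizer" is one of the tokenizer classes defined in
+# preprocess/tokenization.py.  No [CLS]/[SEP] ids are added here -- the reader
+# inserts those special tokens itself.
+def _demo_convert_sentence(vocab_path):
+    """Tokenize one sentence with a FullTokenizer-backed preprocessor."""
+    processor = PreprocessorBasic(tokenizer_name="FullTokenizer",
+                                  vocab_path=vocab_path)
+    return processor.convert_sentence_to_ids_without_cls("a man riding a horse")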
diff --git a/ernie-vil/preprocess/tokenization.py b/ernie-vil/preprocess/tokenization.py
new file mode 100644
index 0000000000000000000000000000000000000000..a661203259b61b6db061158ed91580c0a18af2bd
--- /dev/null
+++ b/ernie-vil/preprocess/tokenization.py
@@ -0,0 +1,467 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" tokenization implemnet """
+
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import collections
+import unicodedata
+import six
+from functools import reduce
+
+def convert_to_unicode(text):
+ """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
+ if six.PY3:
+ if isinstance(text, str):
+ return text
+ elif isinstance(text, bytes):
+ return text.decode("utf-8", "ignore")
+ else:
+ raise ValueError("Unsupported string type: %s" % (type(text)))
+ elif six.PY2:
+ if isinstance(text, str):
+ return text.decode("utf-8", "ignore")
+ elif isinstance(text, unicode):
+ return text
+ else:
+ raise ValueError("Unsupported string type: %s" % (type(text)))
+ else:
+ raise ValueError("Not running on Python2 or Python 3?")
+
+
+def printable_text(text):
+ """Returns text encoded in a way suitable for print or `tf.logging`."""
+
+ # These functions want `str` for both Python2 and Python3, but in one case
+ # it's a Unicode string and in the other it's a byte string.
+ if six.PY3:
+ if isinstance(text, str):
+ return text
+ elif isinstance(text, bytes):
+ return text.decode("utf-8", "ignore")
+ else:
+ raise ValueError("Unsupported string type: %s" % (type(text)))
+ elif six.PY2:
+ if isinstance(text, str):
+ return text
+ elif isinstance(text, unicode):
+ return text.encode("utf-8")
+ else:
+ raise ValueError("Unsupported string type: %s" % (type(text)))
+ else:
+ raise ValueError("Not running on Python2 or Python 3?")
+
+
+def load_vocab(vocab_file):
+ """Loads a vocabulary file into a dictionary."""
+ vocab = collections.OrderedDict()
+ fin = open(vocab_file)
+ for num, line in enumerate(fin):
+ items = convert_to_unicode(line.strip()).split("\t")
+ if len(items) > 2:
+ break
+ token = items[0]
+ index = items[1] if len(items) == 2 else num
+ token = token.strip()
+ vocab[token] = int(index)
+ return vocab
+
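+# --- Editor's illustrative sketch ----------------------------------------------
+# load_vocab accepts either "token<TAB>id" lines or one bare token per line (in
+# which case the line number becomes the id).  A toy round-trip through a
+# temporary file:
+def _demo_load_vocab():
+    """Write a two-line vocab file and load it back."""
+    import os
+    import tempfile
+    fd, path = tempfile.mkstemp()
+    with os.fdopen(fd, "w") as f:
+        f.write("[PAD]\t0\n[UNK]\t1\n")
+    vocab = load_vocab(path)
+    os.remove(path)
+    print(vocab["[PAD]"], vocab["[UNK]"])   # 0 1
+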
+
+def convert_by_vocab(vocab, items):
+ """Converts a sequence of [tokens|ids] using the vocab."""
+ output = []
+ for item in items:
+ output.append(vocab[item])
+ return output
+
+
+def convert_tokens_to_ids(vocab, tokens):
+ """
+ Converts tokens to ids
+ """
+ return convert_by_vocab(vocab, tokens)
+
+
+def convert_ids_to_tokens(inv_vocab, ids):
+ """
+ Converts ids to tokens
+ """
+ return convert_by_vocab(inv_vocab, ids)
+
+
+def whitespace_tokenize(text):
+ """Runs basic whitespace cleaning and splitting on a peice of text."""
+ text = text.strip()
+ if not text:
+ return []
+ tokens = text.split()
+ return tokens
+
+
+class FullTokenizer(object):
+ """Runs end-to-end tokenziation."""
+
+ def __init__(self, vocab_file, do_lower_case=True):
+ self.vocab = load_vocab(vocab_file)
+ self.inv_vocab = {v: k for k, v in self.vocab.items()}
+ self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
+ self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
+
+ def tokenize(self, text):
+ """
+ turn text into tokens
+ """
+ split_tokens = []
+ for token in self.basic_tokenizer.tokenize(text):
+ for sub_token in self.wordpiece_tokenizer.tokenize(token):
+ split_tokens.append(sub_token)
+
+ return split_tokens
+
+ def tokenize_case(self, text):
+ """
+ Tokenize text and return, for each word piece, whether the original word was title-cased
+ """
+ split_tokens = []
+ case_indexs = []
+ basic_tokens, case_index = self.basic_tokenizer.tokenize_case(text)
+ case_indexs += case_index
+ case_indexs = [[i] for i in case_indexs]
+
+ for token_index, token in enumerate(basic_tokens):
+ wordpiece_tokens = self.wordpiece_tokenizer.tokenize(token)
+ if len(wordpiece_tokens) > 1:
+ case_indexs[token_index] = case_indexs[token_index]*(len(wordpiece_tokens))
+ for sub_token in wordpiece_tokens:
+ split_tokens.append(sub_token)
+
+ if case_indexs:
+ case_indexs = reduce(lambda x, y: x + y, case_indexs)
+ return split_tokens, case_indexs
+
+ def convert_tokens_to_ids(self, tokens):
+ """
+ Converts tokens to ids
+ """
+ return convert_by_vocab(self.vocab, tokens)
+
+ def convert_ids_to_tokens(self, ids):
+ """
+ Converts ids to tokens
+ """
+ return convert_by_vocab(self.inv_vocab, ids)
+
+
+class CharTokenizer(object):
+ """Runs end-to-end tokenziation."""
+
+ def __init__(self, vocab_file, do_lower_case=True):
+ self.vocab = load_vocab(vocab_file)
+ self.inv_vocab = {v: k for k, v in self.vocab.items()}
+ self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
+
+ def tokenize(self, text):
+ """
+ Convert text to tokens
+ """
+ split_tokens = []
+ for token in text.lower().split(" "):
+ for sub_token in self.wordpiece_tokenizer.tokenize(token):
+ split_tokens.append(sub_token)
+
+ return split_tokens
+
+ def convert_tokens_to_ids(self, tokens):
+ """
+ Convert tokens to ids
+ """
+ return convert_by_vocab(self.vocab, tokens)
+
+ def convert_ids_to_tokens(self, ids):
+ """
+ Convert ids to tokens
+ """
+ return convert_by_vocab(self.inv_vocab, ids)
+
+
+class BasicTokenizer(object):
+ """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
+
+ def __init__(self, do_lower_case=True):
+ """Constructs a BasicTokenizer.
+
+ Args:
+ do_lower_case: Whether to lower case the input.
+ """
+ self.do_lower_case = do_lower_case
+
+ def tokenize(self, text):
+ """Tokenizes a piece of text."""
+ text = convert_to_unicode(text)
+ text = self._clean_text(text)
+
+ # This was added on November 1st, 2018 for the multilingual and Chinese
+ # models. This is also applied to the English models now, but it doesn't
+ # matter since the English models were not trained on any Chinese data
+ # and generally don't have any Chinese data in them (there are Chinese
+ # characters in the vocabulary because Wikipedia does have some Chinese
+ # words in the English Wikipedia.).
+ text = self._tokenize_chinese_chars(text)
+
+ orig_tokens = whitespace_tokenize(text)
+ split_tokens = []
+ for token in orig_tokens:
+ if self.do_lower_case:
+ token = token.lower()
+ token = self._run_strip_accents(token)
+ split_tokens.extend(self._run_split_on_punc(token))
+
+ output_tokens = whitespace_tokenize(" ".join(split_tokens))
+ return output_tokens
+
+ def tokenize_case(self, text):
+ """
+ Tokenize text and record whether each output token's source word was title-cased
+ """
+ text = convert_to_unicode(text)
+ text = self._clean_text(text)
+ text = self._tokenize_chinese_chars(text)
+
+ orig_tokens = whitespace_tokenize(text)
+ split_tokens = []
+ case_index = []
+
+ for token in orig_tokens:
+ if self.do_lower_case:
+ if token.istitle():
+ case_index.append(1)
+ else:
+ case_index.append(0)
+ token = token.lower()
+ token = self._run_strip_accents(token)
+ if token == '':
+ case_index.pop()
+
+ tmpsplit_tokens, case_index = self._run_split_on_punc_case(token, case_index)
+ split_tokens.extend(tmpsplit_tokens)
+
+ output_tokens = whitespace_tokenize(" ".join(split_tokens))
+ return output_tokens, case_index
+
+ def _run_strip_accents(self, text):
+ """Strips accents from a piece of text."""
+ text = unicodedata.normalize("NFD", text)
+ output = []
+ for char in text:
+ cat = unicodedata.category(char)
+ if cat == "Mn":
+ continue
+ output.append(char)
+ return "".join(output)
+
+ def _run_split_on_punc(self, text):
+ """Splits punctuation on a piece of text."""
+ chars = list(text)
+ i = 0
+ start_new_word = True
+ output = []
+ while i < len(chars):
+ char = chars[i]
+ if _is_punctuation(char):
+ output.append([char])
+ start_new_word = True
+ else:
+ if start_new_word:
+ output.append([])
+ start_new_word = False
+ output[-1].append(char)
+ i += 1
+
+ return ["".join(x) for x in output]
+
+ def _run_split_on_punc_case(self, text, case_index):
+ """Splits punctuation on a piece of text."""
+ chars = list(text)
+ i = 0
+ start_new_word = True
+ output = []
+
+ while i < len(chars):
+ char = chars[i]
+ if _is_punctuation(char):
+ output.append([char])
+ start_new_word = True
+ else:
+ if start_new_word:
+ output.append([])
+ start_new_word = False
+ output[-1].append(char)
+ i += 1
+
+ if len(output) > 1:
+ case_index.extend([case_index[-1]]*(len(output)-1))
+
+ return ["".join(x) for x in output], case_index
+
+ def _tokenize_chinese_chars(self, text):
+ """Adds whitespace around any CJK character."""
+ output = []
+ for char in text:
+ cp = ord(char)
+ if self._is_chinese_char(cp):
+ output.append(" ")
+ output.append(char)
+ output.append(" ")
+ else:
+ output.append(char)
+ return "".join(output)
+
+ def _is_chinese_char(self, cp):
+ """Checks whether CP is the codepoint of a CJK character."""
+ # This defines a "chinese character" as anything in the CJK Unicode block:
+ # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
+ #
+ # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
+ # despite its name. The modern Korean Hangul alphabet is a different block,
+ # as is Japanese Hiragana and Katakana. Those alphabets are used to write
+ # space-separated words, so they are not treated specially and handled
+ # like all of the other languages.
+ if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
+ (cp >= 0x3400 and cp <= 0x4DBF) or #
+ (cp >= 0x20000 and cp <= 0x2A6DF) or #
+ (cp >= 0x2A700 and cp <= 0x2B73F) or #
+ (cp >= 0x2B740 and cp <= 0x2B81F) or #
+ (cp >= 0x2B820 and cp <= 0x2CEAF) or
+ (cp >= 0xF900 and cp <= 0xFAFF) or #
+ (cp >= 0x2F800 and cp <= 0x2FA1F)): #
+ return True
+
+ return False
+
+ def _clean_text(self, text):
+ """Performs invalid character removal and whitespace cleanup on text."""
+ output = []
+ for char in text:
+ cp = ord(char)
+ if cp == 0 or cp == 0xfffd or _is_control(char):
+ continue
+ if _is_whitespace(char):
+ output.append(" ")
+ else:
+ output.append(char)
+ return "".join(output)
+
+
+class WordpieceTokenizer(object):
+ """Runs WordPiece tokenziation."""
+
+ def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
+ self.vocab = vocab
+ self.unk_token = unk_token
+ self.max_input_chars_per_word = max_input_chars_per_word
+
+ def tokenize(self, text):
+ """Tokenizes a piece of text into its word pieces.
+
+ This uses a greedy longest-match-first algorithm to perform tokenization
+ using the given vocabulary.
+
+ For example:
+ input = "unaffable"
+ output = ["un", "##aff", "##able"]
+
+ Args:
+ text: A single token or whitespace separated tokens. This should have
+ already been passed through `BasicTokenizer`.
+
+ Returns:
+ A list of wordpiece tokens.
+ """
+
+ text = convert_to_unicode(text)
+
+ output_tokens = []
+ for token in whitespace_tokenize(text):
+ chars = list(token)
+ if len(chars) > self.max_input_chars_per_word:
+ output_tokens.append(self.unk_token)
+ continue
+
+ is_bad = False
+ start = 0
+ sub_tokens = []
+ while start < len(chars):
+ end = len(chars)
+ cur_substr = None
+ while start < end:
+ substr = "".join(chars[start:end])
+ if start > 0:
+ substr = "##" + substr
+ if substr in self.vocab:
+ cur_substr = substr
+ break
+ end -= 1
+ if cur_substr is None:
+ is_bad = True
+ break
+ sub_tokens.append(cur_substr)
+ start = end
+
+ if is_bad:
+ output_tokens.append(self.unk_token)
+ else:
+ output_tokens.extend(sub_tokens)
+ return output_tokens
+
+
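+# --- Editor's illustrative sketch ----------------------------------------------
+# The greedy longest-match-first behaviour documented above, demonstrated with
+# a toy in-memory vocabulary (no vocab file required).
+def _demo_wordpiece():
+    """Show how "unaffable" is split into word pieces."""
+    toy_vocab = {"un": 0, "##aff": 1, "##able": 2, "[UNK]": 3}
+    tokenizer = WordpieceTokenizer(vocab=toy_vocab)
+    print(tokenizer.tokenize("unaffable"))   # ['un', '##aff', '##able']
+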
+def _is_whitespace(char):
+ """Checks whether `chars` is a whitespace character."""
+ # \t, \n, and \r are technically control characters but we treat them
+ # as whitespace since they are generally considered as such.
+ if char == " " or char == "\t" or char == "\n" or char == "\r":
+ return True
+ cat = unicodedata.category(char)
+ if cat == "Zs":
+ return True
+ return False
+
+
+def _is_control(char):
+ """Checks whether `chars` is a control character."""
+ # These are technically control characters but we count them as whitespace
+ # characters.
+ if char == "\t" or char == "\n" or char == "\r":
+ return False
+ cat = unicodedata.category(char)
+ if cat.startswith("C"):
+ return True
+ return False
+
+
+def _is_punctuation(char):
+ """Checks whether `chars` is a punctuation character."""
+ cp = ord(char)
+ # We treat all non-letter/number ASCII as punctuation.
+ # Characters such as "^", "$", and "`" are not in the Unicode
+ # Punctuation class but we treat them as punctuation anyways, for
+ # consistency.
+ if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
+ (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
+ return True
+ cat = unicodedata.category(char)
+ if cat.startswith("P"):
+ return True
+ return False
diff --git a/ernie-vil/reader/__init__.py b/ernie-vil/reader/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/ernie-vil/reader/_image_features_reader.py b/ernie-vil/reader/_image_features_reader.py
new file mode 100644
index 0000000000000000000000000000000000000000..2866bef90e806d14066faf9b2a17faa72834df7a
--- /dev/null
+++ b/ernie-vil/reader/_image_features_reader.py
@@ -0,0 +1,79 @@
+"""
+Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+"""
+import numpy as np
+import copy
+import pickle
+import lmdb # install lmdb by "pip install lmdb"
+import base64
+
+class ImageFeaturesH5Reader(object):
+ """
+ Reader class
+ """
+ def __init__(self, features_path):
+ self.features_path = features_path
+ self.env = lmdb.open(self.features_path, max_readers=1, readonly=True,
+ lock=False, readahead=False, meminit=False)
+
+ with self.env.begin(write=False) as txn:
+ self._image_ids = pickle.loads(txn.get('keys'.encode()))
+
+ self.features = [None] * len(self._image_ids)
+ self.num_boxes = [None] * len(self._image_ids)
+ self.boxes = [None] * len(self._image_ids)
+ self.boxes_ori = [None] * len(self._image_ids)
+
+ def __len__(self):
+ return len(self._image_ids)
+
+ def __getitem__(self, image_id):
+ image_id = str(image_id).encode()
+ index = self._image_ids.index(image_id)
+ # Read the record from LMDB every time (features are not cached in memory).
+ with self.env.begin(write=False) as txn:
+ item = pickle.loads(txn.get(image_id))
+ image_id = item['image_id']
+ image_h = int(item['image_h'])
+ image_w = int(item['image_w'])
+ num_boxes = int(item['num_boxes'])
+
+ features = np.frombuffer(base64.b64decode(item["features"]), dtype=np.float32).reshape(num_boxes, 2048)
+ boxes = np.frombuffer(base64.b64decode(item['boxes']), dtype=np.float32).reshape(num_boxes, 4)
+ g_feat = np.sum(features, axis=0) / num_boxes
+ num_boxes = num_boxes + 1
+ features = np.concatenate([np.expand_dims(g_feat, axis=0), features], axis=0)
+ image_location = np.zeros((boxes.shape[0], 5), dtype=np.float32)
+ image_location[:, :4] = boxes
+ image_location[:, 4] = (image_location[:, 3] - image_location[:, 1]) * \
+ (image_location[:, 2] - image_location[:, 0]) / (float(image_w) * float(image_h))
+
+ image_location_ori = copy.deepcopy(image_location)
+ image_location[:, 0] = image_location[:, 0] / float(image_w)
+ image_location[:, 1] = image_location[:, 1] / float(image_h)
+ image_location[:, 2] = image_location[:, 2] / float(image_w)
+ image_location[:, 3] = image_location[:, 3] / float(image_h)
+
+ g_location = np.array([0, 0, 1, 1, 1])
+ image_location = np.concatenate([np.expand_dims(g_location, axis=0), image_location], axis=0)
+
+ g_location_ori = np.array([0, 0, image_w, image_h, image_w * image_h])
+ image_location_ori = np.concatenate([np.expand_dims(g_location_ori, axis=0), image_location_ori], axis=0)
+
+ data_json = {"features": features,
+ "num_boxes": num_boxes,
+ "image_location": image_location,
+ "image_location_ori": image_location_ori
+ }
+ return data_json
+
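+# --- Editor's illustrative sketch ----------------------------------------------
+# What __getitem__ does to each detected box, in isolation: a pixel box
+# (x1, y1, x2, y2) becomes the 5-d location
+# (x1/W, y1/H, x2/W, y2/H, box_area / image_area), and a whole-image "global"
+# box (0, 0, 1, 1, 1) is prepended, matching the mean-pooled global feature.
+def _demo_box_to_location():
+    """Convert one toy box for a 100x50 image and print the 5-d location."""
+    image_w, image_h = 100.0, 50.0
+    x1, y1, x2, y2 = 10.0, 5.0, 60.0, 25.0
+    area_frac = (x2 - x1) * (y2 - y1) / (image_w * image_h)
+    location = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h, area_frac]
+    print(location)   # [0.1, 0.1, 0.6, 0.5, 0.2]
+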
diff --git a/ernie-vil/reader/vcr_finetuning.py b/ernie-vil/reader/vcr_finetuning.py
new file mode 100644
index 0000000000000000000000000000000000000000..78345572f6390864590ae9b989c0c25dc50eccb8
--- /dev/null
+++ b/ernie-vil/reader/vcr_finetuning.py
@@ -0,0 +1,473 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" VCR Data Reader implementation """
+
+from __future__ import print_function
+from __future__ import division
+
+import os
+import base64
+import numpy as np
+import re
+import random
+import json
+import json_lines
+import csv
+import sys
+import itertools
+
+from reader._image_features_reader import ImageFeaturesH5Reader
+from preprocess import preprocessor
+from batching.finetune_batching import prepare_batch_data
+
+import paddle.fluid as fluid
+
+def _converId(img_id):
+ """
+ Convert an image id string such as "train-123" into a unique integer id
+ """
+ img_id = img_id.split('-')
+ if 'train' in img_id[0]:
+ new_id = int(img_id[1])
+ elif 'val' in img_id[0]:
+ new_id = int(img_id[1]) + 1000000
+ elif 'test' in img_id[0]:
+ new_id = int(img_id[1]) + 2000000
+ else:
+ raise ValueError("unknown split in image id: %s" % "-".join(img_id))
+ return new_id
+
+
+def _load_annotationsQ_A(annotations_jsonpath, split):
+ """
+ Build an index of VCR Q->A annotations, mapping each image ID to its question, answer choices and label.
+ """
+ entries = []
+ with open(annotations_jsonpath) as f:
+ for annotation in json_lines.reader(f):
+ det_names = ""
+ question = annotation["question"]
+ if split == 'test':
+ ans_label = 0
+ else:
+ ans_label = annotation["answer_label"]
+ img_id = _converId(annotation["img_id"])
+ anno_id = int(annotation["annot_id"].split('-')[1])
+ entries.append(
+ {"question": question,
+ "answers": annotation["answer_choices"],
+ "metadata_fn": annotation["metadata_fn"],
+ "target": ans_label,
+ "img_id": img_id,
+ "anno_id": anno_id,
+ "det_names": annotation['objects']
+ })
+ return entries
+
+
+def _load_annotationsQA_R(annotations_jsonpath, split):
+ """
+ Build an index of VCR QA->R annotations, mapping each image ID to its question plus answer, rationale choices and label.
+ """
+ entries = []
+ with open(annotations_jsonpath, 'rb') as f:
+ for annotation in json_lines.reader(f):
+ if split == 'test':
+ for answer in annotation["answer_choices"]:
+ question = annotation["question"] + ["[MARK]"] + answer
+ img_id = _converId(annotation["img_id"])
+ ans_label = 0
+ anno_id = int(annotation["annot_id"].split('-')[1])
+ entries.append(
+ {"question": question,
+ "answers": annotation["rationale_choices"],
+ "metadata_fn": annotation["metadata_fn"],
+ "target": ans_label,
+ "img_id": img_id,
+ "anno_id": anno_id,
+ "det_names": annotation['objects']
+ })
+ else:
+ det_names = ""
+ question = annotation["question"] + ["[MARK]"] + \
+ annotation["answer_choices"][annotation['answer_label']]
+ ans_label = annotation["rationale_label"]
+ img_id = _converId(annotation["img_id"])
+ anno_id = int(annotation["annot_id"].split('-')[1])
+ entries.append(
+ {"question": question,
+ "answers": annotation["rationale_choices"],
+ "metadata_fn": annotation["metadata_fn"],
+ "target": ans_label,
+ "img_id": img_id,
+ "anno_id": anno_id,
+ "det_names": annotation['objects']})
+ return entries
+
+
+class VCRDataReader(object):
+ """
+ Data reader for sub VCR task
+ """
+ def __init__(self,
+ task_conf,
+ split,
+ vocab_path=None,
+ batch_size=4096,
+ shuffle=True,
+ epoch=100,
+ is_test=False,
+ feature_reader_dict={},
+ random_seed=None,
+ task_index=0,
+ task_num=1):
+
+ self.task_conf = task_conf
+ self.processor = getattr(preprocessor,
+ task_conf["Proprocessor"])(tokenizer_name=self.task_conf["tokenizer_name"],
+ vocab_path=vocab_path)
+ self.vocab = self.processor.vocab
+ self.batch_size = batch_size
+ self.shuffle = shuffle
+ self.epoch = epoch
+ self.current_epoch = 0
+ self.current_file_index = 0
+ self.total_file = 0
+ self.current_file = None
+ self.random_seed = random_seed
+ self.max_seq_len = self.task_conf['max_seq_len']
+ self.pad_id = self.vocab["[PAD]"]
+ self.cls_id = self.vocab["[CLS]"]
+ self.sep_id = self.vocab["[SEP]"]
+ self.mask_id = self.vocab["[MASK]"]
+ self.is_test = is_test
+ self.task_index = task_index
+ self.task_num = task_num
+
+ if self.is_test:
+ self.epoch = 1
+ self.shuffle_files = False
+ if self.shuffle:
+ shufflekeep_across_task = self.task_conf.get('shufflekeep_across_task', True)
+ if shufflekeep_across_task:
+ self.global_rng = np.random.RandomState(random_seed)
+ else:
+ self.global_rng = np.random.RandomState()
+ self.shuffle_every_epoch = self.task_conf.get('shuffle_every_epoch', False)
+ task=self.task_conf['task']
+ annotations_jsonpath=self.task_conf['annotations_jsonpath_' + split]
+ self.num_choice = int(self.task_conf['num_choice'])
+ if task == 'VCR_Q-A':
+ self._entries = _load_annotationsQ_A(annotations_jsonpath, split)
+ elif task == "VCR_QA-R":
+ self._entries = _load_annotationsQA_R(annotations_jsonpath, split)
+ else:
+ assert False
+ self._split = split
+ self._names = []
+ with open(self.task_conf['unisex_names_table']) as csv_file:
+ csv_reader = csv.reader(csv_file, delimiter=',')
+ for row in csv_reader:
+ if row[1] != 'name':
+ self._names.append(row[1])
+ self._feature_reader = feature_reader_dict[self.task_conf['feature_lmdb_path']]
+ self.use_gt_fea = task_conf.get('use_gt_fea', False)
+ if self.use_gt_fea:
+ self._gt_feature_reader = feature_reader_dict[self.task_conf['gt_feature_lmdb_path']]
+ self._max_region_num = self.task_conf.get('max_region_num', 100)
+ print("use gt featurre")
+ else:
+ self._max_region_num = self.task_conf.get('max_region_num', 37)
+ print("only butd feature")
+ self.tokenize()
+
+ def generate_random_name(self, det_names):
+ """
+ Replace "person" with a random name
+ """
+ random_name = []
+ for name in det_names:
+ if name == 'person':
+ word = random.choice(self._names)
+ else:
+ word = name
+ random_name.append(word)
+
+ return random_name
+
+ def replace_det_with_name(self, inputs, random_names):
+ """
+ Replace detection-tag index lists with the corresponding (randomized) object names
+ """
+ tokens = []
+ mask = []
+ for w in inputs:
+ if isinstance(w, list):
+ for idx in w:
+ word = random_names[idx]
+ tokens.append(word)
+ else:
+ word = w.encode('utf-8')
+ tokens.append(word)
+
+ return tokens, mask
+
+ def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
+ """
+ Truncates a sequence pair in place to the maximum length.
+ """
+ while True:
+ total_length = len(tokens_a) + len(tokens_b)
+ if total_length <= max_length:
+ break
+ if len(tokens_a) > len(tokens_b):
+ tokens_a.pop()
+ else:
+ tokens_b.pop()
+
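+    # Editor's note: e.g. with max_length=5, tokens_a=[1, 2, 3, 4] and
+    # tokens_b=[5, 6, 7] are trimmed to [1, 2, 3] and [5, 6]; the currently
+    # longer sequence loses a token from its tail at each step.
+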
+ def get_progress(self):
+ """
+ Return current progress of the training data
+ """
+ progress_dict = {"current_epoch": self.current_epoch,
+ "current_file_index": self.current_file_index,
+ "total_file": self.total_file,
+ "current_file": self.current_file
+ }
+ return progress_dict
+
+ def tokenize(self):
+ """
+ Tokenizes the captions.
+ """
+ # This will add caption_tokens in each entry of the dataset.
+ # -1 represents nil, and should be treated as padding_idx in embedding.
+ count = 0
+ for entry in self._entries:
+ det_names = entry["det_names"]
+ random_names = self.generate_random_name(det_names)
+ # replace with name
+ tokens_a, mask_a = self.replace_det_with_name(entry["question"], random_names)
+ q_str = " ".join(tokens_a)
+ ids_a = []
+ for i, q in enumerate(q_str.split(" [MARK] ")):
+ if i == 1:
+ ids_a.append(self.vocab["[SEP]"])
+ ids_a = ids_a + self.processor.convert_sentence_to_ids_without_cls(q)
+
+ input_ids_all = []
+ segment_ids_all = []
+ input_poss_all = []
+ input_len_all = []
+
+ for answer in entry["answers"]:
+ tokens_b, mask_b = self.replace_det_with_name(answer, random_names)
+ ids_b = self.processor.convert_sentence_to_ids_without_cls(" ".join(tokens_b))
+
+ self._truncate_seq_pair(ids_a, ids_b, self.max_seq_len - 3)
+
+ input_ids = []
+ segment_ids = []
+ input_ids.append(self.vocab["[CLS]"])
+ segment_ids.append(0)
+
+ for id in ids_a:
+ input_ids.append(id)
+ segment_ids.append(0)
+
+ input_ids.append(self.vocab["[SEP]"])
+ segment_ids.append(0)
+
+ assert len(ids_b) > 0
+ for id in ids_b:
+ input_ids.append(id)
+ segment_ids.append(1)
+ input_ids.append(self.vocab["[SEP]"])
+ segment_ids.append(1)
+
+ input_ids_all.append(input_ids)
+ segment_ids_all.append(segment_ids)
+ input_poss = [str(pos) for pos in range(len(input_ids))]
+ input_poss_all.append(input_poss)
+ input_len_all.append(len(input_ids))
+
+ entry["input_ids"] = input_ids_all
+ entry["input_poss"] = input_poss_all
+ entry["segment_ids"] = segment_ids_all
+ entry["input_lens"] = input_len_all
+
+ sys.stdout.write('%d/%d\r' % (count, len(self._entries)))
+ sys.stdout.flush()
+ count += 1
+
+ def parse_line(self, s_index):
+ """
+ Form slot info with the line information
+ """
+ entry = self._entries[s_index]
+ image_id = entry["img_id"]
+ image_fea_json = self._feature_reader[image_id]
+ features = image_fea_json["features"]
+ num_boxes = image_fea_json["num_boxes"]
+ boxes = image_fea_json["image_location"]
+ if not self.use_gt_fea:
+ num_boxes = min(num_boxes, self._max_region_num)
+ boxes = boxes[:num_boxes]
+ features = features[:num_boxes]
+ else:
+ boxes = boxes[:num_boxes]
+ features = features[:num_boxes]
+ image_fea_json = self._gt_feature_reader[image_id]
+ gt_features = image_fea_json["features"]
+ gt_num_boxes = image_fea_json["num_boxes"]
+ gt_boxes = image_fea_json["image_location"]
+ features[0] = (features[0] * num_boxes + gt_features[0] * gt_num_boxes) / (num_boxes + gt_num_boxes)
+
+ gt_boxes = gt_boxes[1: gt_num_boxes]
+ gt_features = gt_features[1: gt_num_boxes]
+ gt_num_boxes = gt_num_boxes - 1
+
+ gt_box_preserve = min(self._max_region_num - 1, gt_num_boxes)
+ gt_boxes = gt_boxes[:gt_box_preserve]
+ gt_features = gt_features[:gt_box_preserve]
+ gt_num_boxes = gt_box_preserve
+
+ num_box_preserve = min(self._max_region_num - int(gt_num_boxes), int(num_boxes))
+ boxes = boxes[:num_box_preserve]
+ features = features[:num_box_preserve]
+
+ # concatenate the boxes
+ mix_boxes = np.concatenate((boxes, gt_boxes), axis=0)
+ mix_features = np.concatenate((features, gt_features), axis=0)
+ mix_num_boxes = num_box_preserve + int(gt_num_boxes)
+
+ num_boxes = min(mix_num_boxes, self._max_region_num)
+ boxes = mix_boxes[:num_boxes]
+ features = mix_features[:num_boxes]
+ record = {
+ "input_ids": entry["input_ids"],
+ "input_pos": entry["input_poss"],
+ "segment_ids": entry["segment_ids"],
+ "input_lens": entry["input_lens"],
+ "target": int(entry["target"]),
+ "features": features,
+ "boxes": boxes,
+ "anno_id": entry["anno_id"]
+ }
+ return record
+
+ def data_generator(self):
+ """
+ Data_generator
+ """
+ sample_indice = list(range(len(self._entries)))
+ def wrapper():
+ """
+ Wrapper
+ """
+ for epoch_index in range(self.epoch):
+ if self._split == "train":
+ self.current_example = 0
+ self.current_epoch = epoch_index
+ if self.shuffle:
+ if epoch_index == 0:
+ self.global_rng.shuffle(sample_indice)
+ print("shuffle epoch %d" % epoch_index)
+ elif self.shuffle_every_epoch:
+ self.global_rng.shuffle(sample_indice)
+ print("shuffle epoch %d" % epoch_index)
+ batch_records = []
+ for index in sample_indice:
+ batch_records.append(self.parse_line(index))
+ if len(batch_records) == self.batch_size:
+ yield prepare_batch_data(
+ batch_records, self.num_choice, self.pad_id, \
+ self.task_index, self.task_num), self.task_conf['task']
+ batch_records = []
+ if len(batch_records) > 0:
+ yield prepare_batch_data(
+ batch_records, self.num_choice, self.pad_id, \
+ self.task_index, self.task_num), self.task_conf['task']
+ return wrapper
+
+
+class VCRDataJointReader(object):
+ """
+ Joint data reader for Q2A task and QA2R task
+ """
+ def __init__(self,
+ task_conf_group,
+ split,
+ batch_size=4096,
+ shuffle=True,
+ epoch=100,
+ vocab_path=None,
+ is_test=False):
+
+ self.task_readers = []
+ feature_reader_dict = {}
+ self.task_dup_cnt = []
+ for task_conf in task_conf_group:
+ if 'feature_lmdb_path' in task_conf:
+ if task_conf['feature_lmdb_path'] not in feature_reader_dict:
+ feature_reader_dict[task_conf['feature_lmdb_path']] = \
+ ImageFeaturesH5Reader(task_conf['feature_lmdb_path'])
+ if 'gt_feature_lmdb_path' in task_conf and task_conf.get('use_gt_fea', False):
+ if task_conf['gt_feature_lmdb_path'] not in feature_reader_dict:
+ feature_reader_dict[task_conf['gt_feature_lmdb_path']] = \
+ ImageFeaturesH5Reader(task_conf['gt_feature_lmdb_path'])
+ task_batch_size = task_conf.get('batch_size', 64)
+ self.task_dup_cnt.append(max(int(task_batch_size / batch_size), 1))
+ random_seed=np.random.randint(1000)
+ for task_index, task_conf in enumerate(task_conf_group):
+ self.task_readers.append(VCRDataReader(task_conf, split, vocab_path, batch_size, shuffle,
+ epoch, is_test, feature_reader_dict, random_seed, task_index, len(task_conf_group)))
+ self.task_generators = [reader.data_generator() for reader in self.task_readers]
+
+ def get_progress(self):
+ """
+ Return current progress of the training data
+ """
+ current_epoch = max([reader.current_epoch for reader in self.task_readers])
+ current_file_index = max([reader.current_file_index for reader in self.task_readers])
+ total_file = max([reader.total_file for reader in self.task_readers])
+ current_file = ""
+ self.progress_dict = {"current_epoch": current_epoch,
+ "current_file_index": current_file_index,
+ "total_file": total_file,
+ "current_file": current_file
+ }
+ return self.progress_dict
+
+ def data_generator(self):
+ """
+ Data_generator
+ """
+ def wrapper():
+ """
+ wrapper
+ """
+ task_buffer = [[] for i in range(len(self.task_dup_cnt))]
+ for data in itertools.izip(*[generator() for generator in self.task_generators]):
+ for i, d in enumerate(data):
+ task_buffer[i].append(d)
+ if len(task_buffer[i]) >= self.task_dup_cnt[i]:
+ for t in task_buffer[i]:
+ yield t[0]
+ task_buffer[i] = []
+
+ return wrapper
+
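+# --- Editor's illustrative sketch ----------------------------------------------
+# task_dup_cnt controls how many consecutive batches each sub-task contributes
+# before the joint reader moves on to the next one; it is derived above as
+# max(task_batch_size // batch_size, 1) from each task's config.  A toy
+# reproduction of that round-robin with plain lists:
+def _demo_joint_schedule():
+    """Interleave two toy task streams the way wrapper() above does."""
+    task_generators = [iter(["qa-%d" % i for i in range(3)]),
+                       iter(["qar-%d" % i for i in range(3)])]
+    task_dup_cnt = [1, 1]
+    task_buffer = [[] for _ in task_dup_cnt]
+    for data in zip(*task_generators):
+        for i, d in enumerate(data):
+            task_buffer[i].append(d)
+            if len(task_buffer[i]) >= task_dup_cnt[i]:
+                for t in task_buffer[i]:
+                    print(t)   # qa-0, qar-0, qa-1, qar-1, ...
+                task_buffer[i] = []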
+
+if __name__ == "__main__":
+ pass
diff --git a/ernie-vil/requirements.txt b/ernie-vil/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..525c143d4ad74c9a883b6f7767cd73207006de7b
--- /dev/null
+++ b/ernie-vil/requirements.txt
@@ -0,0 +1,8 @@
+nltk==3.2.4
+numpy==1.14.3
+scipy==1.2.1
+six==1.11.0
+json_lines==0.5.0
+lmdb==0.97
+opencv-python==3.2.0.8
+paddlepaddle-gpu==1.8.3.post97
diff --git a/ernie-vil/run_finetuning.sh b/ernie-vil/run_finetuning.sh
new file mode 100644
index 0000000000000000000000000000000000000000..7807240fd41f66c9e203c7384ddd7c34eb845b4f
--- /dev/null
+++ b/ernie-vil/run_finetuning.sh
@@ -0,0 +1,59 @@
+set -eu
+set -x
+
+#bash -x ./env.sh
+
+TASK_NAME=$1
+CONF_FILE=$2
+VOCAB_PATH=$3
+ERNIE_VIL_CONFIG=$4
+PRETRAIN_MODELS=$5
+
+source $CONF_FILE
+
+#configure your cuda and cudnn
+#configure nccl
+
+export FLAGS_fast_eager_deletion_mode=1
+export FLAGS_eager_delete_tensor_gb=0.0
+export FLAGS_fraction_of_gpu_memory_to_use=0.98
+
+e_executor=$(echo ${use_experimental_executor-'True'} | tr '[A-Z]' '[a-z]')
+
+use_fuse=$(echo ${use_fuse-'False'} | tr '[A-Z]' '[a-z]')
+if [[ ${use_fuse} == "true" ]]; then
+ export FLAGS_fuse_parameter_memory_size=131072
+ export FLAGS_fuse_parameter_groups_size=10
+fi
+
+
+TASK_GROUP_JSON=./conf/$TASK_NAME/task_${TASK_NAME}.json
+
+gpu_cnt=`echo $CUDA_VISIBLE_DEVICES | awk -F"\t" '{len=split($0,vec,",");print len}'`
+echo "gpu_cnt", $gpu_cnt
+python finetune.py --use_cuda "True" \
+ --is_distributed "False" \
+ --use_fast_executor ${e_executor-"True"} \
+ --nccl_comm_num ${nccl_comm_num:-"1"} \
+ --batch_size $((BATCH_SIZE/gpu_cnt)) \
+ --do_train "True" \
+ --do_test "False" \
+ --task_name ${TASK_NAME} \
+ --vocab_path ${VOCAB_PATH} \
+ --task_group_json ${TASK_GROUP_JSON} \
+ --lr_scheduler ${lr_scheduler} \
+ --decay_steps ${decay_steps-""} \
+ --lr_decay_ratio ${lr_decay_ratio-0.1} \
+ --num_train_steps ${num_train_steps} \
+ --checkpoints $output_model_path \
+ --save_steps ${SAVE_STEPS} \
+ --init_checkpoint ${PRETRAIN_MODELS} \
+ --ernie_config_path ${ERNIE_VIL_CONFIG} \
+ --learning_rate ${LR_RATE} \
+ --warmup_steps ${WARMUP_STEPS} \
+ --weight_decay ${WEIGHT_DECAY:-0} \
+ --max_seq_len ${MAX_LEN} \
+ --validation_steps ${VALID_STEPS} \
+ --skip_steps 10
+
+
diff --git a/ernie-vil/run_inference.sh b/ernie-vil/run_inference.sh
new file mode 100644
index 0000000000000000000000000000000000000000..63893286fec7c88d44f13b00db0787052da20037
--- /dev/null
+++ b/ernie-vil/run_inference.sh
@@ -0,0 +1,48 @@
+set -eu
+
+#bash -x ./env.sh
+
+TASK_NAME=$1
+SUB_TASK_NAME=$2
+TEST_SPLIT=$3
+CONF_FILE=$4
+VOCAB_PATH=$5
+ERNIE_VIL_CONFIG=$6
+MODEL_PATH=$7
+RES_FILE=$8
+
+source $CONF_FILE
+
+#configure your cuda and cudnn
+#configure nccl
+
+export FLAGS_eager_delete_tensor_gb=2.0
+export FLAGS_fraction_of_gpu_memory_to_use=0.01
+export FLAGS_sync_nccl_allreduce=1
+
+e_executor=$(echo ${use_experimental_executor-'True'} | tr '[A-Z]' '[a-z]')
+
+use_fuse=$(echo ${use_fuse-'False'} | tr '[A-Z]' '[a-z]')
+if [[ ${use_fuse} == "true" ]]; then
+ export FLAGS_fuse_parameter_memory_size=131072
+ export FLAGS_fuse_parameter_groups_size=10
+fi
+
+TASK_GROUP_JSON=./conf/$TASK_NAME/task_${TASK_NAME}_${SUB_TASK_NAME}.json
+
+python finetune.py --use_cuda "True" \
+ --use_fast_executor ${e_executor-"True"} \
+ --batch_size ${BATCH_SIZE} \
+ --do_train "False" \
+ --do_test "True" \
+ --test_split ${TEST_SPLIT} \
+ --task_name $TASK_NAME \
+ --vocab_path ${VOCAB_PATH} \
+ --task_group_json ${TASK_GROUP_JSON} \
+ --result_file "$RES_FILE" \
+ --init_checkpoint "$MODEL_PATH" \
+ --ernie_config_path ${ERNIE_VIL_CONFIG} \
+ --max_seq_len ${MAX_LEN} \
+ --skip_steps 10
+
+
diff --git a/ernie-vil/utils/__init__.py b/ernie-vil/utils/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/ernie-vil/utils/args.py b/ernie-vil/utils/args.py
new file mode 100644
index 0000000000000000000000000000000000000000..a88528a8ae3ff42df932f62502e649d62e82e1b2
--- /dev/null
+++ b/ernie-vil/utils/args.py
@@ -0,0 +1,61 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Arguments for configuration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import six
+import argparse
+
+
+def str2bool(v):
+ """
+ because argparse does not support parsing strings such as "True"/"False"
+ into Python booleans directly
+ """
+ return v.lower() in ("true", "t", "1")
+
+
+class ArgumentGroup(object):
+ """
+ group of arguments
+ """
+ def __init__(self, parser, title, des):
+ self._group = parser.add_argument_group(title=title, description=des)
+
+ def add_arg(self, name, type, default, help, positional_arg=False, **kwargs):
+ """
+ add arg
+ """
+ prefix = "" if positional_arg else "--"
+ type = str2bool if type == bool else type
+ self._group.add_argument(
+ prefix + name,
+ default=default,
+ type=type,
+ help=help + ' Default: %(default)s.',
+ **kwargs)
+
+
+def print_arguments(args):
+ """
+ Arguments print function
+ """
+ print('----------- Configuration Arguments -----------')
+ for arg, value in sorted(six.iteritems(vars(args))):
+ print('%s: %s' % (arg, value))
+ print('------------------------------------------------')
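+
+# --- Editor's illustrative usage ------------------------------------------------
+# A minimal sketch of how these helpers are combined; the group and flag names
+# here are placeholders, not the actual arguments of finetune.py.
+def _demo_args():
+    """Build a parser with one argument group and print the parsed defaults."""
+    parser = argparse.ArgumentParser()
+    demo_g = ArgumentGroup(parser, "demo", "demo options.")
+    demo_g.add_arg("use_cuda", bool, True, "Run on GPU or not.")
+    demo_g.add_arg("batch_size", int, 64, "Examples per batch.")
+    args = parser.parse_args([])   # parse defaults only
+    print_arguments(args)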
diff --git a/ernie-vil/utils/init.py b/ernie-vil/utils/init.py
new file mode 100644
index 0000000000000000000000000000000000000000..faadca1b15a38b04122754ee41be35fd2430848c
--- /dev/null
+++ b/ernie-vil/utils/init.py
@@ -0,0 +1,71 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""parameters init function implementations"""
+
+
+from __future__ import print_function
+
+import os
+import six
+
+import numpy as np
+import paddle.fluid as fluid
+
+
+def init_checkpoint(exe, init_checkpoint_path, main_program):
+ """
+ init checkpoint params with lr and step info
+ """
+ assert os.path.exists(
+ init_checkpoint_path), "[%s] cann't be found." % init_checkpoint_path
+ def existed_persitables(var):
+ """
+ Check whether the variable is persistable and exists in the checkpoint dir
+ """
+ if not fluid.io.is_persistable(var):
+ return False
+ return os.path.exists(os.path.join(init_checkpoint_path, var.name))
+
+ fluid.io.load_vars(
+ exe,
+ init_checkpoint_path,
+ main_program=main_program,
+ predicate=existed_persitables)
+ print("Load model from {}".format(init_checkpoint_path))
+
+
+def init_pretraining_params(exe, pretraining_params_path, main_program):
+ """
+ init pretraining params without lr and step info
+ """
+ assert os.path.exists(pretraining_params_path
+ ), "[%s] cann't be found." % pretraining_params_path
+
+ def existed_params(var):
+ """
+ Check existed params
+ """
+ if not isinstance(var, fluid.framework.Parameter):
+ return False
+ return os.path.exists(os.path.join(pretraining_params_path, var.name))
+
+ fluid.io.load_vars(
+ exe,
+ pretraining_params_path,
+ main_program=main_program,
+ predicate=existed_params)
+ print("Load pretraining parameters from {}.".format(
+ pretraining_params_path))
+
diff --git a/requirements.txt b/requirements.txt
index e267a7738bb8f9067254bd7fe11fd992b8018504..9c9d2bc707935b6b0cad95f511d7130ce31d9a5c 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -6,4 +6,5 @@ scipy==1.2.1
six==1.11.0
sklearn==0.0
sentencepiece==0.1.8
+opencv-python==3.4.2.17
paddlepaddle-gpu==1.6.3.post107