Unverified commit f6628ee8, authored by T tangjiji, committed by GitHub

Repro (#628)

* add other tasks

* Update README_zh.md

* Update README.md

* Update README.md
Parent 69a9e2fa
...@@ -5,6 +5,11 @@ English| [简体中文](./README_zh.md)
- [Pre-trained models](#pre-trained-models)
- [Downstream tasks](#downstream-tasks)
* [VCR](#VCR)
* [VQA](#VQA)
* [IR&TR](#Retrieval)
* [RefCOCO+](#RefCOCO+)
- [Usage](#usage)
* [Install PaddlePaddle](#install-paddlepaddle)
* [Fine-tuning on ERNIE-ViL](#fine-tuning-on-ernie-vil)
...@@ -43,11 +48,15 @@ Based on the scene graph parsed from the text using Scene Graph Parser, we const
## Pre-trained Models
ERNIE-ViL adopts large-scale image-text aligned datasets as the pre-training data. We provide ERNIE-ViL models of two scale settings, pretrained on two out-of-domain datasets, i.e., [**Conceptual Captions**](https://www.aclweb.org/anthology/P18-1238.pdf) and [**SBU Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio).
- [**ERNIE-ViL _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-en.1.tar.gz) (_lowercased | 12-text-stream-layer, 6-visual-stream-layer_)
- [**ERNIE-ViL _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)

We also provide a large-scale model pretrained on both the out-of-domain datasets ([**Conceptual Captions**](https://www.aclweb.org/anthology/P18-1238.pdf), [**SBU Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio)) and the in-domain datasets ([**MS-COCO**](https://arxiv.org/abs/1405.0312), [**Visual-Genome**](https://arxiv.org/abs/1602.07332)).
- [**ERNIE-ViL-Out&in-domain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-all-domain-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)
## Downstream tasks
We finetune ERNIE-ViL on five vision-language downstream tasks, i.e., Visual Commonsense Reasoning ([**VCR**](https://openaccess.thecvf.com/content_CVPR_2019/papers/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.pdf)),
Visual Question Answering ([**VQA**](https://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf)),
...@@ -59,7 +68,7 @@ _Code and pre-trained models related to VCR task are made public now, and those
### VCR
* datasets
* The training, validation and testing data of the **VCR** task are provided by the [**VCR Website**](https://visualcommonsense.com/download/).
* The organization of visual features is adapted from [**ViLBERT**](https://github.com/jiasenlu/vilbert_beta); we directly use its data, which can be downloaded [here](https://github.com/jiasenlu/vilbert_beta/tree/master/data).
* Put all downloaded files under the directory "data/vcr".
...@@ -67,19 +76,79 @@ _Code and pre-trained models related to VCR task are made public now, and those
* Task pre-training: We perform task pre-training on the VCR task, which is also known as task-specific pre-training. The trained models are as follows:
* [**ERNIE-ViL-VCR-task-pretrain _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-VCR-task-pre-en.1.tar.gz)
* [**ERNIE-ViL-VCR-task-pretrain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-VCR-task-pre-en.1.tar.gz)
* Performance: Results of the VCR task for different scale settings of the ERNIE-ViL model

| Models                             | <strong>Q->A</strong>        | <strong>QA->R</strong>       | <strong>Q->AR</strong>        |
| :--------------------------------- | :--------------------------: | :--------------------------: | :---------------------------: |
| ERNIE-ViL (task-pretrain) _base_   | 76.37(77.0)                  | 79.65(80.3)                  | 61.24(62.1)                   |
| ERNIE-ViL (task-pretrain) _large_  | <strong>78.52(79.2)</strong> | <strong>83.37(83.5)</strong> | <strong>65.81(66.3)</strong>  |

_Numerical results outside and inside parentheses represent the dev and test performance of the VCR task, respectively.
Test results are obtained from the [**VCR leaderboard**](https://visualcommonsense.com/leaderboard/)._
### VQA
* datasets
* The training, validation and testing data of the **VQA** task are provided by the [**VQA Website**](https://visualqa.org/).
* Visual features are extracted with the tools in [bottom-up attention](https://github.com/jiasenlu/bottom-up-attention). The number of extracted boxes is fixed at 100 (the minimum and maximum are both 100).
* A single training & test sample is organized as follows:
```script
question_id, question, answer_label, answer_score, image_w, image_h, number_box, image_loc, image_embeddings
```
_The labels and scores of multiple answers are separated by the character '|' (see the parsing sketch below)._
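To make the record layout concrete, the sketch below parses one such record into a question id and a soft answer-target vector. It is only an illustration: the tab separator and the helper name are assumptions, and the 3129-way answer vocabulary follows the `num_class` setting in conf/vqa/task_vqa.json.

```python
import numpy as np

NUM_ANSWER_CLASSES = 3129  # "num_class" in conf/vqa/task_vqa.json

def parse_vqa_record(line):
    """Turn one tab-separated VQA record into (question_id, question, soft target vector)."""
    (question_id, question, answer_label, answer_score,
     image_w, image_h, number_box, image_loc, image_embeddings) = line.rstrip("\n").split("\t")
    target = np.zeros(NUM_ANSWER_CLASSES, dtype="float32")
    if answer_label:
        labels = [int(x) for x in answer_label.split("|")]
        scores = [float(x) for x in answer_score.split("|")]
        for label, score in zip(labels, scores):
            target[label] = score  # soft label, e.g. 0.3/0.6/0.9/1.0 agreement scores
    return int(question_id), question, target
```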
* Performance: Results of the **VQA** task for different scale settings of the ERNIE-ViL model

| Models                            | <strong>test-dev</strong> | <strong>test-std</strong> |
| :-------------------------------- | :-----------------------: | :-----------------------: |
| ERNIE-ViL _base_                  | 73.18                     | 73.36                     |
| ERNIE-ViL _large_                 | 73.78                     | 73.96                     |
| ERNIE-ViL-Out&in-domain _large_   | 74.95                     | 75.10                     |
### IR&TR
* datasets
* The images and captions of the Flickr30k dataset can be obtained from [**here**](https://www.kaggle.com/hsankesara/flickr-image-dataset).
* Visual features are extracted with the tools in [bottom-up attention](https://github.com/jiasenlu/bottom-up-attention). The number of extracted boxes ranges from 0 to 36. The organization of the visual features is illustrated as follows (see the decoding sketch after this list):
```script
image_w, image_h, number_box, image_loc, image_embeddings
```
* For the organization of the text data, refer to the sample we provide under data/flickr, e.g., flickr.dev.data.
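For reference, the sketch below shows one way such an image record could be decoded; it mirrors the decode_all logic of the Flickr data reader added in this commit (base64-encoded float32 buffers, boxes normalized by image width/height plus a relative-area term). The helper names and the 2048-dim RoI feature assumption (the default feature_size) are illustrative only.

```python
import base64
import numpy as np

def decode_feature(b64_str, size):
    """Decode a base64-encoded float32 buffer into a [size, -1] array."""
    buf = np.frombuffer(base64.b64decode(b64_str), dtype=np.float32)
    return buf.reshape(int(size), -1)

def decode_image_record(image_w, image_h, number_box, image_loc, image_embeddings):
    n = int(number_box)
    features = decode_feature(image_embeddings, n)            # [n, feature_size], 2048 by default
    boxes = decode_feature(image_loc, n)                      # [n, 4] pixel coordinates (x1, y1, x2, y2)
    w, h = float(image_w), float(image_h)
    boxes = boxes / np.array([w, h, w, h], dtype=np.float32)  # normalize to [0, 1]
    area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
    loc = np.concatenate([boxes, area[:, None]], axis=1)      # 5-dim location per box
    return features, loc
```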
* Performance
* Results of the **Image Retrieval** task on the **Flickr30k dataset** for different scale settings of the ERNIE-ViL model

| Models                            | <strong>R@1</strong>  | <strong>R@5</strong>  | <strong>R@10</strong> |
| :-------------------------------- | :-------------------: | :-------------------: | :-------------------: |
| ERNIE-ViL _base_                  | 74.44                 | 92.72                 | 95.94                 |
| ERNIE-ViL _large_                 | 75.10                 | 93.42                 | 96.26                 |
| ERNIE-ViL-Out&in-domain _large_   | 76.66                 | 94.16                 | 96.76                 |

* Results of the **Text Retrieval** task on the **Flickr30k dataset** for different scale settings of the ERNIE-ViL model

| Models                            | <strong>R@1</strong>  | <strong>R@5</strong>  | <strong>R@10</strong> |
| :-------------------------------- | :-------------------: | :-------------------: | :-------------------: |
| ERNIE-ViL _base_                  | 86.70                 | 97.80                 | 99.00                 |
| ERNIE-ViL _large_                 | 88.70                 | 97.30                 | 99.10                 |
| ERNIE-ViL-Out&in-domain _large_   | 89.20                 | 98.50                 | 99.20                 |
### RefCOCO+
* datasets
* The organization of visual features is adapted from [MAttNet](https://github.com/lichengunc/MAttNet).
* A single training & test sample is organized as follows (a scoring sketch is given after this list):
```script
expressions, image_w, image_h, number_box, number_boxes_gt, image_loc, image_embeddings, box_label, label
```
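As a rough illustration of how these fields could be used for evaluation, the sketch below computes accuracy from per-box model scores, assuming box_label marks the candidate boxes that match the referred region (e.g., IoU above 0.5 with the ground truth); this is a hypothetical helper, not the evaluation script shipped with the repo.

```python
import numpy as np

def refcoco_accuracy(scores, box_labels):
    """scores, box_labels: [num_examples, num_boxes]; box_labels is 1 for boxes matching the target."""
    pred = np.argmax(scores, axis=1)                   # highest-scoring candidate box per expression
    hits = box_labels[np.arange(len(pred)), pred]      # 1 if the chosen box matches the referred region
    return float(hits.mean())
```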
* Performance
* Results of the **RefCOCO+** task for different scale settings of the ERNIE-ViL model

| Models                            | <strong>val</strong>  | <strong>testA</strong> | <strong>testB</strong> |
| :-------------------------------- | :-------------------: | :--------------------: | :--------------------: |
| ERNIE-ViL _base_                  | 74.02                 | 80.33                  | 64.74                  |
| ERNIE-ViL _large_                 | 74.24                 | 80.97                  | 64.70                  |
| ERNIE-ViL-Out&in-domain _large_   | 75.89                 | 82.39                  | 66.91                  |
## Usage
...@@ -92,32 +161,61 @@ This code has been tested with Paddle Fluid 1.8 with Python 2.7. Other dependenc
### Fine-tuning on ERNIE-ViL
Please add the CUDA, cuDNN and NCCL2 dynamic library paths to LD_LIBRARY_PATH before fine-tuning. You can easily run fine-tuning through
configuration files. You can fine-tune the ERNIE-ViL model on different downstream tasks with the following command:
```script
sh run_finetuning.sh $task_name(vqa/flickr/refcoco_plus/vcr) conf/${task_name}/model_conf_${task_name} $vocab_file $ernie_vil_config $pretrain_models_params
```
Files needed for fine-tuning can be found in the download links given above, including the vocabulary dictionary, configuration
files and pre-trained parameters. Training details for the different downstream tasks (large scale) are summarized in the table below.

| Tasks    | Batch Size | Learning Rate | # of Epochs | GPUs    | Layer Decay rate | Hidden dropout |
| -------- | ----------:| -------------:| -----------:| -------:| ----------------:| --------------:|
| VCR      | 16(x4)     | 1e-4          | 6           | 4x V100 | 0.9              | 0.1            |
| VQA 2.0  | 64(x4)     | 1e-4          | 15          | 4x V100 | 0.9              | 0.1            |
| RefCOCO+ | 64(x2)     | 1e-4          | 30          | 2x V100 | 0.9              | 0.2            |
| Flickr   | 8(x8)      | 2e-5          | 40          | 8x V100 | 0.0              | 0.1            |

Our fine-tuning experiments on the downstream tasks are carried out on NVIDIA V100 (32GB) GPUs.
If your GPU memory is not enough, you can reduce the batch size in the corresponding configuration file, e.g., "conf/vcr/model_conf_vcr".
### Inference
You can use the following commands to run inference with the fine-tuned models.

#### VCR
```script
Task Q->A:  sh run_inference.sh vcr qa $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```
```script
Task Q->AR: sh run_inference.sh vcr qar $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```

#### VQA
```script
sh run_inference.sh vqa eval $split(val/test_dev/test_std) conf/vqa/model_conf_vqa $vocab_file $ernie_vil_config $model_params $res_file
```
_No test labels are given in the released test samples; you can obtain the final score by submitting the result file to the [VQA website](https://visualqa.org/)._

#### RefCOCO+
```script
sh run_inference.sh refcoco_plus eval $split(val/test_A/test_B) conf/refcoco_plus/model_conf_refcoco_plus $vocab_file $ernie_vil_config $model_params $res_file
```

#### Flickr
```script
sh run_inference.sh flickr eval $split(dev/test) conf/flickr/model_conf_flickr $vocab_file $ernie_vil_config $model_params $res_file
```
_Compute the final recall scores with the provided tool tools/get_recall.py._
## Citation
......
...@@ -5,7 +5,10 @@
- [Model Framework](#模型框架)
- [Pre-trained Models](#预训练模型)
- [Downstream Tasks](#下游任务)
* [Visual Commonsense Reasoning](#视觉常识推理)
* [Visual Question Answering](#视觉问答)
* [Cross-modal Retrieval](#跨模态检索)
* [Referring Expression Comprehension](#引用表达式理解)
- [Usage](#使用说明)
* [Install PaddlePaddle](#安装飞桨)
* [Run Fine-tuning](#运行微调)
...@@ -41,45 +44,110 @@ Structure of the ERNIE-ViL scene graph pre-training tasks
## Pre-trained Models
ERNIE-ViL uses large-scale image-text aligned data for pre-training. Based on the two out-of-domain datasets [**Conceptual
Captions**](https://www.aclweb.org/anthology/P18-1238.pdf) and [**SBU
Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio), we trained models of two parameter scales:
- [**ERNIE-ViL _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-en.1.tar.gz) (_lowercased | 12-text-stream-layer, 6-visual-stream-layer_)
- [**ERNIE-ViL _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)

Based on the two out-of-domain datasets ([**Conceptual
Captions**](https://www.aclweb.org/anthology/P18-1238.pdf), [**SBU
Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio)) together with two in-domain datasets ([**MS-COCO**](https://arxiv.org/abs/1405.0312), [**Visual-Genome**](https://arxiv.org/abs/1602.07332)), we also trained a large-scale model:
- [**ERNIE-ViL-Out&in-domain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-all-domain-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)
## Downstream Tasks
ERNIE-ViL has been evaluated on five vision-language downstream tasks: [**Visual Commonsense Reasoning**](https://openaccess.thecvf.com/content_CVPR_2019/papers/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.pdf),
[**Visual Question Answering**](https://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf),
[**Cross-modal Image Retrieval**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166),
[**Cross-modal Text Retrieval**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166) and
[**Referring Expression Comprehension**](https://www.aclweb.org/anthology/D14-1086.pdf); for comparisons with mainstream models, please refer to the published paper.

### **Visual Commonsense Reasoning**
* Datasets
* The training, validation and test data can be obtained from the [**VCR website**](http://visualcommonsense.com/download/);
* The organization of visual features follows [**ViLBERT**](https://github.com/jiasenlu/vilbert_beta), so this project directly uses the data from **ViLBERT**, which can be downloaded [here](https://github.com/jiasenlu/vilbert_beta/tree/master/data);
* Put all downloaded files under the directory data/vcr;

* Task pre-training: starting from the out-of-domain ERNIE-ViL models, we perform task pre-training on the VCR task and obtain the following models:
* [**ERNIE-ViL-VCR-task-pretrain _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-VCR-task-pre-en.1.tar.gz)
* [**ERNIE-ViL-VCR-task-pretrain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-VCR-task-pre-en.1.tar.gz)

* Performance: results of ERNIE-ViL on the VCR task are as follows:

| Models                             | <strong>Q->A</strong>        | <strong>QA->R</strong>       | <strong>Q->AR</strong>        |
| :--------------------------------- | :--------------------------: | :--------------------------: | :---------------------------: |
| ERNIE-ViL (task-pretrain) _base_   | 76.37(77.0)                  | 79.65(80.3)                  | 61.24(62.1)                   |
| ERNIE-ViL (task-pretrain) _large_  | <strong>78.52(79.2)</strong> | <strong>83.37(83.5)</strong> | <strong>65.81(66.3)</strong>  |

_Note: numbers outside the parentheses are validation-set results and numbers inside are test-set results; the test-set results are obtained by submitting to the [VCR leaderboard](https://visualcommonsense.com/leaderboard/)._
### **Visual Question Answering**
* Datasets
* The original images, questions and answers can be obtained from the [**VQA website**](https://visualqa.org/).
* Visual features are extracted with the tools in [**bottom-up attention**](https://github.com/jiasenlu/bottom-up-attention); the number of extracted boxes is fixed at 100.
* Training & test data are organized as follows:
```script
question_id, question, answer_label, answer_score, image_w, image_h, number_box, image_loc, image_embeddings
```
_The labels and scores of multiple answers are separated by '|'; all image-related fields can be extracted with the bottom-up attention tools._
* Performance: results of the three pre-trained ERNIE-ViL models on **VQA** are listed below.

| Models                            | <strong>test-dev</strong> | <strong>test-std</strong> |
| :-------------------------------- | :-----------------------: | :-----------------------: |
| ERNIE-ViL _base_                  | 73.18                     | 73.36                     |
| ERNIE-ViL _large_                 | 73.78                     | 73.96                     |
| ERNIE-ViL-Out&in-domain _large_   | 74.95                     | 75.10                     |
### **Cross-modal Retrieval**
* Datasets
* The original images and their captions can be obtained from [**here**](https://www.kaggle.com/hsankesara/flickr-image-dataset).
* Visual features are extracted with [**bottom-up attention**](https://github.com/jiasenlu/bottom-up-attention); the number of extracted boxes ranges from 0 to 36.
* For the text data, see the sample flickr.dev.data provided under data/flickr; the image features are organized as
```script
image_w, image_h, number_box, image_loc, image_embeddings
```
* Performance
* Results of the three pre-trained ERNIE-ViL models on **image retrieval (Flickr30k dataset)**:

| Models                            | <strong>R@1</strong>  | <strong>R@5</strong>  | <strong>R@10</strong> |
| :-------------------------------- | :-------------------: | :-------------------: | :-------------------: |
| ERNIE-ViL _base_                  | 74.44                 | 92.72                 | 95.94                 |
| ERNIE-ViL _large_                 | 75.10                 | 93.42                 | 96.26                 |
| ERNIE-ViL-Out&in-domain _large_   | 76.66                 | 94.16                 | 96.76                 |

* Results of the three pre-trained ERNIE-ViL models on **text retrieval (Flickr30k dataset)**:

| Models                            | <strong>R@1</strong>  | <strong>R@5</strong>  | <strong>R@10</strong> |
| :-------------------------------- | :-------------------: | :-------------------: | :-------------------: |
| ERNIE-ViL _base_                  | 86.70                 | 97.80                 | 99.00                 |
| ERNIE-ViL _large_                 | 88.70                 | 97.30                 | 99.10                 |
| ERNIE-ViL-Out&in-domain _large_   | 89.20                 | 98.50                 | 99.20                 |
### **Referring Expression Comprehension**
* Datasets
* Visual features are extracted following [MAttNet](https://github.com/lichengunc/MAttNet).
* A single training & validation sample is organized as
```script
expressions, image_w, image_h, number_box, number_boxes_gt, image_loc, image_embeddings, box_label, label
```
* Performance
* Results of the three pre-trained ERNIE-ViL models on **referring expression comprehension**:

| Models                            | <strong>val</strong>  | <strong>testA</strong> | <strong>testB</strong> |
| :-------------------------------- | :-------------------: | :--------------------: | :--------------------: |
| ERNIE-ViL _base_                  | 74.02                 | 80.33                  | 64.74                  |
| ERNIE-ViL _large_                 | 74.24                 | 80.97                  | 64.70                  |
| ERNIE-ViL-Out&in-domain _large_   | 75.89                 | 82.39                  | 66.91                  |
## Usage
...@@ -90,32 +158,62 @@ The ERNIE-ViL code is based on Paddle Fluid 1.8 and Python 2.7; other required modules are also
pip install -r requirements.txt
```
### Run Fine-tuning
Before running ERNIE-ViL fine-tuning, add the dynamic library paths of CUDA, cuDNN and NCCL2 to LD_LIBRARY_PATH. The parameter configuration files of the downstream tasks are placed under conf/, so fine-tuning can be driven simply by the configuration files. For example, you can fine-tune on each downstream task with the following command:
```script
sh run_finetuning.sh $task_name(vqa/flickr/refcoco_plus/vcr) conf/${task_name}/model_conf_${task_name} $vocab_file $ernie_vil_config $pretrain_models_params
```

The model links given above contain all required files, including the vocabulary file, configuration files and pre-trained parameters. The model and parameter configurations for fine-tuning can be found under conf/; the key settings behind the best results in the paper (large models) are summarized here:

| Tasks    | Batch Size | Learning Rate | # of Epochs | GPUs    | Layer Decay rate | Hidden dropout |
| -------- | ----------:| -------------:| -----------:| -------:| ----------------:| --------------:|
| VCR      | 16(x4)     | 1e-4          | 6           | 4x V100 | 0.9              | 0.1            |
| VQA 2.0  | 64(x4)     | 1e-4          | 15          | 4x V100 | 0.9              | 0.1            |
| RefCOCO+ | 64(x2)     | 1e-4          | 30          | 2x V100 | 0.9              | 0.2            |
| Flickr   | 8(x8)      | 2e-5          | 40          | 8x V100 | 0.0              | 0.1            |

All fine-tuning experiments on the downstream tasks are run on 32 GB NVIDIA V100 GPUs. If your GPU memory is not enough, consider running on more GPUs or reducing the batch_size in the configuration.
### Inference
With the fine-tuned models, you can evaluate the downstream tasks using the commands below (the related configuration files can be found in the previously downloaded packages):

#### VCR
```script
Task Q->A:  sh run_inference.sh vcr qa $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```
```script
Task Q->AR: sh run_inference.sh vcr qar $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```
_VCR inference can be run on a single 32 GB NVIDIA V100 GPU. The results cover the Q->A, QA->R and Q->AR tasks, where the Q->AR result is obtained by merging the results of the first two._

#### VQA
```script
sh run_inference.sh vqa eval $split(val/test_dev/test_std) conf/vqa/model_conf_vqa $vocab_file $ernie_vil_config $model_params $res_file
```
Note: _the VQA test samples carry no labels; submit the result file to the [**VQA website**](https://visualqa.org/) to obtain the scores._

#### RefCOCO+
```script
sh run_inference.sh refcoco_plus eval $split(val/test_A/test_B) conf/refcoco_plus/model_conf_refcoco_plus $vocab_file $ernie_vil_config $model_params $res_file
```

#### Flickr
```script
sh run_inference.sh flickr eval $split(dev/test) conf/flickr/model_conf_flickr $vocab_file $ernie_vil_config $model_params $res_file
```
Note: _the Flickr output is a prediction file; use tools/get_recall.py to compute the final recall scores._
## Citation
You can cite our paper in the following format:
......
...@@ -35,8 +35,13 @@ model_g.add_arg("task_name", str, "vcr", "Task to finetune on ERNIE-ViL")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 100, "Number of epoches for training.")
train_g.add_arg("learning_rate", float, 0.0001, "Learning rate used to train with warmup.")
train_g.add_arg("seq_dropout", float, 0.0, "dropout rate after the sequence output.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay", train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay', 'manual_warmup_decay']) "scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay', 'manual_warmup_decay'])
train_g.add_arg("layer_decay_rate", float, 0.0, "layer wise decay, 0.0 denote no layer decay")
train_g.add_arg("text_init_layers", int, 18, "diff from text and image layer, base:12-6=6, large:24-6=18")
train_g.add_arg("n_layers", int, 30, "max layers of text and image, base:12 + 6 , large:24 + 6")
train_g.add_arg("decay_steps", str, "", "learning rate decay steps, list with ;") train_g.add_arg("decay_steps", str, "", "learning rate decay steps, list with ;")
train_g.add_arg("lr_decay_ratio", float, 0.1, "learning rate decay ratio, used with manual_warmup_decay") train_g.add_arg("lr_decay_ratio", float, 0.1, "learning rate decay ratio, used with manual_warmup_decay")
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.") train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
...@@ -68,6 +73,9 @@ data_g.add_arg("feature_size", int, 2048, "Number of roi feature size of image." ...@@ -68,6 +73,9 @@ data_g.add_arg("feature_size", int, 2048, "Number of roi feature size of image."
data_g.add_arg("fusion_method", str, "sum", "Number of roi feature size of image.") data_g.add_arg("fusion_method", str, "sum", "Number of roi feature size of image.")
data_g.add_arg("batch_size", int, 16, "Total examples' number in batch for training. see also --in_tokens.") data_g.add_arg("batch_size", int, 16, "Total examples' number in batch for training. see also --in_tokens.")
data_g.add_arg("task_group_json", str, "", "Path to task json") data_g.add_arg("task_group_json", str, "", "Path to task json")
data_g.add_arg("scale_circle", float, "1.0", "The scale factor in circle loss function, only use in circle loss mode")
data_g.add_arg("use_sigmoid", bool, False, "Whether to use sigmoid to match score, use for explode problem")
data_g.add_arg("margin", float, "0.2", "The margin value in triplet loss function")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.")
......
...@@ -56,7 +56,6 @@ def prepare_batch_data(batch_records, num_choice, pad_id, task_index, task_num):
src_pos = np.array(batch_input_pos).astype("int64").reshape([num_choice * num_sample, max_len, 1])
src_seg = np.array(batch_seg_ids).astype("int64").reshape([num_choice * num_sample, max_len, 1])
src_masks = np.array(batch_input_masks).astype("float32").reshape([num_choice * num_sample, max_len, 1])
batch, seq_len, fea_len = image_embedding.shape
image_embedding = np.tile(np.expand_dims(image_embedding, axis=1), \
(1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, fea_len])
...@@ -64,7 +63,7 @@ def prepare_batch_data(batch_records, num_choice, pad_id, task_index, task_num):
(1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, 1])
image_loc = np.tile(np.expand_dims(image_loc, axis=1), \
(1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, 5])
return_list = [src_ids, src_pos, src_seg, src_masks, \
image_embedding, image_loc, image_mask, labels, batch_anno_ids]
return_list.append(np.array([task_index]).astype('int64'))
return_list.append(binary_labels)
...@@ -76,14 +75,187 @@ def prepare_batch_data(batch_records, num_choice, pad_id, task_index, task_num):
return return_list
def prepare_vqa_batch_data(insts,
total_token_num,
task_index,
task_num,
voc_size=0,
pad_id=None,
cls_id=None,
sep_id=None,
mask_id=None,
return_input_mask=True,
return_max_len=True,
return_num_token=False):
"""
prepare batch data for vqa tasks
"""
batch_src_ids = [inst["token_ids"] for inst in insts]
batch_sent_ids = [inst["sent_ids"] for inst in insts]
batch_pos_ids = [inst["pos_ids"] for inst in insts]
batch_image_embedding = [inst["image_embeddings"] for inst in insts]
batch_image_loc = [inst["image_loc"] for inst in insts]
batch_weight_label = [inst["weight_labels"] for inst in insts]
q_ids = np.array([inst["question_id"] for inst in insts])
#pad and trans to numpy array
src_id, self_input_mask, seq_lens = pad_batch_data(
batch_src_ids, pad_idx=pad_id, return_input_mask=True, return_seq_lens = True)
pos_id = pad_batch_data(batch_pos_ids, pad_idx=pad_id)
sent_id = pad_batch_data(batch_sent_ids, pad_idx=pad_id)
weight_labels = np.array(batch_weight_label).astype("float32")
#image_embedding_ori = copy.deepcopy(batch_image_embedding)
image_embedding, image_mask = pad_feature_data(batch_image_embedding, return_mask = True)
#image_embedding_ori = pad_feature_data(image_embedding_ori)
image_loc = pad_feature_data(batch_image_loc)
return_list = [
src_id, pos_id, sent_id, self_input_mask, \
image_embedding, image_loc, image_mask, weight_labels, q_ids
]
return return_list
def prepare_flickr_data(insts,
total_token_num,
task_index,
task_num,
voc_size=0,
pad_id=None,
cls_id=None,
sep_id=None,
mask_id=None,
outs=4,
return_input_mask=True,
return_max_len=True,
return_num_token=False):
"""
prepare flickr data for finetuning tasks
"""
if outs > 1:
batch_src_ids = [inst["token_ids"][out] for inst in insts for out in range(outs)]
batch_sent_ids = [inst["sent_ids"][out] for inst in insts for out in range(outs)]
batch_pos_ids = [inst["pos_ids"][out] for inst in insts for out in range(outs)]
batch_image_embedding = [inst["image_embeddings"][out] for inst in insts for out in range(outs)]
batch_image_loc = [inst["image_loc"][out] for inst in insts for out in range(outs)]
else:
batch_src_ids = [inst["token_ids"] for inst in insts]
batch_sent_ids = [inst["sent_ids"] for inst in insts]
batch_pos_ids = [inst["pos_ids"] for inst in insts]
batch_image_embedding = [inst["image_embeddings"] for inst in insts ]
batch_image_loc = [inst["image_loc"] for inst in insts ]
batch_ids = [inst["ids"] for inst in insts for out in range(outs)]
batch_size = int(len(batch_src_ids) / outs)
label = np.array([[0] for i in range(batch_size)], dtype = "int64")
src_id, self_input_mask, seq_lens = pad_batch_data(
batch_src_ids, pad_idx=pad_id, return_input_mask=True, return_seq_lens = True)
pos_id = pad_batch_data(batch_pos_ids, pad_idx=pad_id)
sent_id = pad_batch_data(batch_sent_ids, pad_idx=pad_id)
image_embeddings, image_mask = pad_feature_data(batch_image_embedding, return_mask = True)
image_loc = pad_feature_data(batch_image_loc)
ids = np.array(batch_ids, dtype = "int64")
return_list = [
src_id, pos_id, sent_id, self_input_mask, image_embeddings, image_loc, image_mask, label, ids]
return return_list
def prepare_refcoco_plus_batch_data(insts,
total_token_num,
task_index,
task_num,
voc_size=0,
pad_id=None,
return_input_mask=True,
return_max_len=True,
return_num_token=False):
"""
prepare batch data for refcoco_plus tasks
"""
batch_src_ids = [inst["token_ids"] for inst in insts]
batch_sent_ids = [inst["sent_ids"] for inst in insts]
batch_pos_ids = [inst["pos_ids"] for inst in insts]
batch_image_embedding = [inst["image_embeddings"] for inst in insts]
batch_image_loc = [inst["image_loc"] for inst in insts]
batch_image_label = [inst["label"] for inst in insts]
add_items = np.array([inst["add_item"] for inst in insts], dtype="float32")
src_id, self_input_mask, seq_lens = pad_batch_data(
batch_src_ids, pad_idx=pad_id, return_input_mask=True, return_seq_lens = True)
pos_id = pad_batch_data(batch_pos_ids, pad_idx=pad_id)
sent_id = pad_batch_data(batch_sent_ids, pad_idx=pad_id)
image_embedding, image_mask = pad_feature_data(batch_image_embedding, return_mask = True)
image_loc = pad_feature_data(batch_image_loc)
image_label = pad_feature_data(batch_image_label)
return_list = [
src_id, pos_id, sent_id, self_input_mask, seq_lens, \
image_embedding, image_loc, image_mask, image_label, add_items
]
return return_list
def pad_batch_data(insts,
pad_idx=0,
return_pos=False,
return_input_mask=False,
return_max_len=False,
return_num_token=False,
return_seq_lens=False):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and attention bias.
"""
return_list = []
max_len = max(len(inst) for inst in insts)
# Any token included in dict can be used to pad, since the paddings' loss
# will be masked out by weights and make no effect on parameter gradients.
inst_data = np.array(
[inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
# position data
if return_pos:
inst_pos = np.array([
list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
for inst in insts
])
return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
if return_input_mask:
# This is used to avoid attention on paddings.
input_mask_data = np.array([[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts])
input_mask_data = np.expand_dims(input_mask_data, axis=-1)
return_list += [input_mask_data.astype("float32")]
if return_max_len:
return_list += [max_len]
if return_num_token:
num_token = 0
for inst in insts:
num_token += len(inst)
return_list += [num_token]
if return_seq_lens:
seq_lens = np.array([len(inst) for inst in insts])
return_list += [seq_lens.astype("int64").reshape([-1, 1])]
return return_list if len(return_list) > 1 else return_list[0]
def pad_feature_data(data, pad_value=0.0, dtype="float32", return_mask=False):
"""
pad visual features with given pad value
"""
max_length = max([len(item) for item in data])
data_width = len(data[0][0])
out_data = np.ones((len(data), max_length, data_width), dtype=dtype) * pad_value
out_mask = np.zeros((len(data), max_length, 1), dtype=dtype)
for i in range(len(data)):
out_data[i, 0: len(data[i]), :] = data[i]
if return_mask:
......
output_model_path="./output_flickr"
lr_scheduler="manual_warmup_decay"
decay_steps="54360;72480"
num_train_steps=90600
SAVE_STEPS=4530
WARMUP_STEPS=9060
BATCH_SIZE=4
LR_RATE=1e-5
WEIGHT_DECAY=0.01
MAX_LEN=48
hardest=False
meansum=False
use_circle_loss=True
scale_circle=32.0
use_sigmoid=True
margin=0.3
[
{
"prob": 1.0,
"data_func": "image_text_match",
"task_name": "image_text_match",
"train_filelist": "./conf/flickr/flickr_retrieval_train.filelist",
"train_image_path":"./data/flickr/flickr_bottom_up_10_36.csv.0",
"train_caption_path":"./data/flickr/flickr.train.data",
"hardest_setting_path":"./data/flickr/hard_negative.pkl",
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"vocab_path" : "./package/vocab.txt",
"negative_schema":[ "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei"]
}
]
[
{
"prob": 1.0,
"data_func": "image_text_match",
"task_name": "image_text_match",
"test_filelist": "./conf/flickr/flickr_retrieval_test.filelist",
"dev_filelist": "./conf/flickr/flickr_retrieval_dev.filelist",
"train_filelist": "./conf/flickr/flickr_retrieval_train.filelist",
"train_image_path":"./data/flickr/flickr_bottom_up_10_36.csv.0",
"dev_image_path": "data/flickr/flickr_dev_10_36.csv.0",
"test_image_path": "data/flickr/flickr_test_10_36.csv.0",
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"vocab_path" : "./package/vocab.txt"
}
]
output_model_path="output_refcoco_plus"
lr_scheduler="manual_warmup_decay"
decay_steps="3290;4700;7050"
lr_decay_ratio=0.2
num_train_steps=26640
SAVE_STEPS=470
WARMUP_STEPS=940
BATCH_SIZE=32
VALID_STEPS=20000
LR_RATE=2e-5
WEIGHT_DECAY=0.01
MAX_LEN=80
layer_decay_rate=0.9
./data/refcoco_plus/refer_testA.part
./data/refcoco_plus/refer_testB.part
./data/refcoco_plus/train_part_sample
./data/refcoco_plus/refer_val.part
[
{
"prob": 1.0,
"data_func": "refcoco+",
"task_name": "refcoco+",
"train_filelist": "./conf/refcoco_plus/refer_train.filelist",
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"vocab_path" : "./package/vocab.txt"
}
]
[
{
"prob": 1.0,
"data_func": "refcoco+",
"task_name": "refcoco+",
"val_filelist": "./conf/refcoco_plus/refer_val.filelist",
"testA_filelist": "./conf/refcoco_plus/refer_testA.filelist",
"testB_filelist": "./conf/refcoco_plus/refer_testB.filelist",
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"vocab_path" : "./package/vocab.txt"
}
]
lr_decay_dict_file="./conf/vqa/vqa_finetune_decay.list"
output_model_path="output_20_mask"
lr_scheduler="manual_warmup_decay"
num_train_steps=50200
SAVE_STEPS=2510
WARMUP_STEPS=3710
BATCH_SIZE=16
LR_RATE=1e-4
decay_steps="15100;22590"
WEIGHT_DECAY=0.01
layer_decay_rate=0.9
text_init_layers=6
n_layers=18
MAX_LEN=16
task_group_json=./conf/vqa/task_vqa.json
[
{
"prob": 1.0,
"valid_filelist": "./conf/vqa/vqa_valid.filelist",
"vg_train_filelist": "./conf/vqa/vg_train.filelist",
"train_filelist": "./conf/vqa/vqa_train.filelist",
"num_class": 3129,
"classifier_hid_size": 2048,
"vg_init_epochs": 2,
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"vocab_path" : "./package/vocab.txt"
}
]
[
{
"prob": 1.0,
"data_func": "image_text_match",
"task_name": "image_text_match",
"val_filelist": "./conf/vqa/vqa_val.filelist",
"test_dev_filelist": "./conf/vqa/vqa_test_dev.filelist",
"test_std_filelist": "./conf/vqa/vqa_test_std.filelist",
"pickle_file": "./data/vqa/trainval_label2ans.pkl",
"num_class": 3129,
"classifier_hid_size": 2048,
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"vocab_path" : "./package/vocab.txt"
}
]
./data/vqa/vg_train_part_sample
vqa_fc_w_0 2.5
vqa_fc_w_1 2.5
vqa_fc_b_0 2.5
vqa_fc_b_1 2.5
./data/vqa/test_dev_part_sample
./data/vqa/test_std_part_sample
./data/vqa/train_part_sample
67 0 A group of people stand in the back of a truck filled with cotton .
This diff is collapsed.
This diff is collapsed.
...@@ -63,7 +63,6 @@ class ErnieVilModel(object):
src_ids,
position_ids,
sentence_ids,
input_mask,
image_embeddings,
image_loc,
...@@ -115,10 +114,10 @@ class ErnieVilModel(object):
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=config['initializer_range'])
self._build_model(src_ids, position_ids, sentence_ids, input_mask, \
image_embeddings, image_loc, input_image_mask)
def _build_model(self, src_ids, position_ids, sentence_ids, input_mask, \
image_embeddings, image_loc, input_image_mask):
# padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding(
......
...@@ -20,9 +20,7 @@ import numpy as np
import paddle.fluid as fluid
def manual_warmup_decay(learning_rate, warmup_steps, num_train_steps, decay_steps=[], lr_decay_ratio=0.1):
""" Applies linear warmup of the learning rate from 0, then keeps it constant."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
...@@ -49,9 +47,7 @@ def manual_warmup_decay(learning_rate, warmup_steps, num_train_steps, decay_step
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
""" Applies linear warmup of the learning rate from 0, then decays it to 0."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
...@@ -78,6 +74,41 @@ def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
return lr
def layer_decay(param, param_last, learning_rate, decay_rate, text_layers, n_layers):
""" layer_decay implementation """
delta = param - param_last
if "encoder_layer" in param.name and param.name.index("encoder_layer")==0:
layer = int(param.name.split("_")[2])
if layer >= text_layers:
cur_layer = text_layers + (layer - text_layers) * 2 + 1
else:
cur_layer = layer
ratio = decay_rate ** (n_layers - cur_layer)
print("text_layer_name:", param.name, "\t", "ratio:", ratio)
param_update = param + (ratio - 1) * delta
elif "encoder_vlayer" in param.name and param.name.index("encoder_vlayer")==0:
layer = int(param.name.split("_")[2])
cur_layer = text_layers + (layer) * 2 + 1
ratio = decay_rate ** (n_layers - cur_layer)
param_update = param + (ratio - 1) * delta
print("image_layer_name:", param.name, "\t", "ratio:", ratio)
elif "encoder_colayer" in param.name and param.name.index("encoder_colayer")==0:
layer = int(param.name.split("_")[2])
cur_layer = text_layers + (layer) * 2
ratio = decay_rate ** (n_layers - cur_layer)
param_update = param + (ratio - 1) * delta
print("co_layer_name:", param.name, "\t", "ratio:", ratio)
elif "embedding" in param.name:
ratio = decay_rate ** (n_layers + 1)
param_update = param + (ratio - 1) * delta
elif "image_emb" in param.name or "image_loc" in param.name:
ratio = decay_rate ** (n_layers - text_layers + 1)
param_update = param + (ratio - 1) * delta
else:
param_update = None
return param_update
def optimization(loss,
warmup_steps,
num_train_steps,
...@@ -88,10 +119,11 @@ def optimization(loss,
scheduler='linear_warmup_decay',
decay_steps=[],
lr_decay_dict_file="",
lr_decay_ratio=0.1,
layer_decay_rate=0.0,
text_init_layers=18,
n_layers=30):
""" optimization implementation """
if warmup_steps > 0:
if scheduler == 'noam_decay':
scheduled_lr = fluid.layers.learning_rate_scheduler \
...@@ -135,8 +167,7 @@ def optimization(loss,
clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0))
def exclude_from_weight_decay(name):
""" parameters that do not use weight decay
"""
if name.find("layer_norm") > -1:
return True
...@@ -154,6 +185,16 @@ def optimization(loss,
_, param_grads = optimizer.minimize(loss)
if layer_decay_rate > 0:
for param, grad in param_grads:
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("layer_decay"):
param_decay = layer_decay(param, param_list[param.name], scheduled_lr,
layer_decay_rate, text_init_layers, n_layers)
if param_decay:
fluid.layers.assign(output=param, input=param_decay)
if weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param.name):
......
...@@ -17,7 +17,6 @@ import sys
import os
import base64
import numpy as np
reload(sys)
sys.setdefaultencoding("utf-8")
...@@ -25,7 +24,7 @@ from preprocess import tokenization
class PreprocessorBasic(object):
"""
parent class for preprocess
"""
def __init__(self,
tokenizer_name,
...@@ -39,7 +38,7 @@ class PreprocessorBasic(object):
def convert_sentence_to_ids_without_cls(self, sentence):
"""
convert sentence to ids without cls
"""
tokens = self.tokenizer.tokenize(sentence)
ids = self.tokenizer.convert_tokens_to_ids(tokens)
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" VQA Data Reader implementation """
from __future__ import print_function
from __future__ import division
import os
import base64
import functools
import numpy as np
import types
import gzip
import logging
import re
import six
import collections
import copy
import random
import pickle
import paddle
import paddle.fluid as fluid
from batching.finetune_batching import prepare_flickr_data
from preprocess import preprocessor
class FlickrDataReader(object):
"""
data reader task for flickr
"""
def __init__(self,
task_group,
vocab_path,
split,
batch_size=4096,
max_seq_len=512,
shuffle_files=True,
epoch=100,
voc_size=0,
cls_size=0,
is_test=False):
self.vocab = self.load_vocab(vocab_path)
self.task_group = task_group
self.max_seq_len = max_seq_len
self.processor = getattr(preprocessor, task_group[0]["Proprocessor"])(
tokenizer_name =self.task_group[0]["tokenizer_name"],
vocab_path = vocab_path)
self.batch_size = batch_size
self.shuffle_files = shuffle_files
self.epoch = epoch
self.current_epoch = 0
self.current_file_index = 0
self.total_file = 0
self.current_file = None
self.voc_size = voc_size
self.cls_size = cls_size
self.pad_id = self.vocab["[PAD]"]
self.cls_id = self.vocab["[CLS]"]
self.sep_id = self.vocab["[SEP]"]
self.mask_id = self.vocab["[MASK]"]
self.is_test = is_test
if self.is_test:
self.epoch = 1
self.shuffle_files = False
self._test_image_list = []
if split == "dev":
image_path = self.task_group[0]["dev_image_path"]
else:
image_path = self.task_group[0]["test_image_path"]
else:
caption_path = self.task_group[0]["train_caption_path"]
self._load_caption_dict(caption_path)
image_path = self.task_group[0]["train_image_path"]
self._get_hardest_setting(self.task_group[0]["hardest_setting_path"])
self._negative_schema=self.task_group[0]["negative_schema"]
self._load_image_dict(image_path)
def decode_all(self, image_id, width, height, number_box, boxes, image_embeddings):
""" decode all data """
def decode_feature(base64_str, size):
""" decode feature from base64 """
size = int(size)
fea_base64 = base64.b64decode(base64_str)
fea_decode = np.frombuffer(fea_base64, dtype=np.float32)
shape = size, int(fea_decode.shape[0] / size)
features = np.resize(fea_decode, shape)
return features
image_embeddings = decode_feature(image_embeddings, number_box)
image_embeddings_cls = np.mean(image_embeddings, axis = 0, keepdims = True)
image_embeddings = np.concatenate([image_embeddings_cls, image_embeddings], 0)
boxes = decode_feature(boxes, number_box)
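# Normalize the pixel-space boxes by image width/height and append the
# relative box area, giving a 5-dim location [x1/W, y1/H, x2/W, y2/H, area];
# a whole-image box is prepended to pair with the mean-pooled "CLS" image
# feature added above.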
shape = np.repeat(np.array([[float(width), float(height), float(width), float(height)]]), \
number_box, axis=0)
boxes = boxes / shape
area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
image_loc = np.concatenate((boxes, np.expand_dims(area, 1)), axis = 1)
loc_cls = np.array([[0.0, 0.0, 1.0, 1.0, 1.0]], dtype = "float32")
image_loc = np.concatenate([loc_cls, image_loc], 0)
return int(number_box) + 1, image_loc, image_embeddings
def _load_image_dict(self, image_path):
self._image_feature_dict = {}
image_items = image_path.split(',')
cnt = 0
for image_item in image_items:
with open(image_item) as f:
for line in f:
cnt += 1
if cnt % 1000 == 0:
print('processing image feature:', cnt)
image_id, width, height, number_box, image_loc, image_embeddings \
= line.strip().split('\t')
number_box, image_loc, image_embeddings = self.decode_all( \
image_id, width, height, number_box, image_loc, image_embeddings)
self._image_feature_dict[int(image_id)] = (width, height, image_embeddings, number_box, image_loc)
if self.is_test:
self._test_image_list.append(int(image_id))
def _load_caption_dict(self, image_caption):
"""
Load caption dict for flickr
"""
self._caption_ids_dict = {}
self._image_sent_map = {}
with open(image_caption) as f:
cnt = 0
for line in f:
cnt += 1
line = line.strip().split("\t")
image_id, sent_id, text = line
token_ids = []
raw_ids = self.processor.convert_sentence_to_ids_without_cls(text)
token_ids.append(self.vocab["[CLS]"])
token_ids.extend(raw_ids)
token_ids.append(self.vocab["[SEP]"])
sent_ids = [0] * len(token_ids)
pos_ids = range(0, len(token_ids))
if cnt % 5000 == 0:
print(cnt)
if len(token_ids) > self.max_seq_len:
token_ids = token_ids[0: self.max_seq_len - 1] + [token_ids[-1]]
sent_ids = sent_ids[0: self.max_seq_len - 1] + [sent_ids[-1]]
pos_ids = pos_ids[0: self.max_seq_len]
assert len(token_ids) == len(sent_ids) == len(pos_ids), \
"[Must be true]len(token_ids) == len(sent_ids) == len(pos_ids)"
self._caption_ids_dict[int(sent_id)] = \
[token_ids, sent_ids, pos_ids, int(image_id)]
self._image_sent_map.setdefault(int(image_id), [])
self._image_sent_map[int(image_id)].append(int(sent_id))
self._train_caption_ids = self._caption_ids_dict.keys()
def _get_hardest_setting(self, hardest_setting_path):
"""
Get the hard negative pool and image list used for training
"""
with open(hardest_setting_path, 'rb') as f:
data = pickle.load(f)
self._train_hard_pool = data['train_hard_pool']
self._train_image_list = data['train_image_list']
self._train_imgId2pool = {imageId:i for i, imageId in enumerate(self._train_image_list)}
def get_progress(self):
"""
Return current progress of training data
"""
progress_dict = {"current_epoch": self.current_epoch,
"current_file_index": self.current_file_index,
"total_file": self.total_file,
"current_file": self.current_file
}
return progress_dict
def process_vl(self, line, max_seq_len):
"""
Process single v+l data
"""
if self.is_test:
line = line.strip().split("\t")
image_id, sent_id, text = line
token_ids = []
raw_ids = self.processor.convert_sentence_to_ids_without_cls(text)
token_ids.append(self.vocab["[CLS]"])
token_ids.extend(raw_ids)
token_ids.append(self.vocab["[SEP]"])
sent_ids = [0] * len(token_ids)
pos_ids = range(0, len(token_ids))
if len(token_ids) > self.max_seq_len:
token_ids = token_ids[0: self.max_seq_len - 1] + [token_ids[-1]]
sent_ids = sent_ids[0: self.max_seq_len - 1] + [sent_ids[-1]]
pos_ids = pos_ids[0: self.max_seq_len]
width, height, image_embeddings, number_box, image_loc = self._image_feature_dict[int(image_id)]
else:
sent_id = line
captions_pos = self._caption_ids_dict[sent_id]
image_id = captions_pos[-1]
captions = [captions_pos]
_, _, features, number_box, box = self._image_feature_dict[image_id]
images = [[features, number_box, box]]
for item in self._negative_schema:
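# Each two-character schema code produces one negative pair: the first character
# selects the sampling source ("h" = hard negative from the precomputed pool,
# "e" = easy/random image), and the second selects what is replaced
# ("i" = keep the caption but use the negative image's features,
# "c" = keep the image but use a caption of the negative image).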
if item[0] == "h":
rand_img_id_pool = self._train_hard_pool[self._train_imgId2pool[image_id]]
rand_idx = rand_img_id_pool[random.randint(1, len(rand_img_id_pool) - 1)]
image_id_neg = self._train_image_list[int(rand_idx)]
elif item[0] == "e":
while True:
image_id_neg = random.choice(self._train_image_list)
if image_id_neg != image_id:
break
else:
print("error negative schema")
exit()
if item[1] == "i":
_, _, features_neg, number_box_neg, box_neg = self._image_feature_dict[image_id_neg]
captions.append(self._caption_ids_dict[sent_id])
images.append([features_neg, number_box_neg, box_neg])
elif item[1] == "c":
sent_id_neg = random.choice(self._image_sent_map[image_id_neg])
captions.append(self._caption_ids_dict[sent_id_neg])
images.append([features, number_box, box])
else:
print("error negative schema")
exit()
token_ids, sent_ids, pos_ids, _ = zip(*captions)
image_embeddings, number_box, image_loc = zip(*images)
sample_json = {
"token_ids": token_ids,
"sent_ids": sent_ids,
"pos_ids": pos_ids,
"image_loc": image_loc,
"image_embeddings": image_embeddings,
"image_id": int(image_id),
"sent_id": int(sent_id),
"ids": [image_id, sent_id]
}
return sample_json
def parse_line(self, line, max_seq_len=512, task_index=None):
""" parse one line to token_ids, sentence_ids, pos_ids, label """
sample_json = self.process_vl(line, max_seq_len)
token_ids = sample_json["token_ids"]
return sample_json
def read_file(self, file, task_index):
"""
read line data from file
"""
if self.is_test:
with open(file) as f:
lines = f.readlines()
for line in lines:
yield line
else:
random.shuffle(self._train_caption_ids)
for item in self._train_caption_ids:
yield item
def convert_to_unicode(self, text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(self, vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
fin = open(vocab_file)
for num, line in enumerate(fin):
items = self.convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
index = items[1] if len(items) == 2 else num
token = token.strip()
vocab[token] = int(index)
return vocab
def data_generator(self):
""" data_generator """
filelist_key = "train_filelist"
if self.is_test:
filelist_key = "dev_filelist"
all_files = []
task_probs = []
sum = 0.0
for task in self.task_group:
all_files.append(open(task[filelist_key]).readlines())
task_probs.append(task["prob"])
sum += task["prob"]
for i in xrange(len(task_probs)):
task_probs[i] = task_probs[i] / sum
task_probs = np.array(task_probs).ravel()
def wrapper():
""" wrapper """
def reader(task_index):
""" reader """
files = all_files[task_index]
for epoch in range(self.epoch):
if self.shuffle_files:
np.random.shuffle(files)
for index, file in enumerate(files):
file = file.strip()
sample_generator = paddle.reader.xmap_readers(self.parse_line, \
functools.partial(self.read_file, file=file, task_index=task_index), 8, 2000)
for sample in sample_generator():
if not self.is_test:
self.current_epoch = epoch + 1
self.current_file_index = index + 1
self.current_file = file
self.total_file = len(files)
yield sample
else:
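# At test time each caption is paired with every image in self._test_image_list,
# so that the full caption-image score matrix can be assembled downstream.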
cap_id = sample["ids"][1]
for image_id in self._test_image_list:
line_json = copy.deepcopy(sample)
_, _, image_embeddings, number_box, image_loc = self._image_feature_dict[image_id]
line_json["image_embeddings"] = image_embeddings
line_json["image_loc"] = image_loc
line_json["ids"][0] = image_id
yield line_json
def batch_reader(reader, batch_size):
""" batch reader """
batch, total_token_num, max_len = [], 0, 0
cur_size = 0
dev_count = 1
buff = []
readers = []
for i in xrange(len(task_probs)):
buff.append(None)
readers.append(reader(i))
task_indices = range(len(task_probs))
end_times = 0
while end_times < 50:
task_index = np.random.choice(task_indices, p=task_probs)
dev_num = 0
cur_reader = readers[task_index]
while dev_num < dev_count:
if buff[task_index] is not None:
cur_len = len(buff[task_index]["token_ids"])
max_len = max(max_len, cur_len)
batch.append(buff[task_index])
total_token_num += cur_len
buff[task_index] = None
cur_size += 1
parsed_line = next(cur_reader, None)
if parsed_line is None:
end_times += 1
dev_num += 1
if len(batch) > 0:
yield batch, total_token_num, task_index
batch, total_token_num, max_len = [], 0, 0
continue
end_times = 0
cur_len = len(parsed_line["token_ids"])
max_len = max(max_len, cur_len)
if cur_size >= batch_size:
yield batch, total_token_num, task_index
batch, total_token_num, max_len = [], 0, 0
cur_size = 0
dev_num += 1
buff[task_index] = parsed_line
else:
batch.append(parsed_line)
cur_size += 1
total_token_num += cur_len
for batch_data, total_token_num, task_index in batch_reader(reader, self.batch_size):
if self.is_test:
outs = 1
else:
outs = len(self._negative_schema)+1
yield prepare_flickr_data(
batch_data,
total_token_num,
task_index,
len(self.task_group),
voc_size=self.voc_size,
pad_id=self.pad_id,
cls_id=self.cls_id,
sep_id=self.sep_id,
mask_id=self.mask_id,
outs=outs,
return_input_mask=True,
return_max_len=False,
return_num_token=False)
return wrapper
if __name__ == "__main__":
pass
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" RefcocoPlus DataReader implementation """
from __future__ import print_function
from __future__ import division
import os
import base64
import numpy as np
import types
import gzip
import logging
import re
import six
import collections
import random
import paddle
import paddle.fluid as fluid
from batching.finetune_batching import prepare_refcoco_plus_batch_data
from preprocess import preprocessor
class RefcocoPlusDataReader(object):
"""
Data reader for the RefCOCO+ task
"""
def __init__(self,
task_group,
split,
vocab_path,
batch_size=4096,
max_seq_len=512,
shuffle_files=True,
epoch=100,
voc_size=0,
is_test=False):
self.vocab = self.load_vocab(vocab_path)
self.task_group = task_group
self.processor = getattr(preprocessor, task_group[0]["Proprocessor"])(
tokenizer_name =self.task_group[0]["tokenizer_name"],
vocab_path = vocab_path)
self.batch_size = batch_size
self.shuffle_files = shuffle_files
self.epoch = epoch
self.split = split
self.current_epoch = 0
self.current_file_index = 0
self.total_file = 0
self.current_file = None
self.voc_size = voc_size
self.max_seq_len = max_seq_len
self.pad_id = self.vocab["[PAD]"]
self.cls_id = self.vocab["[CLS]"]
self.sep_id = self.vocab["[SEP]"]
self.mask_id = self.vocab["[MASK]"]
self.input_slots = 9
self.is_test = is_test
if is_test:
self.epoch = 1
self.shuffle_files = False
def get_progress(self):
"""
Return current progress of training data
"""
self.progress_dict = {"current_epoch": self.current_epoch,
"current_file_index": self.current_file_index,
"total_file": self.total_file,
"current_file": self.current_file
}
return self.progress_dict
def process_vl(self, line, max_seq_len):
"""
process single v+l data
"""
def decode_feature(base64_str, size):
"""
decode feature from base64
"""
fea_base64 = base64.b64decode(base64_str)
fea_decode = np.frombuffer(fea_base64, dtype=np.float32)
shape = size, int(fea_decode.shape[0] / size)
features = np.resize(fea_decode, shape)
return features
text, image_w, image_h, number_boxes, number_boxes_gl, image_loc, \
image_embeddings, box_label, label = line
token_ids = []
raw_ids = self.processor.convert_sentence_to_ids_without_cls(text)
token_ids.append(self.vocab["[CLS]"])
token_ids.extend(raw_ids)
token_ids.append(self.vocab["[SEP]"])
sent_ids = [0] * len(token_ids)
pos_ids = range(0, len(token_ids))
#print("sent_ids:", sent_ids)
token_ids = [int(token) for token in token_ids]
sent_ids = [int(token) for token in sent_ids]
pos_ids = [int(token) for token in pos_ids]
assert len(token_ids) == len(sent_ids) == len(pos_ids), \
"[Must be true]len(token_ids) == len(sent_ids) == len(pos_ids)"
if len(token_ids) > self.max_seq_len:
token_ids = token_ids[0: self.max_seq_len - 1] + [token_ids[-1]]
sent_ids = sent_ids[0: self.max_seq_len - 1] + [sent_ids[-1]]
pos_ids = pos_ids[0: self.max_seq_len]
all_number_box = int(number_boxes) + int(number_boxes_gl)
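# Normalize all (detected + global) box coordinates by the image size, append the
# relative box area as a fifth location feature, and prepend a whole-image box whose
# embedding is the mean of the region features, acting as the global image token.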
image_loc = decode_feature(image_loc, all_number_box)
shape_np = np.repeat(np.array(\
[[float(image_w), float(image_h), float(image_w), float(image_h)]]), all_number_box, axis=0)
boxes_np = image_loc / shape_np
area = (boxes_np[:, 3] - boxes_np[:, 1]) * (boxes_np[:, 2] - boxes_np[:, 0])
image_loc = np.concatenate((boxes_np, np.expand_dims(area, 1)), axis = 1)
loc_cls = np.array([[0.0, 0.0, 1.0, 1.0, 1.0]], dtype = "float32")
image_loc = np.concatenate([loc_cls, image_loc], 0)
image_embeddings = decode_feature(image_embeddings, all_number_box)
image_embeddings_cls = np.mean(image_embeddings, axis = 0, keepdims = True)
image_embeddings = np.concatenate([image_embeddings_cls, image_embeddings], 0)
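# The label at the prepended whole-image position is the area ratio of the
# ground-truth box to the image; it, and during training the per-region scores
# (presumably overlap with the ground-truth box), are zeroed below score_th.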
x1, y1, x2, y2 = [float(item) for item in box_label.split(" ")]
cls_label = (x2 - x1 + 1) * (y2 - y1 + 1) /(float(image_w) * float(image_h))
score_th = 0.5
if cls_label < score_th:
cls_label = 0.0
label_tmp = label.split(" ")
if not self.is_test:
for i in range(len(label_tmp)):
if float(label_tmp[i]) < score_th:
label_tmp[i] = 0.0
label = [[cls_label]] + [[float(token)] for token in label_tmp]
label = np.array(label, dtype="float32")
add_item = [all_number_box + 1, image_w, image_h] + [float(item) for item in box_label.split(" ")]
sample_json = {
"token_ids": token_ids,
"sent_ids": sent_ids,
"pos_ids": pos_ids,
"label": label,
"image_loc": image_loc,
"image_embeddings": image_embeddings,
"all_number_box": all_number_box,
"add_item": add_item
}
return sample_json
def parse_line(self, line, max_seq_len=512, task_index=None):
""" parse one line to token_ids, sentence_ids, pos_ids, label """
line = line.strip().split("\t")
assert len(line) == self.input_slots, "One sample must have %d fields!" % self.input_slots
sample_json = self.process_vl(line, max_seq_len)
token_ids = sample_json["token_ids"]
return sample_json
def read_file(self, file, task_index):
""" read line data from a file """
try:
assert file.endswith('.gz'), "[ERROR] %s is not a gzip file" % file
with gzip.open(file, "rb") as f:
lines = f.readlines()
except:
with open(file, "rb") as f:
lines = f.readlines()
if not self.is_test:
np.random.shuffle(lines)
for line in lines:
parsed_line = self.parse_line(
line, max_seq_len=self.max_seq_len, task_index=task_index)
if parsed_line is None:
continue
yield parsed_line
def convert_to_unicode(self, text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(self, vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
fin = open(vocab_file)
for num, line in enumerate(fin):
items = self.convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
index = items[1] if len(items) == 2 else num
token = token.strip()
vocab[token] = int(index)
return vocab
def data_generator(self):
""" data_generator """
if self.split == "train":
filelist_key = "train_filelist"
elif self.split == "val":
filelist_key = "val_filelist"
elif self.split == "testA":
filelist_key = "testA_filelist"
else: filelist_key = "testB_filelist"
all_files = []
task_probs = []
sum = 0.0
for task in self.task_group:
all_files.append(open(task[filelist_key]).readlines())
task_probs.append(task["prob"])
sum += task["prob"]
for i in xrange(len(task_probs)):
task_probs[i] = task_probs[i] / sum
task_probs = np.array(task_probs).ravel()
def wrapper():
"""
wrapper
"""
def reader(task_index):
"""
reader
"""
files = all_files[task_index]
for epoch in range(self.epoch):
if self.shuffle_files:
if epoch < 0:
files = files + open(task["gt_train_filelist"]).readlines()
np.random.shuffle(files)
for index, file in enumerate(files):
file = file.strip()
sample_generator = self.read_file(file, task_index)
for sample in sample_generator:
self.current_epoch = epoch + 1
self.current_file_index = index + 1
self.current_file = file
self.total_file = len(files)
if sample is None:
continue
yield sample
def batch_reader(reader, batch_size):
"""
batch reader
"""
batch, total_token_num, max_len = [], 0, 0
cur_size = 0
dev_count = 1
buff = []
readers = []
for i in xrange(len(task_probs)):
buff.append(None)
readers.append(reader(i))
task_indices = range(len(task_probs))
end_times = 0
while end_times < 50:
task_index = np.random.choice(task_indices, p=task_probs)
dev_num = 0
cur_reader = readers[task_index]
while dev_num < dev_count:
if buff[task_index] is not None:
cur_len = len(buff[task_index]["token_ids"])
max_len = max(max_len, cur_len)
batch.append(buff[task_index])
total_token_num += cur_len
buff[task_index] = None
cur_size += 1
parsed_line = next(cur_reader, None)
if parsed_line is None:
end_times += 1
dev_num += 1
if len(batch) > 0:
yield batch, total_token_num, task_index
batch, total_token_num, max_len = [], 0, 0
continue
end_times = 0
cur_len = len(parsed_line["token_ids"])
max_len = max(max_len, cur_len)
if cur_size >= batch_size:
yield batch, total_token_num, task_index
batch, total_token_num, max_len = [], 0, 0
cur_size = 0
dev_num += 1
buff[task_index] = parsed_line
else:
batch.append(parsed_line)
cur_size += 1
total_token_num += cur_len
for batch_data, total_token_num, task_index in batch_reader(reader, self.batch_size):
yield prepare_refcoco_plus_batch_data(
batch_data,
total_token_num,
task_index,
len(self.task_group),
voc_size=self.voc_size,
pad_id=self.pad_id,
return_input_mask=True,
return_max_len=False,
return_num_token=False)
return wrapper
if __name__ == "__main__":
pass
...@@ -33,9 +33,7 @@ from batching.finetune_batching import prepare_batch_data ...@@ -33,9 +33,7 @@ from batching.finetune_batching import prepare_batch_data
import paddle.fluid as fluid import paddle.fluid as fluid
def _converId(img_id): def _converId(img_id):
""" """ conversion for image ID """
conversion for image ID
"""
img_id = img_id.split('-') img_id = img_id.split('-')
if 'train' in img_id[0]: if 'train' in img_id[0]:
new_id = int(img_id[1]) new_id = int(img_id[1])
...@@ -49,9 +47,7 @@ def _converId(img_id): ...@@ -49,9 +47,7 @@ def _converId(img_id):
def _load_annotationsQ_A(annotations_jsonpath, split): def _load_annotationsQ_A(annotations_jsonpath, split):
""" """Build an index out of FOIL annotations, mapping each image ID with its corresponding captions."""
Build an index out of FOIL annotations, mapping each image ID with its corresponding captions.
"""
entries = [] entries = []
with open(annotations_jsonpath) as f: with open(annotations_jsonpath) as f:
for annotation in json_lines.reader(f): for annotation in json_lines.reader(f):
...@@ -76,9 +72,7 @@ def _load_annotationsQ_A(annotations_jsonpath, split): ...@@ -76,9 +72,7 @@ def _load_annotationsQ_A(annotations_jsonpath, split):
def _load_annotationsQA_R(annotations_jsonpath, split): def _load_annotationsQA_R(annotations_jsonpath, split):
""" """Build an index out of FOIL annotations, mapping each image ID with its corresponding captions."""
Build an index out of FOIL annotations, mapping each image ID with its corresponding captions.
"""
entries = [] entries = []
with open(annotations_jsonpath, 'rb') as f: with open(annotations_jsonpath, 'rb') as f:
for annotation in json_lines.reader(f): for annotation in json_lines.reader(f):
...@@ -117,7 +111,7 @@ def _load_annotationsQA_R(annotations_jsonpath, split): ...@@ -117,7 +111,7 @@ def _load_annotationsQA_R(annotations_jsonpath, split):
class VCRDataReader(object): class VCRDataReader(object):
""" """
Data reader for sub VCR task data reader task for vcr
""" """
def __init__(self, def __init__(self,
task_conf, task_conf,
...@@ -193,7 +187,7 @@ class VCRDataReader(object): ...@@ -193,7 +187,7 @@ class VCRDataReader(object):
def generate_random_name(self, det_names): def generate_random_name(self, det_names):
""" """
Replace "person" with a random name replace "person" with a random name
""" """
random_name = [] random_name = []
for name in det_names: for name in det_names:
...@@ -207,7 +201,7 @@ class VCRDataReader(object): ...@@ -207,7 +201,7 @@ class VCRDataReader(object):
def replace_det_with_name(self, inputs, random_names): def replace_det_with_name(self, inputs, random_names):
""" """
Replace det with name replace det with name
""" """
tokens = [] tokens = []
mask = [] mask = []
...@@ -224,7 +218,7 @@ class VCRDataReader(object): ...@@ -224,7 +218,7 @@ class VCRDataReader(object):
def _truncate_seq_pair(self, tokens_a, tokens_b, max_length): def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
""" """
Truncates a sequence pair in place to the maximum length. Truncates a sequence pair in place to the maximum length.
""" """
while True: while True:
total_length = len(tokens_a) + len(tokens_b) total_length = len(tokens_a) + len(tokens_b)
...@@ -237,7 +231,7 @@ class VCRDataReader(object): ...@@ -237,7 +231,7 @@ class VCRDataReader(object):
def get_progress(self): def get_progress(self):
""" """
Return current progress of traning data return current progress of traning data
""" """
progress_dict = {"current_epoch": self.current_epoch, progress_dict = {"current_epoch": self.current_epoch,
"current_file_index": self.current_file_index, "current_file_index": self.current_file_index,
...@@ -248,7 +242,7 @@ class VCRDataReader(object): ...@@ -248,7 +242,7 @@ class VCRDataReader(object):
def tokenize(self): def tokenize(self):
""" """
Tokenizes the captions. Tokenizes the captions.
""" """
# This will add caption_tokens in each entry of the dataset. # This will add caption_tokens in each entry of the dataset.
# -1 represents nil, and should be treated as padding_idx in embedding. # -1 represents nil, and should be treated as padding_idx in embedding.
...@@ -312,7 +306,7 @@ class VCRDataReader(object): ...@@ -312,7 +306,7 @@ class VCRDataReader(object):
def parse_line(self, s_index): def parse_line(self, s_index):
""" """
Form slot info with the line information form the slot info from line
""" """
entry = self._entries[s_index] entry = self._entries[s_index]
image_id = entry["img_id"] image_id = entry["img_id"]
...@@ -367,13 +361,11 @@ class VCRDataReader(object): ...@@ -367,13 +361,11 @@ class VCRDataReader(object):
return record return record
def data_generator(self): def data_generator(self):
""" """ data_generator """
Data_generator
"""
sample_indice = range(len(self._entries)) sample_indice = range(len(self._entries))
def wrapper(): def wrapper():
""" """
Wrapper wrapper
""" """
for epoch_index in range(self.epoch): for epoch_index in range(self.epoch):
if self._split == "train": if self._split == "train":
...@@ -402,9 +394,7 @@ class VCRDataReader(object): ...@@ -402,9 +394,7 @@ class VCRDataReader(object):
class VCRDataJointReader(object): class VCRDataJointReader(object):
""" """ Joint data reader for Q2A task and QA2R task"""
Joint data reader for Q2A task and QA2R task
"""
def __init__(self, def __init__(self,
task_conf_group, task_conf_group,
split, split,
...@@ -435,8 +425,7 @@ class VCRDataJointReader(object): ...@@ -435,8 +425,7 @@ class VCRDataJointReader(object):
self.task_generators = [reader.data_generator() for reader in self.task_readers] self.task_generators = [reader.data_generator() for reader in self.task_readers]
def get_progress(self): def get_progress(self):
""" """return current progress of traning data
Return current progress of traning data
""" """
current_epoch = max([reader.current_epoch for reader in self.task_readers]) current_epoch = max([reader.current_epoch for reader in self.task_readers])
current_file_index = max([reader.current_file_index for reader in self.task_readers]) current_file_index = max([reader.current_file_index for reader in self.task_readers])
...@@ -450,9 +439,7 @@ class VCRDataJointReader(object): ...@@ -450,9 +439,7 @@ class VCRDataJointReader(object):
return self.progress_dict return self.progress_dict
def data_generator(self): def data_generator(self):
""" """ data_generator """
Data_generator
"""
def wrapper(): def wrapper():
""" """
warpper warpper
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" VQA Data Reader implementation """
from __future__ import print_function
from __future__ import division
import os
import base64
import functools
import numpy as np
import types
import gzip
import logging
import re
import six
import collections
import random
import paddle
import paddle.fluid as fluid
from batching.finetune_batching import prepare_vqa_batch_data
from preprocess import preprocessor
class VQADataReader(object):
"""
Data reader for the VQA task
"""
def __init__(self,
task_group,
split,
vocab_path,
batch_size=4096,
num_class=3129,
max_seq_len=512,
shuffle_files=True,
epoch=100,
voc_size=0,
cls_size=0,
is_test=False):
self.vocab = self.load_vocab(vocab_path)
self.task_group = task_group
self.processor = getattr(preprocessor, task_group[0]["Proprocessor"])(
tokenizer_name =self.task_group[0]["tokenizer_name"],
vocab_path = vocab_path)
self.batch_size = batch_size
self.shuffle_files = shuffle_files
self.epoch = epoch
self.current_epoch = 0
self.current_file_index = 0
self.total_file = 0
self.num_class=num_class
self.current_file = None
self.voc_size = voc_size
self.cls_size = cls_size
self.max_seq_len = max_seq_len
self.pad_id = self.vocab["[PAD]"]
self.cls_id = self.vocab["[CLS]"]
self.sep_id = self.vocab["[SEP]"]
self.mask_id = self.vocab["[MASK]"]
self.is_test = is_test
self.split = split
if self.is_test:
self.epoch = 1
self.shuffle_files = False
self.vg_init_epochs = 0
else:
self.vg_init_epochs = int(self.task_group[0]["vg_init_epochs"])
def get_progress(self):
"""
Return current progress of training data
"""
self.progress_dict = {"current_epoch": self.current_epoch,
"current_file_index": self.current_file_index,
"total_file": self.total_file,
"current_file": self.current_file
}
return self.progress_dict
def process_vl(self, line, max_seq_len):
"""
Transform the original text and image fields into the model inputs
"""
def decode_feature(base64_str, size):
"""
decode feature from base64
"""
fea_base64 = base64.b64decode(base64_str)
fea_decode = np.frombuffer(fea_base64, dtype=np.float32)
shape = size, int(fea_decode.shape[0] / size)
features = np.resize(fea_decode, shape)
return features
question_id, text, match_label, score, image_w, image_h, number_box, \
image_loc, image_embeddings = line
token_ids = []
raw_ids = self.processor.convert_sentence_to_ids_without_cls(text)
token_ids.append(self.vocab["[CLS]"])
token_ids.extend(raw_ids)
token_ids.append(self.vocab["[SEP]"])
sent_ids = [0] * len(token_ids)
pos_ids = range(0, len(token_ids))
token_ids = [int(token) for token in token_ids]
sent_ids = [int(token) for token in sent_ids]
pos_ids = [int(token) for token in pos_ids]
if len(token_ids) > self.max_seq_len:
token_ids = token_ids[0: self.max_seq_len - 1] + [token_ids[-1]]
sent_ids = sent_ids[0: self.max_seq_len - 1] + [sent_ids[-1]]
pos_ids = pos_ids[0: self.max_seq_len]
labels = [int(label_tok) for label_tok in match_label.split("|")]
scores = [float(score_tok) for score_tok in score.split("|")]
number_box = int(number_box)
question_id = int(question_id)
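# Normalize box coordinates by the image size, append the relative box area as a
# fifth location feature, and prepend a whole-image box with the mean region
# feature as the global image token.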
image_loc = decode_feature(image_loc, number_box)
shape_np = np.repeat(np.array(\
[[float(image_w), float(image_h), float(image_w), float(image_h)]]), number_box, axis=0)
boxes_np = image_loc / shape_np
area = (boxes_np[:, 3] - boxes_np[:, 1]) * (boxes_np[:, 2] - boxes_np[:, 0])
image_loc = np.concatenate((boxes_np, np.expand_dims(area, 1)), axis = 1)
loc_cls = np.array([[0.0, 0.0, 1.0, 1.0, 1.0]], dtype = "float32")
image_loc = np.concatenate([loc_cls, image_loc], 0)
try:
image_embeddings = decode_feature(image_embeddings, number_box)
image_embeddings_cls = np.mean(image_embeddings, axis = 0, keepdims = True)
image_embeddings = np.concatenate([image_embeddings_cls, image_embeddings], 0)
self.default_image_emb = image_embeddings
except:
print("error data occur, a random default image emb will be assin to this one")
print("the wrong line occur")
image_embeddings = self.default_image_emb
weight_labels = self.get_weight_label(self.num_class, labels, scores)
sample_json = {
"question_id": question_id,
"token_ids": token_ids,
"sent_ids": sent_ids,
"pos_ids": pos_ids,
"weight_labels": weight_labels,
"image_loc": image_loc,
"image_embeddings": image_embeddings,
}
return sample_json
def get_weight_label(self, num_class, labels, scores):
"""assign the corresponding score for the labels
Input: labels (Indefinite length list, like [1, 2, 3])
scores (Indefinite length list, like [0.1, 0.2, 0.3])
Output: weight_score (list, length equals num_class)
"""
assert len(labels) == len(scores), \
"unequals length with labels has %d number(s) while scores has %d number(s)!" % (len(labels), len(scores))
weight_score = [0] * num_class
for i in range(len(labels)):
weight_score[labels[i]] = scores[i]
return weight_score
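# Example with hypothetical values: num_class=4, labels=[1, 3], scores=[0.9, 0.3]
# produces the soft target vector [0, 0.9, 0, 0.3].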
def parse_line(self, line, max_seq_len=512, task_index=None):
""" parse one line to token_ids, sentence_ids, pos_ids, label """
line = line.strip().split("\t")
sample_json = self.process_vl(line, max_seq_len)
return sample_json
def read_file(self, file, task_index):
""" read line data from a file """
with open(file, "rb") as f:
lines = f.readlines()
if not self.is_test:
np.random.shuffle(lines)
for line in lines:
yield line
def convert_to_unicode(self, text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(self, vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
fin = open(vocab_file)
for num, line in enumerate(fin):
items = self.convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
index = items[1] if len(items) == 2 else num
token = token.strip()
vocab[token] = int(index)
return vocab
def data_generator(self):
""" data_generator """
filelist_key = "train_filelist"
if self.is_test:
if self.split == "val":
filelist_key = "val_filelist"
elif self.split == "test_dev":
filelist_key = "test_dev_filelist"
elif self.split == "test_std":
filelist_key = "test_std_filelist"
else:
print("*************no split named as :", self.split, "********************")
return None
all_files = []
task_probs = []
sum = 0.0
for task in self.task_group:
all_files.append(open(task[filelist_key]).readlines())
task_probs.append(task["prob"])
sum += task["prob"]
for i in xrange(len(task_probs)):
task_probs[i] = task_probs[i] / sum
task_probs = np.array(task_probs).ravel()
def wrapper():
"""
wrapper
"""
def reader(task_index):
"""
reader
"""
files = all_files[task_index]
global_rng = np.random.RandomState(0)
for epoch in range(self.epoch):
if epoch < self.vg_init_epochs:
files = open(task["vg_train_filelist"]).readlines() + all_files[task_index]
if self.shuffle_files:
global_rng.shuffle(files)
for index, file in enumerate(files):
file = file.strip()
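# Shard the file list across distributed trainers: each trainer only reads the
# files whose index matches its PADDLE_TRAINER_ID modulo PADDLE_TRAINERS_NUM.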
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
try:
trainers_num = int(os.getenv("PADDLE_TRAINERS_NUM"))
except:
print("can not get env PADDLE_TRAINERS_NUM, set trainer_nums to 1")
trainers_num = 1
if index % trainers_num != trainer_id:
continue
sample_generator = paddle.reader.xmap_readers(self.parse_line, \
functools.partial(self.read_file, file=file, task_index=task_index), 4, 200)
for sample in sample_generator():
self.current_epoch = epoch + 1
self.current_file_index = index + 1
self.current_file = file
self.total_file = len(files)
if sample is None:
continue
yield sample
def batch_reader(reader, batch_size):
"""
Batch data reader
"""
batch, total_token_num, max_len = [], 0, 0
cur_size = 0
dev_count = 1
buff = []
readers = []
for i in xrange(len(task_probs)):
buff.append(None)
readers.append(reader(i))
task_indices = range(len(task_probs))
end_times = 0
while end_times < 50:
task_index = np.random.choice(task_indices, p=task_probs)
dev_num = 0
cur_reader = readers[task_index]
while dev_num < dev_count:
if buff[task_index] is not None:
cur_len = len(buff[task_index]["token_ids"])
max_len = max(max_len, cur_len)
batch.append(buff[task_index])
total_token_num += cur_len
buff[task_index] = None
cur_size += 1
parsed_line = next(cur_reader, None)
if parsed_line is None:
end_times += 1
dev_num += 1
if len(batch) > 0:
yield batch, total_token_num, task_index
batch, total_token_num, max_len = [], 0, 0
continue
end_times = 0
cur_len = len(parsed_line["token_ids"])
max_len = max(max_len, cur_len)
if cur_size >= batch_size:
yield batch, total_token_num, task_index
batch, total_token_num, max_len = [], 0, 0
cur_size = 0
dev_num += 1
buff[task_index] = parsed_line
else:
batch.append(parsed_line)
cur_size += 1
total_token_num += cur_len
for batch_data, total_token_num, task_index in batch_reader(reader, self.batch_size):
yield prepare_vqa_batch_data(
batch_data,
total_token_num,
task_index,
len(self.task_group),
voc_size=self.voc_size,
pad_id=self.pad_id,
cls_id=self.cls_id,
sep_id=self.sep_id,
mask_id=self.mask_id,
return_input_mask=True,
return_max_len=False,
return_num_token=False)
return wrapper
if __name__ == "__main__":
pass
...@@ -13,6 +13,8 @@ source $CONF_FILE ...@@ -13,6 +13,8 @@ source $CONF_FILE
#configure your cuda and cudnn #configure your cuda and cudnn
#configure nccl #configure nccl
#export LD_LIBRARY_PATH=/home/work/cuda-9.0/lib64:/home/work/cudnn/cudnn_v7/cuda/lib64:$LD_LIBRARY_PATH
#export LD_LIBRARY_PATH=./nccl_2.3.5/lib/:$LD_LIBRARY_PATH
export FLAGS_fast_eager_deletion_mode=1 export FLAGS_fast_eager_deletion_mode=1
export FLAGS_eager_delete_tensor_gb=0.0 export FLAGS_eager_delete_tensor_gb=0.0
...@@ -44,6 +46,10 @@ python finetune.py --use_cuda "True" \ ...@@ -44,6 +46,10 @@ python finetune.py --use_cuda "True" \
--lr_scheduler ${lr_scheduler} \ --lr_scheduler ${lr_scheduler} \
--decay_steps ${decay_steps-""} \ --decay_steps ${decay_steps-""} \
--lr_decay_ratio ${lr_decay_ratio-0.1} \ --lr_decay_ratio ${lr_decay_ratio-0.1} \
--layer_decay_rate ${layer_decay_rate-0.0} \
--text_init_layers ${text_init_layers-18} \
--n_layers ${n_layers-30} \
--margin ${margin-0.3} \
--num_train_steps ${num_train_steps} \ --num_train_steps ${num_train_steps} \
--checkpoints $output_model_path \ --checkpoints $output_model_path \
--save_steps ${SAVE_STEPS} \ --save_steps ${SAVE_STEPS} \
...@@ -53,7 +59,6 @@ python finetune.py --use_cuda "True" \ ...@@ -53,7 +59,6 @@ python finetune.py --use_cuda "True" \
--warmup_steps ${WARMUP_STEPS} \ --warmup_steps ${WARMUP_STEPS} \
--weight_decay ${WEIGHT_DECAY:-0} \ --weight_decay ${WEIGHT_DECAY:-0} \
--max_seq_len ${MAX_LEN} \ --max_seq_len ${MAX_LEN} \
--validation_steps ${VALID_STEPS} \
--skip_steps 10 --skip_steps 10
...@@ -13,6 +13,8 @@ RES_FILE=$8 ...@@ -13,6 +13,8 @@ RES_FILE=$8
source $CONF_FILE source $CONF_FILE
#export LD_LIBRARY_PATH=/home/work/cuda-9.0/lib64:/home/work/cudnn/cudnn_v7/cuda/lib64:$LD_LIBRARY_PATH
#export LD_LIBRARY_PATH=./nccl_2.3.5/lib/:$LD_LIBRARY_PATH
#configure your cuda and cudnn #configure your cuda and cudnn
#configure nccl #configure nccl
......
from __future__ import print_function
import sys
ans_dict = {}
text_ans_dict = {}
filename = './data/flickr/flickr.dev.data'
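# Ground-truth file: the first two tab-separated fields of each line are image_id
# and sent_id; any remaining fields are ignored here.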
with open(filename) as f:
for line in f:
line = line.strip().split('\t')
image_id, sent_id = line[0], line[1]
ans_dict[sent_id.strip(' ')] = image_id.strip(' ')
text_ans_dict.setdefault(image_id.strip(' '), [])
text_ans_dict[image_id.strip(' ')].append(sent_id.strip(' '))
if len(sys.argv) > 1:
res_file = sys.argv[1]
else:
res_file = "./result"
print ('=============== IMAGE RETRIEVAL ==================')
with open(res_file) as f:
r1, r5, r10 = 0, 0, 0
cnt = 0
res_dict = {}
text_res_dict = {}
idx_all = 0.0
for line in f:
line = line.strip().split('\t')
if len(line) != 3:
break
score, image_id, sent_id = float(line[0]), line[1], line[2]
res_dict.setdefault(sent_id, [])
res_dict[sent_id].append((score, image_id))
text_res_dict.setdefault(image_id, [])
text_res_dict[image_id].append((score, sent_id))
if len(res_dict[sent_id]) == 1000:
res_list = res_dict[sent_id]
res_list = sorted(res_list, reverse = True)
ans = ans_dict[sent_id]
image_id_sort = [item[1] for item in res_list]
ans_idx = image_id_sort.index(ans.strip())
if ans_idx < 1:
r1 += 1.0
if ans_idx < 5:
r5 += 1.0
if ans_idx < 10:
r10 += 1.0
idx_all += (ans_idx + 1)
cnt += 1
if cnt % 100 == 0:
print(cnt, round(r1/cnt, 4), round(r5/cnt, 4), round(r10/cnt, 4), round(idx_all/cnt, 4))
print('-----------------------------')
print("instance %d r1:%.4f, r5:%.4f, r10:%.4f, avg_rank:%.4f" % (cnt, r1/cnt, r5/cnt, r10/cnt, idx_all/cnt))
print ('\n=============== TEXT RETRIEVAL ==================')
cnt = 0
r1, r5, r10 = 0, 0, 0
idx_all = 0.0
for image_id in text_res_dict:
res_list = text_res_dict[image_id]
res_list = sorted(res_list, reverse = True)
ans = text_ans_dict[image_id]
text_id_sort = [item[1] for item in res_list]
ans_idx_all = []
for item in ans:
ans_idx_all.append(text_id_sort.index(item.strip()))
ans_idx = min(ans_idx_all)
if ans_idx < 1:
r1 += 1.0
if ans_idx < 5:
r5 += 1.0
if ans_idx < 10:
r10 += 1.0
idx_all += (ans_idx + 1)
cnt += 1
if cnt % 500 == 0:
print(cnt, round(r1/cnt, 4), round(r5/cnt, 4), round(r10/cnt, 4), round(idx_all/cnt, 4))
print('-----------------------------')
print("instance %d r1:%.4f, r5:%.4f, r10:%.4f, avg_rank:%.4f" % (cnt, r1/cnt, r5/cnt, r10/cnt, idx_all/cnt))
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""parameters init function implementations"""
from __future__ import print_function
import os
import six
import numpy as np
import paddle.fluid as fluid
def circle_loss(sp, sn, m, scale_circle):
"""
sp: scores of positive samples, shape [B, L]
sn: scores of negative samples, shape [B, K]
m: relaxation factor (margin) in the circle loss
scale_circle: scale factor in the circle loss
return: circle loss value, shape [1]
"""
op = 1. + m
on = 0. - m
delta_p = 1 - m
delta_n = m
ap = fluid.layers.relu(op - sp)
ap.stop_gradient = True
an = fluid.layers.relu(sn - on)
an.stop_gradient = True
logit_p = ap * (sp - delta_p)
logit_p = -1. * scale_circle * logit_p
logit_p = fluid.layers.cast(x=logit_p, dtype=np.float64)
loss_p = fluid.layers.reduce_sum(fluid.layers.exp(logit_p), dim=1, keep_dim=False)
logit_n = an * (sn - delta_n)
logit_n = scale_circle * logit_n
logit_n = fluid.layers.cast(x=logit_n, dtype=np.float64)
loss_n = fluid.layers.reduce_sum(fluid.layers.exp(logit_n), dim=1, keep_dim=False)
circle_loss = fluid.layers.log(1 + loss_n * loss_p)
circle_loss = fluid.layers.cast(x=circle_loss, dtype=np.float32)
return fluid.layers.mean(circle_loss)
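# Minimal usage sketch (illustrative only; tensor names and hyper-parameter values
# below are assumptions, not the repository's actual settings): given similarity
# scores `pos_scores` of shape [B, L] and `neg_scores` of shape [B, K] computed
# elsewhere in the fluid program, a retrieval loss could be formed as
#     loss = circle_loss(pos_scores, neg_scores, m=0.2, scale_circle=32.0)
# where m and scale_circle are the relaxation margin and scale factor; common
# Circle Loss choices are m around 0.2-0.3 and a scale of a few tens.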