Unverified commit f6628ee8, authored by T tangjiji, committed by GitHub

Repro (#628)

* add other tasks

* Update README_zh.md

* Update README.md

* Update README.md
Parent 69a9e2fa
...@@ -5,6 +5,11 @@ English| [简体中文](./README_zh.md)
- [Pre-trained models](#pre-trained-models)
- [Downstream tasks](#downstream-tasks)
* [VCR](#VCR)
* [VQA](#VQA)
* [IR&TR](#Retrieval)
* [RefCOCO+](#RefCOCO+)
- [Usage](#usage)
* [Install PaddlePaddle](#install-paddlepaddle)
* [Fine-tuning on ERNIE-ViL](#fine-tuning-on-ernie-vil)
...@@ -43,11 +48,15 @@ Based on the scene graph parsed from the text using Scene Graph Parser, we const
## Pre-trained Models
ERNIE-ViL adopts large-scale image-text aligned datasets as the pre-training data. We provide ERNIE-ViL models of two scale settings, pretrained on two out-of-domain datasets, i.e., [**Conceptual Captions**](https://www.aclweb.org/anthology/P18-1238.pdf) and [**SBU Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio).
- [**ERNIE-ViL _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-en.1.tar.gz) (_lowercased | 12-text-stream-layer, 6-visual-stream-layer_)
- [**ERNIE-ViL _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)

We also provide a large-scale model pretrained on both the out-of-domain datasets ([**Conceptual Captions**](https://www.aclweb.org/anthology/P18-1238.pdf), [**SBU Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio)) and the in-domain datasets ([**MS-COCO**](https://arxiv.org/abs/1405.0312), [**Visual-Genome**](https://arxiv.org/abs/1602.07332)).
- [**ERNIE-ViL-Out&in-domain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-all-domain-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)
## Downstream tasks
We finetune ERNIE-ViL on five vision-language downstream tasks, i.e., Visual Commonsense Reasoning ([**VCR**](https://openaccess.thecvf.com/content_CVPR_2019/papers/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.pdf)),
Visual Question Answering ([**VQA**](https://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf)),
...@@ -59,7 +68,7 @@ _Code and pre-trained models related to VCR task are made public now, and those
### VCR
* datasets
* The training, validation and testing data of the **VCR** task are provided by the [**VCR Website**](https://visualcommonsense.com/download/).
* The organization of visual features is adapted from [**ViLBERT**](https://github.com/jiasenlu/vilbert_beta); we directly use its data, which can be downloaded [here](https://github.com/jiasenlu/vilbert_beta/tree/master/data).
* Put all downloaded files under the directory "data/vcr".
...@@ -67,19 +76,79 @@ _Code and pre-trained models related to VCR task are made public now, and those
* Task pre-training: We perform task pre-training on the VCR task, which is also known as task-specific pre-training. The trained models are as follows:
* [**ERNIE-ViL-VCR-task-pretrain _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-VCR-task-pre-en.1.tar.gz)
* [**ERNIE-ViL-VCR-task-pretrain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-VCR-task-pre-en.1.tar.gz)
* Performance: Results of the VCR task for different scale settings of the ERNIE-ViL model

| Models                             | <strong>Q->A</strong>        | <strong>QA->R</strong>       | <strong>Q->AR</strong>        |
| :--------------------------------- | :--------------------------: | :--------------------------: | :---------------------------: |
| ERNIE-ViL (task-pretrain) _base_   | 76.37(77.0)                  | 79.65(80.3)                  | 61.24(62.1)                   |
| ERNIE-ViL (task-pretrain) _large_  | <strong>78.52(79.2)</strong> | <strong>83.37(83.5)</strong> | <strong>65.81(66.3)</strong>  |

_Numerical results outside and inside parentheses represent the dev and test performance of the VCR task, respectively.
Test results are obtained from the [**VCR leaderboard**](https://visualcommonsense.com/leaderboard/)._
### VQA
* datasets
* The training, validation and testing data of the **VQA** task are provided by the [**VQA Website**](https://visualqa.org/).
* Visual features are extracted with the tools in [bottom-up attention](https://github.com/jiasenlu/bottom-up-attention). The number of extracted boxes is fixed at 100 (the minimum and maximum are both 100).
* A single training & test sample is organized as follows:
```script
question_id, question, answer_label, answer_score, image_w, image_h, number_box, image_loc, image_embeddings
```
_The labels and scores of multiple answers are separated by the character '|' (see the parsing sketch below)._
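To make the record layout concrete, the sketch below parses one such record into a question id and a soft answer-target vector. It is only an illustration: the tab separator and the helper name are assumptions, and the 3129-way answer vocabulary follows the `num_class` setting in conf/vqa/task_vqa.json.

```python
import numpy as np

NUM_ANSWER_CLASSES = 3129  # "num_class" in conf/vqa/task_vqa.json

def parse_vqa_record(line):
    """Turn one tab-separated VQA record into (question_id, question, soft target vector)."""
    (question_id, question, answer_label, answer_score,
     image_w, image_h, number_box, image_loc, image_embeddings) = line.rstrip("\n").split("\t")
    target = np.zeros(NUM_ANSWER_CLASSES, dtype="float32")
    if answer_label:
        labels = [int(x) for x in answer_label.split("|")]
        scores = [float(x) for x in answer_score.split("|")]
        for label, score in zip(labels, scores):
            target[label] = score  # soft label, e.g. 0.3/0.6/0.9/1.0 agreement scores
    return int(question_id), question, target
```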
* Performance: Results of the **VQA** task for different scale settings of the ERNIE-ViL model

| Models                            | <strong>test-dev</strong> | <strong>test-std</strong> |
| :-------------------------------- | :-----------------------: | :-----------------------: |
| ERNIE-ViL _base_                  | 73.18                     | 73.36                     |
| ERNIE-ViL _large_                 | 73.78                     | 73.96                     |
| ERNIE-ViL-Out&in-domain _large_   | 74.95                     | 75.10                     |
### IR&TR
* datasets
* The images and captions of the Flickr30k dataset can be obtained from [**here**](https://www.kaggle.com/hsankesara/flickr-image-dataset).
* Visual features are extracted with the tools in [bottom-up attention](https://github.com/jiasenlu/bottom-up-attention). The number of extracted boxes ranges from 0 to 36. The organization of the visual features is illustrated as follows (see the decoding sketch after this list):
```script
image_w, image_h, number_box, image_loc, image_embeddings
```
* For the organization of the text data, refer to the sample we provide under data/flickr, e.g., flickr.dev.data.
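For reference, the sketch below shows one way such an image record could be decoded; it mirrors the decode_all logic of the Flickr data reader added in this commit (base64-encoded float32 buffers, boxes normalized by image width/height plus a relative-area term). The helper names and the 2048-dim RoI feature assumption (the default feature_size) are illustrative only.

```python
import base64
import numpy as np

def decode_feature(b64_str, size):
    """Decode a base64-encoded float32 buffer into a [size, -1] array."""
    buf = np.frombuffer(base64.b64decode(b64_str), dtype=np.float32)
    return buf.reshape(int(size), -1)

def decode_image_record(image_w, image_h, number_box, image_loc, image_embeddings):
    n = int(number_box)
    features = decode_feature(image_embeddings, n)            # [n, feature_size], 2048 by default
    boxes = decode_feature(image_loc, n)                      # [n, 4] pixel coordinates (x1, y1, x2, y2)
    w, h = float(image_w), float(image_h)
    boxes = boxes / np.array([w, h, w, h], dtype=np.float32)  # normalize to [0, 1]
    area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
    loc = np.concatenate([boxes, area[:, None]], axis=1)      # 5-dim location per box
    return features, loc
```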
* Performance
* Results of the **Image Retrieval** task on the **Flickr30k dataset** for different scale settings of the ERNIE-ViL model

| Models                            | <strong>R@1</strong>  | <strong>R@5</strong>  | <strong>R@10</strong> |
| :-------------------------------- | :-------------------: | :-------------------: | :-------------------: |
| ERNIE-ViL _base_                  | 74.44                 | 92.72                 | 95.94                 |
| ERNIE-ViL _large_                 | 75.10                 | 93.42                 | 96.26                 |
| ERNIE-ViL-Out&in-domain _large_   | 76.66                 | 94.16                 | 96.76                 |

* Results of the **Text Retrieval** task on the **Flickr30k dataset** for different scale settings of the ERNIE-ViL model

| Models                            | <strong>R@1</strong>  | <strong>R@5</strong>  | <strong>R@10</strong> |
| :-------------------------------- | :-------------------: | :-------------------: | :-------------------: |
| ERNIE-ViL _base_                  | 86.70                 | 97.80                 | 99.00                 |
| ERNIE-ViL _large_                 | 88.70                 | 97.30                 | 99.10                 |
| ERNIE-ViL-Out&in-domain _large_   | 89.20                 | 98.50                 | 99.20                 |
### RefCOCO+
* datasets
* The organization of visual features is adapted from [MAttNet](https://github.com/lichengunc/MAttNet).
* A single training & test sample is organized as follows (a scoring sketch is given after this list):
```script
expressions, image_w, image_h, number_box, number_boxes_gt, image_loc, image_embeddings, box_label, label
```
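As a rough illustration of how these fields could be used for evaluation, the sketch below computes accuracy from per-box model scores, assuming box_label marks the candidate boxes that match the referred region (e.g., IoU above 0.5 with the ground truth); this is a hypothetical helper, not the evaluation script shipped with the repo.

```python
import numpy as np

def refcoco_accuracy(scores, box_labels):
    """scores, box_labels: [num_examples, num_boxes]; box_labels is 1 for boxes matching the target."""
    pred = np.argmax(scores, axis=1)                   # highest-scoring candidate box per expression
    hits = box_labels[np.arange(len(pred)), pred]      # 1 if the chosen box matches the referred region
    return float(hits.mean())
```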
* Performance
* Results of the **RefCOCO+** task for different scale settings of the ERNIE-ViL model

| Models                            | <strong>val</strong>  | <strong>testA</strong> | <strong>testB</strong> |
| :-------------------------------- | :-------------------: | :--------------------: | :--------------------: |
| ERNIE-ViL _base_                  | 74.02                 | 80.33                  | 64.74                  |
| ERNIE-ViL _large_                 | 74.24                 | 80.97                  | 64.70                  |
| ERNIE-ViL-Out&in-domain _large_   | 75.89                 | 82.39                  | 66.91                  |
## Usage
...@@ -92,32 +161,61 @@ This code has been tested with Paddle Fluid 1.8 with Python 2.7. Other dependenc
### Fine-tuning on ERNIE-ViL
Please add the CUDA, cuDNN and NCCL2 dynamic library paths to LD_LIBRARY_PATH before fine-tuning. You can easily run fine-tuning through
configuration files. You can fine-tune the ERNIE-ViL model on different downstream tasks with the following command:
```script
sh run_finetuning.sh $task_name(vqa/flickr/refcoco_plus/vcr) conf/${task_name}/model_conf_${task_name} $vocab_file $ernie_vil_config $pretrain_models_params
```
Files needed for fine-tuning can be found in the download links given above, including the vocabulary dictionary, configuration
files and pre-trained parameters. Training details for the different downstream tasks (large scale) are summarized in the table below.

| Tasks    | Batch Size | Learning Rate | # of Epochs | GPUs    | Layer Decay rate | Hidden dropout |
| -------- | ----------:| -------------:| -----------:| -------:| ----------------:| --------------:|
| VCR      | 16(x4)     | 1e-4          | 6           | 4x V100 | 0.9              | 0.1            |
| VQA 2.0  | 64(x4)     | 1e-4          | 15          | 4x V100 | 0.9              | 0.1            |
| RefCOCO+ | 64(x2)     | 1e-4          | 30          | 2x V100 | 0.9              | 0.2            |
| Flickr   | 8(x8)      | 2e-5          | 40          | 8x V100 | 0.0              | 0.1            |

Our fine-tuning experiments on the downstream tasks are carried out on NVIDIA V100 (32GB) GPUs.
If your GPU memory is not enough, you can reduce the batch size in the corresponding configuration file, e.g., "conf/vcr/model_conf_vcr".
### Inference
You can use the following commands to run inference with the fine-tuned models.

#### VCR
```script
Task Q->A:  sh run_inference.sh vcr qa $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```
```script
Task Q->AR: sh run_inference.sh vcr qar $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```

#### VQA
```script
sh run_inference.sh vqa eval $split(val/test_dev/test_std) conf/vqa/model_conf_vqa $vocab_file $ernie_vil_config $model_params $res_file
```
_No test labels are given in the released test samples; you can obtain the final score by submitting the result file to the [VQA website](https://visualqa.org/)._

#### RefCOCO+
```script
sh run_inference.sh refcoco_plus eval $split(val/test_A/test_B) conf/refcoco_plus/model_conf_refcoco_plus $vocab_file $ernie_vil_config $model_params $res_file
```

#### Flickr
```script
sh run_inference.sh flickr eval $split(dev/test) conf/flickr/model_conf_flickr $vocab_file $ernie_vil_config $model_params $res_file
```
_Compute the final recall scores with the provided tool tools/get_recall.py._
## Citation
......
...@@ -5,7 +5,10 @@
- [Model Framework](#模型框架)
- [Pre-trained Models](#预训练模型)
- [Downstream Tasks](#下游任务)
* [Visual Commonsense Reasoning](#视觉常识推理)
* [Visual Question Answering](#视觉问答)
* [Cross-modal Retrieval](#跨模态检索)
* [Referring Expression Comprehension](#引用表达式理解)
- [Usage](#使用说明)
* [Install PaddlePaddle](#安装飞桨)
* [Run Fine-tuning](#运行微调)
...@@ -41,45 +44,110 @@ Structure of the ERNIE-ViL scene graph pre-training tasks
## Pre-trained Models
ERNIE-ViL uses large-scale image-text aligned data for pre-training. Based on the two out-of-domain datasets [**Conceptual
Captions**](https://www.aclweb.org/anthology/P18-1238.pdf) and [**SBU
Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio), we trained models of two parameter scales:
- [**ERNIE-ViL _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-en.1.tar.gz) (_lowercased | 12-text-stream-layer, 6-visual-stream-layer_)
- [**ERNIE-ViL _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)

Based on the two out-of-domain datasets ([**Conceptual
Captions**](https://www.aclweb.org/anthology/P18-1238.pdf), [**SBU
Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio)) together with two in-domain datasets ([**MS-COCO**](https://arxiv.org/abs/1405.0312), [**Visual-Genome**](https://arxiv.org/abs/1602.07332)), we also trained a large-scale model:
- [**ERNIE-ViL-Out&in-domain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-all-domain-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)
## Downstream Tasks
ERNIE-ViL has been evaluated on five vision-language downstream tasks: [**Visual Commonsense Reasoning**](https://openaccess.thecvf.com/content_CVPR_2019/papers/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.pdf),
[**Visual Question Answering**](https://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf),
[**Cross-modal Image Retrieval**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166),
[**Cross-modal Text Retrieval**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166) and
[**Referring Expression Comprehension**](https://www.aclweb.org/anthology/D14-1086.pdf); for comparisons with mainstream models, please refer to the published paper.

### **Visual Commonsense Reasoning**
* Datasets
* The training, validation and test data can be obtained from the [**VCR website**](http://visualcommonsense.com/download/);
* The organization of visual features follows [**ViLBERT**](https://github.com/jiasenlu/vilbert_beta), so this project directly uses the data from **ViLBERT**, which can be downloaded [here](https://github.com/jiasenlu/vilbert_beta/tree/master/data);
* Put all downloaded files under the directory data/vcr;

* Task pre-training: starting from the out-of-domain ERNIE-ViL models, we perform task pre-training on the VCR task and obtain the following models:
* [**ERNIE-ViL-VCR-task-pretrain _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-VCR-task-pre-en.1.tar.gz)
* [**ERNIE-ViL-VCR-task-pretrain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-VCR-task-pre-en.1.tar.gz)

* Performance: results of ERNIE-ViL on the VCR task are as follows:

| Models                             | <strong>Q->A</strong>        | <strong>QA->R</strong>       | <strong>Q->AR</strong>        |
| :--------------------------------- | :--------------------------: | :--------------------------: | :---------------------------: |
| ERNIE-ViL (task-pretrain) _base_   | 76.37(77.0)                  | 79.65(80.3)                  | 61.24(62.1)                   |
| ERNIE-ViL (task-pretrain) _large_  | <strong>78.52(79.2)</strong> | <strong>83.37(83.5)</strong> | <strong>65.81(66.3)</strong>  |

_Note: numbers outside the parentheses are validation-set results and numbers inside are test-set results; the test-set results are obtained by submitting to the [VCR leaderboard](https://visualcommonsense.com/leaderboard/)._
### **Visual Question Answering**
* Datasets
* The original images, questions and answers can be obtained from the [**VQA website**](https://visualqa.org/).
* Visual features are extracted with the tools in [**bottom-up attention**](https://github.com/jiasenlu/bottom-up-attention); the number of extracted boxes is fixed at 100.
* Training & test data are organized as follows:
```script
question_id, question, answer_label, answer_score, image_w, image_h, number_box, image_loc, image_embeddings
```
_The labels and scores of multiple answers are separated by '|'; all image-related fields can be extracted with the bottom-up attention tools._
* Performance: results of the three pre-trained ERNIE-ViL models on **VQA** are listed below.

| Models                            | <strong>test-dev</strong> | <strong>test-std</strong> |
| :-------------------------------- | :-----------------------: | :-----------------------: |
| ERNIE-ViL _base_                  | 73.18                     | 73.36                     |
| ERNIE-ViL _large_                 | 73.78                     | 73.96                     |
| ERNIE-ViL-Out&in-domain _large_   | 74.95                     | 75.10                     |
### **Cross-modal Retrieval**
* Datasets
* The original images and their captions can be obtained from [**here**](https://www.kaggle.com/hsankesara/flickr-image-dataset).
* Visual features are extracted with [**bottom-up attention**](https://github.com/jiasenlu/bottom-up-attention); the number of extracted boxes ranges from 0 to 36.
* For the text data, see the sample flickr.dev.data provided under data/flickr; the image features are organized as
```script
image_w, image_h, number_box, image_loc, image_embeddings
```
* Performance
* Results of the three pre-trained ERNIE-ViL models on **image retrieval (Flickr30k dataset)**:

| Models                            | <strong>R@1</strong>  | <strong>R@5</strong>  | <strong>R@10</strong> |
| :-------------------------------- | :-------------------: | :-------------------: | :-------------------: |
| ERNIE-ViL _base_                  | 74.44                 | 92.72                 | 95.94                 |
| ERNIE-ViL _large_                 | 75.10                 | 93.42                 | 96.26                 |
| ERNIE-ViL-Out&in-domain _large_   | 76.66                 | 94.16                 | 96.76                 |

* Results of the three pre-trained ERNIE-ViL models on **text retrieval (Flickr30k dataset)**:

| Models                            | <strong>R@1</strong>  | <strong>R@5</strong>  | <strong>R@10</strong> |
| :-------------------------------- | :-------------------: | :-------------------: | :-------------------: |
| ERNIE-ViL _base_                  | 86.70                 | 97.80                 | 99.00                 |
| ERNIE-ViL _large_                 | 88.70                 | 97.30                 | 99.10                 |
| ERNIE-ViL-Out&in-domain _large_   | 89.20                 | 98.50                 | 99.20                 |
### **Referring Expression Comprehension**
* Datasets
* Visual features are extracted following [MAttNet](https://github.com/lichengunc/MAttNet).
* A single training & validation sample is organized as
```script
expressions, image_w, image_h, number_box, number_boxes_gt, image_loc, image_embeddings, box_label, label
```
* Performance
* Results of the three pre-trained ERNIE-ViL models on **referring expression comprehension**:

| Models                            | <strong>val</strong>  | <strong>testA</strong> | <strong>testB</strong> |
| :-------------------------------- | :-------------------: | :--------------------: | :--------------------: |
| ERNIE-ViL _base_                  | 74.02                 | 80.33                  | 64.74                  |
| ERNIE-ViL _large_                 | 74.24                 | 80.97                  | 64.70                  |
| ERNIE-ViL-Out&in-domain _large_   | 75.89                 | 82.39                  | 66.91                  |
## Usage
...@@ -90,32 +158,62 @@ The ERNIE-ViL code is based on Paddle Fluid 1.8 and Python 2.7; other required modules are also
pip install -r requirements.txt
```
### Run Fine-tuning
Before running ERNIE-ViL fine-tuning, add the dynamic library paths of CUDA, cuDNN and NCCL2 to LD_LIBRARY_PATH. The parameter configuration files of the downstream tasks are placed under conf/, so fine-tuning can be driven simply by the configuration files. For example, you can fine-tune on each downstream task with the following command:
```script
sh run_finetuning.sh $task_name(vqa/flickr/refcoco_plus/vcr) conf/${task_name}/model_conf_${task_name} $vocab_file $ernie_vil_config $pretrain_models_params
```

The model links given above contain all required files, including the vocabulary file, configuration files and pre-trained parameters. The model and parameter configurations for fine-tuning can be found under conf/; the key settings behind the best results in the paper (large models) are summarized here:

| Tasks    | Batch Size | Learning Rate | # of Epochs | GPUs    | Layer Decay rate | Hidden dropout |
| -------- | ----------:| -------------:| -----------:| -------:| ----------------:| --------------:|
| VCR      | 16(x4)     | 1e-4          | 6           | 4x V100 | 0.9              | 0.1            |
| VQA 2.0  | 64(x4)     | 1e-4          | 15          | 4x V100 | 0.9              | 0.1            |
| RefCOCO+ | 64(x2)     | 1e-4          | 30          | 2x V100 | 0.9              | 0.2            |
| Flickr   | 8(x8)      | 2e-5          | 40          | 8x V100 | 0.0              | 0.1            |

All fine-tuning experiments on the downstream tasks are run on 32 GB NVIDIA V100 GPUs. If your GPU memory is not enough, consider running on more GPUs or reducing the batch_size in the configuration.
### Inference
With the fine-tuned models, you can evaluate the downstream tasks using the commands below (the related configuration files can be found in the previously downloaded packages):

#### VCR
```script
Task Q->A:  sh run_inference.sh vcr qa $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```
```script
Task Q->AR: sh run_inference.sh vcr qar $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```
_VCR inference can be run on a single 32 GB NVIDIA V100 GPU. The results cover the Q->A, QA->R and Q->AR tasks, where the Q->AR result is obtained by merging the results of the first two._

#### VQA
```script
sh run_inference.sh vqa eval $split(val/test_dev/test_std) conf/vqa/model_conf_vqa $vocab_file $ernie_vil_config $model_params $res_file
```
Note: _the VQA test samples carry no labels; submit the result file to the [**VQA website**](https://visualqa.org/) to obtain the scores._

#### RefCOCO+
```script
sh run_inference.sh refcoco_plus eval $split(val/test_A/test_B) conf/refcoco_plus/model_conf_refcoco_plus $vocab_file $ernie_vil_config $model_params $res_file
```

#### Flickr
```script
sh run_inference.sh flickr eval $split(dev/test) conf/flickr/model_conf_flickr $vocab_file $ernie_vil_config $model_params $res_file
```
Note: _the Flickr output is a prediction file; use tools/get_recall.py to compute the final recall scores._
## Citation
You can cite our paper in the following format:
......
...@@ -35,8 +35,13 @@ model_g.add_arg("task_name", str, "vcr", "Task to finetune on ERNIE-ViL")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 100, "Number of epoches for training.")
train_g.add_arg("learning_rate", float, 0.0001, "Learning rate used to train with warmup.")
train_g.add_arg("seq_dropout", float, 0.0, "dropout rate after the sequence output.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay", train_g.add_arg("lr_scheduler", str, "linear_warmup_decay",
"scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay', 'manual_warmup_decay']) "scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay', 'manual_warmup_decay'])
train_g.add_arg("layer_decay_rate", float, 0.0, "layer wise decay, 0.0 denote no layer decay")
train_g.add_arg("text_init_layers", int, 18, "diff from text and image layer, base:12-6=6, large:24-6=18")
train_g.add_arg("n_layers", int, 30, "max layers of text and image, base:12 + 6 , large:24 + 6")
train_g.add_arg("decay_steps", str, "", "learning rate decay steps, list with ;") train_g.add_arg("decay_steps", str, "", "learning rate decay steps, list with ;")
train_g.add_arg("lr_decay_ratio", float, 0.1, "learning rate decay ratio, used with manual_warmup_decay") train_g.add_arg("lr_decay_ratio", float, 0.1, "learning rate decay ratio, used with manual_warmup_decay")
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.") train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
...@@ -68,6 +73,9 @@ data_g.add_arg("feature_size", int, 2048, "Number of roi feature size of image." ...@@ -68,6 +73,9 @@ data_g.add_arg("feature_size", int, 2048, "Number of roi feature size of image."
data_g.add_arg("fusion_method", str, "sum", "Number of roi feature size of image.") data_g.add_arg("fusion_method", str, "sum", "Number of roi feature size of image.")
data_g.add_arg("batch_size", int, 16, "Total examples' number in batch for training. see also --in_tokens.") data_g.add_arg("batch_size", int, 16, "Total examples' number in batch for training. see also --in_tokens.")
data_g.add_arg("task_group_json", str, "", "Path to task json") data_g.add_arg("task_group_json", str, "", "Path to task json")
data_g.add_arg("scale_circle", float, "1.0", "The scale factor in circle loss function, only use in circle loss mode")
data_g.add_arg("use_sigmoid", bool, False, "Whether to use sigmoid to match score, use for explode problem")
data_g.add_arg("margin", float, "0.2", "The margin value in triplet loss function")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.")
......
...@@ -56,7 +56,6 @@ def prepare_batch_data(batch_records, num_choice, pad_id, task_index, task_num):
src_pos = np.array(batch_input_pos).astype("int64").reshape([num_choice * num_sample, max_len, 1])
src_seg = np.array(batch_seg_ids).astype("int64").reshape([num_choice * num_sample, max_len, 1])
src_masks = np.array(batch_input_masks).astype("float32").reshape([num_choice * num_sample, max_len, 1])
batch, seq_len, fea_len = image_embedding.shape
image_embedding = np.tile(np.expand_dims(image_embedding, axis=1), \
(1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, fea_len])
...@@ -64,7 +63,7 @@ def prepare_batch_data(batch_records, num_choice, pad_id, task_index, task_num):
(1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, 1])
image_loc = np.tile(np.expand_dims(image_loc, axis=1), \
(1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, 5])
return_list = [src_ids, src_pos, src_seg, src_masks, \
image_embedding, image_loc, image_mask, labels, batch_anno_ids]
return_list.append(np.array([task_index]).astype('int64'))
return_list.append(binary_labels)
...@@ -76,14 +75,187 @@ def prepare_batch_data(batch_records, num_choice, pad_id, task_index, task_num):
return return_list
def prepare_vqa_batch_data(insts,
total_token_num,
task_index,
task_num,
voc_size=0,
pad_id=None,
cls_id=None,
sep_id=None,
mask_id=None,
return_input_mask=True,
return_max_len=True,
return_num_token=False):
"""
prepare batch data for vqa tasks
"""
batch_src_ids = [inst["token_ids"] for inst in insts]
batch_sent_ids = [inst["sent_ids"] for inst in insts]
batch_pos_ids = [inst["pos_ids"] for inst in insts]
batch_image_embedding = [inst["image_embeddings"] for inst in insts]
batch_image_loc = [inst["image_loc"] for inst in insts]
batch_weight_label = [inst["weight_labels"] for inst in insts]
q_ids = np.array([inst["question_id"] for inst in insts])
#pad and trans to numpy array
src_id, self_input_mask, seq_lens = pad_batch_data(
batch_src_ids, pad_idx=pad_id, return_input_mask=True, return_seq_lens = True)
pos_id = pad_batch_data(batch_pos_ids, pad_idx=pad_id)
sent_id = pad_batch_data(batch_sent_ids, pad_idx=pad_id)
weight_labels = np.array(batch_weight_label).astype("float32")
#image_embedding_ori = copy.deepcopy(batch_image_embedding)
image_embedding, image_mask = pad_feature_data(batch_image_embedding, return_mask = True)
#image_embedding_ori = pad_feature_data(image_embedding_ori)
image_loc = pad_feature_data(batch_image_loc)
return_list = [
src_id, pos_id, sent_id, self_input_mask, \
image_embedding, image_loc, image_mask, weight_labels, q_ids
]
return return_list
def prepare_flickr_data(insts,
total_token_num,
task_index,
task_num,
voc_size=0,
pad_id=None,
cls_id=None,
sep_id=None,
mask_id=None,
outs=4,
return_input_mask=True,
return_max_len=True,
return_num_token=False):
"""
prepare flickr data for finetuning tasks
"""
if outs > 1:
batch_src_ids = [inst["token_ids"][out] for inst in insts for out in range(outs)]
batch_sent_ids = [inst["sent_ids"][out] for inst in insts for out in range(outs)]
batch_pos_ids = [inst["pos_ids"][out] for inst in insts for out in range(outs)]
batch_image_embedding = [inst["image_embeddings"][out] for inst in insts for out in range(outs)]
batch_image_loc = [inst["image_loc"][out] for inst in insts for out in range(outs)]
else:
batch_src_ids = [inst["token_ids"] for inst in insts]
batch_sent_ids = [inst["sent_ids"] for inst in insts]
batch_pos_ids = [inst["pos_ids"] for inst in insts]
batch_image_embedding = [inst["image_embeddings"] for inst in insts ]
batch_image_loc = [inst["image_loc"] for inst in insts ]
batch_ids = [inst["ids"] for inst in insts for out in range(outs)]
batch_size = int(len(batch_src_ids) / outs)
label = np.array([[0] for i in range(batch_size)], dtype = "int64")
src_id, self_input_mask, seq_lens = pad_batch_data(
batch_src_ids, pad_idx=pad_id, return_input_mask=True, return_seq_lens = True)
pos_id = pad_batch_data(batch_pos_ids, pad_idx=pad_id)
sent_id = pad_batch_data(batch_sent_ids, pad_idx=pad_id)
image_embeddings, image_mask = pad_feature_data(batch_image_embedding, return_mask = True)
image_loc = pad_feature_data(batch_image_loc)
ids = np.array(batch_ids, dtype = "int64")
return_list = [
src_id, pos_id, sent_id, self_input_mask, image_embeddings, image_loc, image_mask, label, ids]
return return_list
def prepare_refcoco_plus_batch_data(insts,
total_token_num,
task_index,
task_num,
voc_size=0,
pad_id=None,
return_input_mask=True,
return_max_len=True,
return_num_token=False):
"""
prepare batch data for refcoco_plus tasks
"""
batch_src_ids = [inst["token_ids"] for inst in insts]
batch_sent_ids = [inst["sent_ids"] for inst in insts]
batch_pos_ids = [inst["pos_ids"] for inst in insts]
batch_image_embedding = [inst["image_embeddings"] for inst in insts]
batch_image_loc = [inst["image_loc"] for inst in insts]
batch_image_label = [inst["label"] for inst in insts]
add_items = np.array([inst["add_item"] for inst in insts], dtype="float32")
src_id, self_input_mask, seq_lens = pad_batch_data(
batch_src_ids, pad_idx=pad_id, return_input_mask=True, return_seq_lens = True)
pos_id = pad_batch_data(batch_pos_ids, pad_idx=pad_id)
sent_id = pad_batch_data(batch_sent_ids, pad_idx=pad_id)
image_embedding, image_mask = pad_feature_data(batch_image_embedding, return_mask = True)
image_loc = pad_feature_data(batch_image_loc)
image_label = pad_feature_data(batch_image_label)
return_list = [
src_id, pos_id, sent_id, self_input_mask, seq_lens, \
image_embedding, image_loc, image_mask, image_label, add_items
]
return return_list
def pad_batch_data(insts,
pad_idx=0,
return_pos=False,
return_input_mask=False,
return_max_len=False,
return_num_token=False,
return_seq_lens=False):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and attention bias.
"""
return_list = []
max_len = max(len(inst) for inst in insts)
# Any token included in dict can be used to pad, since the paddings' loss
# will be masked out by weights and make no effect on parameter gradients.
inst_data = np.array(
[inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
# position data
if return_pos:
inst_pos = np.array([
list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
for inst in insts
])
return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
if return_input_mask:
# This is used to avoid attention on paddings.
input_mask_data = np.array([[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts])
input_mask_data = np.expand_dims(input_mask_data, axis=-1)
return_list += [input_mask_data.astype("float32")]
if return_max_len:
return_list += [max_len]
if return_num_token:
num_token = 0
for inst in insts:
num_token += len(inst)
return_list += [num_token]
if return_seq_lens:
seq_lens = np.array([len(inst) for inst in insts])
return_list += [seq_lens.astype("int64").reshape([-1, 1])]
return return_list if len(return_list) > 1 else return_list[0]
def pad_feature_data(data, pad_value=0.0, dtype="float32", return_mask=False):
"""
pad visual features with given pad value
"""
max_length = max([len(item) for item in data])
data_width = len(data[0][0])
out_data = np.ones((len(data), max_length, data_width), dtype=dtype) * pad_value
out_mask = np.zeros((len(data), max_length, 1), dtype=dtype)
for i in range(len(data)):
out_data[i, 0: len(data[i]), :] = data[i]
if return_mask:
......
output_model_path="./output_flickr"
lr_scheduler="manual_warmup_decay"
decay_steps="54360;72480"
num_train_steps=90600
SAVE_STEPS=4530
WARMUP_STEPS=9060
BATCH_SIZE=4
LR_RATE=1e-5
WEIGHT_DECAY=0.01
MAX_LEN=48
hardest=False
meansum=False
use_circle_loss=True
scale_circle=32.0
use_sigmoid=True
margin=0.3
[
{
"prob": 1.0,
"data_func": "image_text_match",
"task_name": "image_text_match",
"train_filelist": "./conf/flickr/flickr_retrieval_train.filelist",
"train_image_path":"./data/flickr/flickr_bottom_up_10_36.csv.0",
"train_caption_path":"./data/flickr/flickr.train.data",
"hardest_setting_path":"./data/flickr/hard_negative.pkl",
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"vocab_path" : "./package/vocab.txt",
"negative_schema":[ "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei", "ei"]
}
]
[
{
"prob": 1.0,
"data_func": "image_text_match",
"task_name": "image_text_match",
"test_filelist": "./conf/flickr/flickr_retrieval_test.filelist",
"dev_filelist": "./conf/flickr/flickr_retrieval_dev.filelist",
"train_filelist": "./conf/flickr/flickr_retrieval_train.filelist",
"train_image_path":"./data/flickr/flickr_bottom_up_10_36.csv.0",
"dev_image_path": "data/flickr/flickr_dev_10_36.csv.0",
"test_image_path": "data/flickr/flickr_test_10_36.csv.0",
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"vocab_path" : "./package/vocab.txt"
}
]
output_model_path="output_refcoco_plus"
lr_scheduler="manual_warmup_decay"
decay_steps="3290;4700;7050"
lr_decay_ratio=0.2
num_train_steps=26640
SAVE_STEPS=470
WARMUP_STEPS=940
BATCH_SIZE=32
VALID_STEPS=20000
LR_RATE=2e-5
WEIGHT_DECAY=0.01
MAX_LEN=80
layer_decay_rate=0.9
./data/refcoco_plus/refer_testA.part
./data/refcoco_plus/refer_testB.part
./data/refcoco_plus/train_part_sample
./data/refcoco_plus/refer_val.part
[
{
"prob": 1.0,
"data_func": "refcoco+",
"task_name": "refcoco+",
"train_filelist": "./conf/refcoco_plus/refer_train.filelist",
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"vocab_path" : "./package/vocab.txt"
}
]
[
{
"prob": 1.0,
"data_func": "refcoco+",
"task_name": "refcoco+",
"val_filelist": "./conf/refcoco_plus/refer_val.filelist",
"testA_filelist": "./conf/refcoco_plus/refer_testA.filelist",
"testB_filelist": "./conf/refcoco_plus/refer_testB.filelist",
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"vocab_path" : "./package/vocab.txt"
}
]
lr_decay_dict_file="./conf/vqa/vqa_finetune_decay.list"
output_model_path="output_20_mask"
lr_scheduler="manual_warmup_decay"
num_train_steps=50200
SAVE_STEPS=2510
WARMUP_STEPS=3710
BATCH_SIZE=16
LR_RATE=1e-4
decay_steps="15100;22590"
WEIGHT_DECAY=0.01
layer_decay_rate=0.9
text_init_layers=6
n_layers=18
MAX_LEN=16
task_group_json=./conf/vqa/task_vqa.json
[
{
"prob": 1.0,
"valid_filelist": "./conf/vqa/vqa_valid.filelist",
"vg_train_filelist": "./conf/vqa/vg_train.filelist",
"train_filelist": "./conf/vqa/vqa_train.filelist",
"num_class": 3129,
"classifier_hid_size": 2048,
"vg_init_epochs": 2,
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"vocab_path" : "./package/vocab.txt"
}
]
[
{
"prob": 1.0,
"data_func": "image_text_match",
"task_name": "image_text_match",
"val_filelist": "./conf/vqa/vqa_val.filelist",
"test_dev_filelist": "./conf/vqa/vqa_test_dev.filelist",
"test_std_filelist": "./conf/vqa/vqa_test_std.filelist",
"pickle_file": "./data/vqa/trainval_label2ans.pkl",
"num_class": 3129,
"classifier_hid_size": 2048,
"Proprocessor": "PreprocessorBasic",
"tokenizer_name" : "FullTokenizer",
"vocab_path" : "./package/vocab.txt"
}
]
./data/vqa/vg_train_part_sample
vqa_fc_w_0 2.5
vqa_fc_w_1 2.5
vqa_fc_b_0 2.5
vqa_fc_b_1 2.5
./data/vqa/test_dev_part_sample
./data/vqa/test_std_part_sample
./data/vqa/train_part_sample
67 0 A group of people stand in the back of a truck filled with cotton .
This diff is collapsed.
This diff is collapsed.
...@@ -63,7 +63,6 @@ class ErnieVilModel(object):
src_ids,
position_ids,
sentence_ids,
input_mask,
image_embeddings,
image_loc,
...@@ -115,10 +114,10 @@ class ErnieVilModel(object):
self._param_initializer = fluid.initializer.TruncatedNormal(
scale=config['initializer_range'])
self._build_model(src_ids, position_ids, sentence_ids, input_mask, \
image_embeddings, image_loc, input_image_mask)
def _build_model(self, src_ids, position_ids, sentence_ids, input_mask, \
image_embeddings, image_loc, input_image_mask):
# padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding(
......
...@@ -20,9 +20,7 @@ import numpy as np
import paddle.fluid as fluid
def manual_warmup_decay(learning_rate, warmup_steps, num_train_steps, decay_steps=[], lr_decay_ratio=0.1):
""" Applies linear warmup of the learning rate from 0, then keeps it constant."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
...@@ -49,9 +47,7 @@ def manual_warmup_decay(learning_rate, warmup_steps, num_train_steps, decay_step
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
""" Applies linear warmup of the learning rate from 0, then decays it to 0."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1],
...@@ -78,6 +74,41 @@ def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
return lr
def layer_decay(param, param_last, learning_rate, decay_rate, text_layers, n_layers):
""" layer_decay implementation """
delta = param - param_last
if "encoder_layer" in param.name and param.name.index("encoder_layer")==0:
layer = int(param.name.split("_")[2])
if layer >= text_layers:
cur_layer = text_layers + (layer - text_layers) * 2 + 1
else:
cur_layer = layer
ratio = decay_rate ** (n_layers - cur_layer)
print("text_layer_name:", param.name, "\t", "ratio:", ratio)
param_update = param + (ratio - 1) * delta
elif "encoder_vlayer" in param.name and param.name.index("encoder_vlayer")==0:
layer = int(param.name.split("_")[2])
cur_layer = text_layers + (layer) * 2 + 1
ratio = decay_rate ** (n_layers - cur_layer)
param_update = param + (ratio - 1) * delta
print("image_layer_name:", param.name, "\t", "ratio:", ratio)
elif "encoder_colayer" in param.name and param.name.index("encoder_colayer")==0:
layer = int(param.name.split("_")[2])
cur_layer = text_layers + (layer) * 2
ratio = decay_rate ** (n_layers - cur_layer)
param_update = param + (ratio - 1) * delta
print("co_layer_name:", param.name, "\t", "ratio:", ratio)
elif "embedding" in param.name:
ratio = decay_rate ** (n_layers + 1)
param_update = param + (ratio - 1) * delta
elif "image_emb" in param.name or "image_loc" in param.name:
ratio = decay_rate ** (n_layers - text_layers + 1)
param_update = param + (ratio - 1) * delta
else:
param_update = None
return param_update
def optimization(loss,
warmup_steps,
num_train_steps,
...@@ -88,10 +119,11 @@ def optimization(loss,
scheduler='linear_warmup_decay',
decay_steps=[],
lr_decay_dict_file="",
lr_decay_ratio=0.1,
layer_decay_rate=0.0,
text_init_layers=18,
n_layers=30):
""" optimization implementation """
if warmup_steps > 0:
if scheduler == 'noam_decay':
scheduled_lr = fluid.layers.learning_rate_scheduler \
...@@ -135,8 +167,7 @@ def optimization(loss,
clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0))
def exclude_from_weight_decay(name):
""" parameters that do not use weight decay
"""
if name.find("layer_norm") > -1:
return True
...@@ -154,6 +185,16 @@ def optimization(loss,
_, param_grads = optimizer.minimize(loss)
if layer_decay_rate > 0:
for param, grad in param_grads:
with param.block.program._optimized_guard(
[param, grad]), fluid.framework.name_scope("layer_decay"):
param_decay = layer_decay(param, param_list[param.name], scheduled_lr,
layer_decay_rate, text_init_layers, n_layers)
if param_decay:
fluid.layers.assign(output=param, input=param_decay)
if weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param.name):
......
...@@ -17,7 +17,6 @@ import sys
import os
import base64
import numpy as np
reload(sys)
sys.setdefaultencoding("utf-8")
...@@ -25,7 +24,7 @@ from preprocess import tokenization
class PreprocessorBasic(object):
"""
parent class for preprocess
"""
def __init__(self,
tokenizer_name,
...@@ -39,7 +38,7 @@ class PreprocessorBasic(object):
def convert_sentence_to_ids_without_cls(self, sentence):
"""
convert sentence to ids without cls
"""
tokens = self.tokenizer.tokenize(sentence)
ids = self.tokenizer.convert_tokens_to_ids(tokens)
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" VQA Data Reader implementation """
from __future__ import print_function
from __future__ import division
import os
import base64
import functools
import numpy as np
import types
import gzip
import logging
import re
import six
import collections
import copy
import random
import pickle
import paddle
import paddle.fluid as fluid
from batching.finetune_batching import prepare_flickr_data
from preprocess import preprocessor
class FlickrDataReader(object):
"""
data reader task for flickr
"""
def __init__(self,
task_group,
vocab_path,
split,
batch_size=4096,
max_seq_len=512,
shuffle_files=True,
epoch=100,
voc_size=0,
cls_size=0,
is_test=False):
self.vocab = self.load_vocab(vocab_path)
self.task_group = task_group
self.max_seq_len = max_seq_len
self.processor = getattr(preprocessor, task_group[0]["Proprocessor"])(
tokenizer_name =self.task_group[0]["tokenizer_name"],
vocab_path = vocab_path)
self.batch_size = batch_size
self.shuffle_files = shuffle_files
self.epoch = epoch
self.current_epoch = 0
self.current_file_index = 0
self.total_file = 0
self.current_file = None
self.voc_size = voc_size
self.cls_size = cls_size
self.pad_id = self.vocab["[PAD]"]
self.cls_id = self.vocab["[CLS]"]
self.sep_id = self.vocab["[SEP]"]
self.mask_id = self.vocab["[MASK]"]
self.is_test = is_test
if self.is_test:
self.epoch = 1
self.shuffle_files = False
self._test_image_list = []
if split == "dev":
image_path = self.task_group[0]["dev_image_path"]
else:
image_path = self.task_group[0]["test_image_path"]
else:
caption_path = self.task_group[0]["train_caption_path"]
self._load_caption_dict(caption_path)
image_path = self.task_group[0]["train_image_path"]
self._get_hardest_setting(self.task_group[0]["hardest_setting_path"])
self._negative_schema=self.task_group[0]["negative_schema"]
self._load_image_dict(image_path)
def decode_all(self, image_id, width, height, number_box, boxes, image_embeddings):
""" decode all data """
def decode_feature(base64_str, size):
""" decode feature from base64 """
size = int(size)
fea_base64 = base64.b64decode(base64_str)
fea_decode = np.frombuffer(fea_base64, dtype=np.float32)
shape = size, int(fea_decode.shape[0] / size)
features = np.resize(fea_decode, shape)
return features
image_embeddings = decode_feature(image_embeddings, number_box)
image_embeddings_cls = np.mean(image_embeddings, axis = 0, keepdims = True)
image_embeddings = np.concatenate([image_embeddings_cls, image_embeddings], 0)
boxes = decode_feature(boxes, number_box)
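# Normalize the pixel-space boxes by image width/height and append the
# relative box area, giving a 5-dim location [x1/W, y1/H, x2/W, y2/H, area];
# a whole-image box is prepended to pair with the mean-pooled "CLS" image
# feature added above.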
shape = np.repeat(np.array([[float(width), float(height), float(width), float(height)]]), \
number_box, axis=0)
boxes = boxes / shape
area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
image_loc = np.concatenate((boxes, np.expand_dims(area, 1)), axis = 1)
loc_cls = np.array([[0.0, 0.0, 1.0, 1.0, 1.0]], dtype = "float32")
image_loc = np.concatenate([loc_cls, image_loc], 0)
return int(number_box) + 1, image_loc, image_embeddings
def _load_image_dict(self, image_path):
self._image_feature_dict = {}
image_items = image_path.split(',')
cnt = 0
for image_item in image_items:
with open(image_item) as f:
for line in f:
cnt += 1
if cnt % 1000 == 0:
print('processing image feature:', cnt)
image_id, width, height, number_box, image_loc, image_embeddings \
= line.strip().split('\t')
number_box, image_loc, image_embeddings = self.decode_all( \
image_id, width, height, number_box, image_loc, image_embeddings)
self._image_feature_dict[int(image_id)] = (width, height, image_embeddings, number_box, image_loc)
if self.is_test:
self._test_image_list.append(int(image_id))
def _load_caption_dict(self, image_caption):
"""
Load caption dict for flickr
"""
self._caption_ids_dict = {}
self._image_sent_map = {}
with open(image_caption) as f:
cnt = 0
for line in f:
cnt += 1
line = line.strip().split("\t")
image_id, sent_id, text = line
token_ids = []
raw_ids = self.processor.convert_sentence_to_ids_without_cls(text)
token_ids.append(self.vocab["[CLS]"])
token_ids.extend(raw_ids)
token_ids.append(self.vocab["[SEP]"])
sent_ids = [0] * len(token_ids)
pos_ids = range(0, len(token_ids))
if cnt % 5000 == 0:
print(cnt)
if len(token_ids) > self.max_seq_len:
token_ids = token_ids[0: self.max_seq_len - 1] + [token_ids[-1]]
sent_ids = sent_ids[0: self.max_seq_len - 1] + [sent_ids[-1]]
pos_ids = pos_ids[0: self.max_seq_len]
assert len(token_ids) == len(sent_ids) == len(pos_ids), \
"[Must be true]len(token_ids) == len(sent_ids) == len(pos_ids)"
self._caption_ids_dict[int(sent_id)] = \
[token_ids, sent_ids, pos_ids, int(image_id)]
self._image_sent_map.setdefault(int(image_id), [])
self._image_sent_map[int(image_id)].append(int(sent_id))
self._train_caption_ids = self._caption_ids_dict.keys()
def _get_hardest_setting(self, hardest_setting_path):
"""
Get the hard negative pool and image list used for training
"""
with open(hardest_setting_path, 'rb') as f:
data = pickle.load(f)
self._train_hard_pool = data['train_hard_pool']
self._train_image_list = data['train_image_list']
self._train_imgId2pool = {imageId:i for i, imageId in enumerate(self._train_image_list)}
def get_progress(self):
"""
Return current progress of training data
"""
progress_dict = {"current_epoch": self.current_epoch,
"current_file_index": self.current_file_index,
"total_file": self.total_file,
"current_file": self.current_file
}
return progress_dict
def process_vl(self, line, max_seq_len):
"""
Process single v+l data
"""
if self.is_test:
line = line.strip().split("\t")
image_id, sent_id, text = line
token_ids = []
raw_ids = self.processor.convert_sentence_to_ids_without_cls(text)
token_ids.append(self.vocab["[CLS]"])
token_ids.extend(raw_ids)
token_ids.append(self.vocab["[SEP]"])
sent_ids = [0] * len(token_ids)
pos_ids = range(0, len(token_ids))
if len(token_ids) > self.max_seq_len:
token_ids = token_ids[0: self.max_seq_len - 1] + [token_ids[-1]]
sent_ids = sent_ids[0: self.max_seq_len - 1] + [sent_ids[-1]]
pos_ids = pos_ids[0: self.max_seq_len]
width, height, image_embeddings, number_box, image_loc = self._image_feature_dict[int(image_id)]
else:
sent_id = line
captions_pos = self._caption_ids_dict[sent_id]
image_id = captions_pos[-1]
captions = [captions_pos]
_, _, features, number_box, box = self._image_feature_dict[image_id]
images = [[features, number_box, box]]
for item in self._negative_schema:
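# Each two-character schema code produces one negative pair: the first character
# selects the sampling source ("h" = hard negative from the precomputed pool,
# "e" = easy/random image), and the second selects what is replaced
# ("i" = keep the caption but use the negative image's features,
# "c" = keep the image but use a caption of the negative image).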
if item[0] == "h":
rand_img_id_pool = self._train_hard_pool[self._train_imgId2pool[image_id]]
rand_idx = rand_img_id_pool[random.randint(1, len(rand_img_id_pool) - 1)]
image_id_neg = self._train_image_list[int(rand_idx)]
elif item[0] == "e":
while True:
image_id_neg = random.choice(self._train_image_list)
if image_id_neg != image_id:
break
else:
print("error negative schema")
exit()
if item[1] == "i":
_, _, features_neg, number_box_neg, box_neg = self._image_feature_dict[image_id_neg]
captions.append(self._caption_ids_dict[sent_id])
images.append([features_neg, number_box_neg, box_neg])
elif item[1] == "c":
sent_id_neg = random.choice(self._image_sent_map[image_id_neg])
captions.append(self._caption_ids_dict[sent_id_neg])
images.append([features, number_box, box])
else:
print("error negative schema")
exit()
token_ids, sent_ids, pos_ids, _ = zip(*captions)
image_embeddings, number_box, image_loc = zip(*images)
sample_json = {
"token_ids": token_ids,
"sent_ids": sent_ids,
"pos_ids": pos_ids,
"image_loc": image_loc,
"image_embeddings": image_embeddings,
"image_id": int(image_id),
"sent_id": int(sent_id),
"ids": [image_id, sent_id]
}
return sample_json
def parse_line(self, line, max_seq_len=512, task_index=None):
""" parse one line to token_ids, sentence_ids, pos_ids, label """
sample_json = self.process_vl(line, max_seq_len)
token_ids = sample_json["token_ids"]
return sample_json
def read_file(self, file, task_index):
"""
read line data from file
"""
if self.is_test:
with open(file) as f:
lines = f.readlines()
for line in lines:
yield line
else:
random.shuffle(self._train_caption_ids)
for item in self._train_caption_ids:
yield item
def convert_to_unicode(self, text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(self, vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
fin = open(vocab_file)
for num, line in enumerate(fin):
items = self.convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
index = items[1] if len(items) == 2 else num
token = token.strip()
vocab[token] = int(index)
return vocab
def data_generator(self):
""" data_generator """
filelist_key = "train_filelist"
if self.is_test:
filelist_key = "dev_filelist"
all_files = []
task_probs = []
sum = 0.0
for task in self.task_group:
all_files.append(open(task[filelist_key]).readlines())
task_probs.append(task["prob"])
sum += task["prob"]
for i in xrange(len(task_probs)):
task_probs[i] = task_probs[i] / sum
task_probs = np.array(task_probs).ravel()
def wrapper():
""" wrapper """
def reader(task_index):
""" reader """
files = all_files[task_index]
for epoch in range(self.epoch):
if self.shuffle_files:
np.random.shuffle(files)
for index, file in enumerate(files):
file = file.strip()
sample_generator = paddle.reader.xmap_readers(self.parse_line, \
functools.partial(self.read_file, file=file, task_index=task_index), 8, 2000)
for sample in sample_generator():
if not self.is_test:
self.current_epoch = epoch + 1
self.current_file_index = index + 1
self.current_file = file
self.total_file = len(files)
yield sample
else:
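# At test time each caption is paired with every image in self._test_image_list,
# so that the full caption-image score matrix can be assembled downstream.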
cap_id = sample["ids"][1]
for image_id in self._test_image_list:
line_json = copy.deepcopy(sample)
_, _, image_embeddings, number_box, image_loc = self._image_feature_dict[image_id]
line_json["image_embeddings"] = image_embeddings
line_json["image_loc"] = image_loc
line_json["ids"][0] = image_id
yield line_json
def batch_reader(reader, batch_size):
""" batch reader """
batch, total_token_num, max_len = [], 0, 0
cur_size = 0
dev_count = 1
buff = []
readers = []
for i in xrange(len(task_probs)):
buff.append(None)
readers.append(reader(i))
task_indices = range(len(task_probs))
end_times = 0
while end_times < 50:
task_index = np.random.choice(task_indices, p=task_probs)
dev_num = 0
cur_reader = readers[task_index]
while dev_num < dev_count:
if buff[task_index] is not None:
cur_len = len(buff[task_index]["token_ids"])
max_len = max(max_len, cur_len)
batch.append(buff[task_index])
total_token_num += cur_len
buff[task_index] = None
cur_size += 1
parsed_line = next(cur_reader, None)
if parsed_line is None:
end_times += 1
dev_num += 1
if len(batch) > 0:
yield batch, total_token_num, task_index
batch, total_token_num, max_len = [], 0, 0
continue
end_times = 0
cur_len = len(parsed_line["token_ids"])
max_len = max(max_len, cur_len)
if cur_size >= batch_size:
yield batch, total_token_num, task_index
batch, total_token_num, max_len = [], 0, 0
cur_size = 0
dev_num += 1
buff[task_index] = parsed_line
else:
batch.append(parsed_line)
cur_size += 1
total_token_num += cur_len
for batch_data, total_token_num, task_index in batch_reader(reader, self.batch_size):
if self.is_test:
outs = 1
else:
outs = len(self._negative_schema)+1
yield prepare_flickr_data(
batch_data,
total_token_num,
task_index,
len(self.task_group),
voc_size=self.voc_size,
pad_id=self.pad_id,
cls_id=self.cls_id,
sep_id=self.sep_id,
mask_id=self.mask_id,
outs=outs,
return_input_mask=True,
return_max_len=False,
return_num_token=False)
return wrapper
if __name__ == "__main__":
pass
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" RefcocoPlus DataReader implementation """
from __future__ import print_function
from __future__ import division
import os
import base64
import numpy as np
import types
import gzip
import logging
import re
import six
import collections
import random
import paddle
import paddle.fluid as fluid
from batching.finetune_batching import prepare_refcoco_plus_batch_data
from preprocess import preprocessor
class RefcocoPlusDataReader(object):
"""
Data reader for the RefCOCO+ task
"""
def __init__(self,
task_group,
split,
vocab_path,
batch_size=4096,
max_seq_len=512,
shuffle_files=True,
epoch=100,
voc_size=0,
is_test=False):
self.vocab = self.load_vocab(vocab_path)
self.task_group = task_group
self.processor = getattr(preprocessor, task_group[0]["Proprocessor"])(
tokenizer_name =self.task_group[0]["tokenizer_name"],
vocab_path = vocab_path)
self.batch_size = batch_size
self.shuffle_files = shuffle_files
self.epoch = epoch
self.split = split
self.current_epoch = 0
self.current_file_index = 0
self.total_file = 0
self.current_file = None
self.voc_size = voc_size
self.max_seq_len = max_seq_len
self.pad_id = self.vocab["[PAD]"]
self.cls_id = self.vocab["[CLS]"]
self.sep_id = self.vocab["[SEP]"]
self.mask_id = self.vocab["[MASK]"]
self.input_slots = 9
self.is_test = is_test
if is_test:
self.epoch = 1
self.shuffle_files = False
def get_progress(self):
"""
Return current progress of training data
"""
self.progress_dict = {"current_epoch": self.current_epoch,
"current_file_index": self.current_file_index,
"total_file": self.total_file,
"current_file": self.current_file
}
return self.progress_dict
def process_vl(self, line, max_seq_len):
"""
process single v+l data
"""
def decode_feature(base64_str, size):
"""
decode feature from base64
"""
fea_base64 = base64.b64decode(base64_str)
fea_decode = np.frombuffer(fea_base64, dtype=np.float32)
shape = size, int(fea_decode.shape[0] / size)
features = np.resize(fea_decode, shape)
return features
text, image_w, image_h, number_boxes, number_boxes_gl, image_loc, \
image_embeddings, box_label, label = line
token_ids = []
raw_ids = self.processor.convert_sentence_to_ids_without_cls(text)
token_ids.append(self.vocab["[CLS]"])
token_ids.extend(raw_ids)
token_ids.append(self.vocab["[SEP]"])
sent_ids = [0] * len(token_ids)
pos_ids = range(0, len(token_ids))
#print("sent_ids:", sent_ids)
token_ids = [int(token) for token in token_ids]
sent_ids = [int(token) for token in sent_ids]
pos_ids = [int(token) for token in pos_ids]
assert len(token_ids) == len(sent_ids) == len(pos_ids), \
"[Must be true]len(token_ids) == len(sent_ids) == len(pos_ids)"
if len(token_ids) > self.max_seq_len:
token_ids = token_ids[0: self.max_seq_len - 1] + [token_ids[-1]]
sent_ids = sent_ids[0: self.max_seq_len - 1] + [sent_ids[-1]]
pos_ids = pos_ids[0: self.max_seq_len]
all_number_box = int(number_boxes) + int(number_boxes_gl)
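# Normalize all (detected + global) box coordinates by the image size, append the
# relative box area as a fifth location feature, and prepend a whole-image box whose
# embedding is the mean of the region features, acting as the global image token.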
image_loc = decode_feature(image_loc, all_number_box)
shape_np = np.repeat(np.array(\
[[float(image_w), float(image_h), float(image_w), float(image_h)]]), all_number_box, axis=0)
boxes_np = image_loc / shape_np
area = (boxes_np[:, 3] - boxes_np[:, 1]) * (boxes_np[:, 2] - boxes_np[:, 0])
image_loc = np.concatenate((boxes_np, np.expand_dims(area, 1)), axis = 1)
loc_cls = np.array([[0.0, 0.0, 1.0, 1.0, 1.0]], dtype = "float32")
image_loc = np.concatenate([loc_cls, image_loc], 0)
image_embeddings = decode_feature(image_embeddings, all_number_box)
image_embeddings_cls = np.mean(image_embeddings, axis = 0, keepdims = True)
image_embeddings = np.concatenate([image_embeddings_cls, image_embeddings], 0)
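# The label at the prepended whole-image position is the area ratio of the
# ground-truth box to the image; it, and during training the per-region scores
# (presumably overlap with the ground-truth box), are zeroed below score_th.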
x1, y1, x2, y2 = [float(item) for item in box_label.split(" ")]
cls_label = (x2 - x1 + 1) * (y2 - y1 + 1) /(float(image_w) * float(image_h))
score_th = 0.5
if cls_label < score_th:
cls_label = 0.0
label_tmp = label.split(" ")
if not self.is_test:
for i in range(len(label_tmp)):
if float(label_tmp[i]) < score_th:
label_tmp[i] = 0.0
label = [[cls_label]] + [[float(token)] for token in label_tmp]
label = np.array(label, dtype="float32")
add_item = [all_number_box + 1, image_w, image_h] + [float(item) for item in box_label.split(" ")]
sample_json = {
"token_ids": token_ids,
"sent_ids": sent_ids,
"pos_ids": pos_ids,
"label": label,
"image_loc": image_loc,
"image_embeddings": image_embeddings,
"all_number_box": all_number_box,
"add_item": add_item
}
return sample_json
def parse_line(self, line, max_seq_len=512, task_index=None):
""" parse one line to token_ids, sentence_ids, pos_ids, label """
line = line.strip().split("\t")
assert len(line) == self.input_slots, "One sample must have %d fields!" % self.input_slots
sample_json = self.process_vl(line, max_seq_len)
token_ids = sample_json["token_ids"]
return sample_json
def read_file(self, file, task_index):
""" read line data from a file """
try:
assert file.endswith('.gz'), "[ERROR] %s is not a gzip file" % file
with gzip.open(file, "rb") as f:
lines = f.readlines()
except:
with open(file, "rb") as f:
lines = f.readlines()
if not self.is_test:
np.random.shuffle(lines)
for line in lines:
parsed_line = self.parse_line(
line, max_seq_len=self.max_seq_len, task_index=task_index)
if parsed_line is None:
continue
yield parsed_line
def convert_to_unicode(self, text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(self, vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
fin = open(vocab_file)
for num, line in enumerate(fin):
items = self.convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
index = items[1] if len(items) == 2 else num
token = token.strip()
vocab[token] = int(index)
return vocab
def data_generator(self):
""" data_generator """
if self.split == "train":
filelist_key = "train_filelist"
elif self.split == "val":
filelist_key = "val_filelist"
elif self.split == "testA":
filelist_key = "testA_filelist"
else: filelist_key = "testB_filelist"
all_files = []
task_probs = []
sum = 0.0
for task in self.task_group:
all_files.append(open(task[filelist_key]).readlines())
task_probs.append(task["prob"])
sum += task["prob"]
for i in xrange(len(task_probs)):
task_probs[i] = task_probs[i] / sum
task_probs = np.array(task_probs).ravel()
def wrapper():
"""
wrapper
"""
def reader(task_index):
"""
reader
"""
files = all_files[task_index]
for epoch in range(self.epoch):
if self.shuffle_files:
if epoch < 0:
files = files + open(task["gt_train_filelist"]).readlines()
np.random.shuffle(files)
for index, file in enumerate(files):
file = file.strip()
sample_generator = self.read_file(file, task_index)
for sample in sample_generator:
self.current_epoch = epoch + 1
self.current_file_index = index + 1
self.current_file = file
self.total_file = len(files)
if sample is None:
continue
yield sample
def batch_reader(reader, batch_size):
"""
batch reader
"""
batch, total_token_num, max_len = [], 0, 0
cur_size = 0
dev_count = 1
buff = []
readers = []
for i in xrange(len(task_probs)):
buff.append(None)
readers.append(reader(i))
task_indices = range(len(task_probs))
end_times = 0
while end_times < 50:
task_index = np.random.choice(task_indices, p=task_probs)
dev_num = 0
cur_reader = readers[task_index]
while dev_num < dev_count:
if buff[task_index] is not None:
cur_len = len(buff[task_index]["token_ids"])
max_len = max(max_len, cur_len)
batch.append(buff[task_index])
total_token_num += cur_len
buff[task_index] = None
cur_size += 1
parsed_line = next(cur_reader, None)
if parsed_line is None:
end_times += 1
dev_num += 1
if len(batch) > 0:
yield batch, total_token_num, task_index
batch, total_token_num, max_len = [], 0, 0
continue
end_times = 0
cur_len = len(parsed_line["token_ids"])
max_len = max(max_len, cur_len)
if cur_size >= batch_size:
yield batch, total_token_num, task_index
batch, total_token_num, max_len = [], 0, 0
cur_size = 0
dev_num += 1
buff[task_index] = parsed_line
else:
batch.append(parsed_line)
cur_size += 1
total_token_num += cur_len
for batch_data, total_token_num, task_index in batch_reader(reader, self.batch_size):
yield prepare_refcoco_plus_batch_data(
batch_data,
total_token_num,
task_index,
len(self.task_group),
voc_size=self.voc_size,
pad_id=self.pad_id,
return_input_mask=True,
return_max_len=False,
return_num_token=False)
return wrapper
if __name__ == "__main__":
pass
...@@ -33,9 +33,7 @@ from batching.finetune_batching import prepare_batch_data ...@@ -33,9 +33,7 @@ from batching.finetune_batching import prepare_batch_data
import paddle.fluid as fluid import paddle.fluid as fluid
def _converId(img_id): def _converId(img_id):
""" """ conversion for image ID """
conversion for image ID
"""
img_id = img_id.split('-') img_id = img_id.split('-')
if 'train' in img_id[0]: if 'train' in img_id[0]:
new_id = int(img_id[1]) new_id = int(img_id[1])
...@@ -49,9 +47,7 @@ def _converId(img_id): ...@@ -49,9 +47,7 @@ def _converId(img_id):
def _load_annotationsQ_A(annotations_jsonpath, split): def _load_annotationsQ_A(annotations_jsonpath, split):
""" """Build an index out of FOIL annotations, mapping each image ID with its corresponding captions."""
Build an index out of FOIL annotations, mapping each image ID with its corresponding captions.
"""
entries = [] entries = []
with open(annotations_jsonpath) as f: with open(annotations_jsonpath) as f:
for annotation in json_lines.reader(f): for annotation in json_lines.reader(f):
...@@ -76,9 +72,7 @@ def _load_annotationsQ_A(annotations_jsonpath, split): ...@@ -76,9 +72,7 @@ def _load_annotationsQ_A(annotations_jsonpath, split):
def _load_annotationsQA_R(annotations_jsonpath, split): def _load_annotationsQA_R(annotations_jsonpath, split):
""" """Build an index out of FOIL annotations, mapping each image ID with its corresponding captions."""
Build an index out of FOIL annotations, mapping each image ID with its corresponding captions.
"""
entries = [] entries = []
with open(annotations_jsonpath, 'rb') as f: with open(annotations_jsonpath, 'rb') as f:
for annotation in json_lines.reader(f): for annotation in json_lines.reader(f):
...@@ -117,7 +111,7 @@ def _load_annotationsQA_R(annotations_jsonpath, split): ...@@ -117,7 +111,7 @@ def _load_annotationsQA_R(annotations_jsonpath, split):
class VCRDataReader(object): class VCRDataReader(object):
""" """
Data reader for sub VCR task data reader task for vcr
""" """
def __init__(self, def __init__(self,
task_conf, task_conf,
...@@ -193,7 +187,7 @@ class VCRDataReader(object): ...@@ -193,7 +187,7 @@ class VCRDataReader(object):
def generate_random_name(self, det_names): def generate_random_name(self, det_names):
""" """
Replace "person" with a random name replace "person" with a random name
""" """
random_name = [] random_name = []
for name in det_names: for name in det_names:
...@@ -207,7 +201,7 @@ class VCRDataReader(object): ...@@ -207,7 +201,7 @@ class VCRDataReader(object):
def replace_det_with_name(self, inputs, random_names): def replace_det_with_name(self, inputs, random_names):
""" """
Replace det with name replace det with name
""" """
tokens = [] tokens = []
mask = [] mask = []
...@@ -224,7 +218,7 @@ class VCRDataReader(object): ...@@ -224,7 +218,7 @@ class VCRDataReader(object):
def _truncate_seq_pair(self, tokens_a, tokens_b, max_length): def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
""" """
Truncates a sequence pair in place to the maximum length. Truncates a sequence pair in place to the maximum length.
""" """
while True: while True:
total_length = len(tokens_a) + len(tokens_b) total_length = len(tokens_a) + len(tokens_b)
...@@ -237,7 +231,7 @@ class VCRDataReader(object): ...@@ -237,7 +231,7 @@ class VCRDataReader(object):
def get_progress(self): def get_progress(self):
""" """
Return current progress of traning data return current progress of traning data
""" """
progress_dict = {"current_epoch": self.current_epoch, progress_dict = {"current_epoch": self.current_epoch,
"current_file_index": self.current_file_index, "current_file_index": self.current_file_index,
...@@ -248,7 +242,7 @@ class VCRDataReader(object): ...@@ -248,7 +242,7 @@ class VCRDataReader(object):
def tokenize(self): def tokenize(self):
""" """
Tokenizes the captions. Tokenizes the captions.
""" """
# This will add caption_tokens in each entry of the dataset. # This will add caption_tokens in each entry of the dataset.
# -1 represents nil, and should be treated as padding_idx in embedding. # -1 represents nil, and should be treated as padding_idx in embedding.
...@@ -312,7 +306,7 @@ class VCRDataReader(object): ...@@ -312,7 +306,7 @@ class VCRDataReader(object):
def parse_line(self, s_index): def parse_line(self, s_index):
""" """
Form slot info with the line information form the slot info from line
""" """
entry = self._entries[s_index] entry = self._entries[s_index]
image_id = entry["img_id"] image_id = entry["img_id"]
...@@ -367,13 +361,11 @@ class VCRDataReader(object): ...@@ -367,13 +361,11 @@ class VCRDataReader(object):
return record return record
def data_generator(self): def data_generator(self):
""" """ data_generator """
Data_generator
"""
sample_indice = range(len(self._entries)) sample_indice = range(len(self._entries))
def wrapper(): def wrapper():
""" """
Wrapper wrapper
""" """
for epoch_index in range(self.epoch): for epoch_index in range(self.epoch):
if self._split == "train": if self._split == "train":
...@@ -402,9 +394,7 @@ class VCRDataReader(object): ...@@ -402,9 +394,7 @@ class VCRDataReader(object):
class VCRDataJointReader(object): class VCRDataJointReader(object):
""" """ Joint data reader for Q2A task and QA2R task"""
Joint data reader for Q2A task and QA2R task
"""
def __init__(self, def __init__(self,
task_conf_group, task_conf_group,
split, split,
...@@ -435,8 +425,7 @@ class VCRDataJointReader(object): ...@@ -435,8 +425,7 @@ class VCRDataJointReader(object):
self.task_generators = [reader.data_generator() for reader in self.task_readers] self.task_generators = [reader.data_generator() for reader in self.task_readers]
def get_progress(self): def get_progress(self):
""" """return current progress of traning data
Return current progress of traning data
""" """
current_epoch = max([reader.current_epoch for reader in self.task_readers]) current_epoch = max([reader.current_epoch for reader in self.task_readers])
current_file_index = max([reader.current_file_index for reader in self.task_readers]) current_file_index = max([reader.current_file_index for reader in self.task_readers])
...@@ -450,9 +439,7 @@ class VCRDataJointReader(object): ...@@ -450,9 +439,7 @@ class VCRDataJointReader(object):
return self.progress_dict return self.progress_dict
def data_generator(self): def data_generator(self):
""" """ data_generator """
Data_generator
"""
def wrapper(): def wrapper():
""" """
warpper warpper
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" VQA Data Reader implementation """
from __future__ import print_function
from __future__ import division
import os
import base64
import functools
import numpy as np
import types
import gzip
import logging
import re
import six
import collections
import random
import paddle
import paddle.fluid as fluid
from batching.finetune_batching import prepare_vqa_batch_data
from preprocess import preprocessor
class VQADataReader(object):
"""
Data reader for the VQA task
"""
def __init__(self,
task_group,
split,
vocab_path,
batch_size=4096,
num_class=3129,
max_seq_len=512,
shuffle_files=True,
epoch=100,
voc_size=0,
cls_size=0,
is_test=False):
self.vocab = self.load_vocab(vocab_path)
self.task_group = task_group
self.processor = getattr(preprocessor, task_group[0]["Proprocessor"])(
tokenizer_name =self.task_group[0]["tokenizer_name"],
vocab_path = vocab_path)
self.batch_size = batch_size
self.shuffle_files = shuffle_files
self.epoch = epoch
self.current_epoch = 0
self.current_file_index = 0
self.total_file = 0
self.num_class=num_class
self.current_file = None
self.voc_size = voc_size
self.cls_size = cls_size
self.max_seq_len = max_seq_len
self.pad_id = self.vocab["[PAD]"]
self.cls_id = self.vocab["[CLS]"]
self.sep_id = self.vocab["[SEP]"]
self.mask_id = self.vocab["[MASK]"]
self.is_test = is_test
self.split = split
if self.is_test:
self.epoch = 1
self.shuffle_files = False
self.vg_init_epochs = 0
else:
self.vg_init_epochs = int(self.task_group[0]["vg_init_epochs"])
def get_progress(self):
"""
Return current progress of training data
"""
self.progress_dict = {"current_epoch": self.current_epoch,
"current_file_index": self.current_file_index,
"total_file": self.total_file,
"current_file": self.current_file
}
return self.progress_dict
def process_vl(self, line, max_seq_len):
"""
Transform the original text and image fields into the model inputs
"""
def decode_feature(base64_str, size):
"""
decode feature from base64
"""
fea_base64 = base64.b64decode(base64_str)
fea_decode = np.frombuffer(fea_base64, dtype=np.float32)
shape = size, int(fea_decode.shape[0] / size)
features = np.resize(fea_decode, shape)
return features
question_id, text, match_label, score, image_w, image_h, number_box, \
image_loc, image_embeddings = line
token_ids = []
raw_ids = self.processor.convert_sentence_to_ids_without_cls(text)
token_ids.append(self.vocab["[CLS]"])
token_ids.extend(raw_ids)
token_ids.append(self.vocab["[SEP]"])
sent_ids = [0] * len(token_ids)
pos_ids = range(0, len(token_ids))
token_ids = [int(token) for token in token_ids]
sent_ids = [int(token) for token in sent_ids]
pos_ids = [int(token) for token in pos_ids]
if len(token_ids) > self.max_seq_len:
token_ids = token_ids[0: self.max_seq_len - 1] + [token_ids[-1]]
sent_ids = sent_ids[0: self.max_seq_len - 1] + [sent_ids[-1]]
pos_ids = pos_ids[0: self.max_seq_len]
labels = [int(label_tok) for label_tok in match_label.split("|")]
scores = [float(score_tok) for score_tok in score.split("|")]
number_box = int(number_box)
question_id = int(question_id)
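# Normalize box coordinates by the image size, append the relative box area as a
# fifth location feature, and prepend a whole-image box with the mean region
# feature as the global image token.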
image_loc = decode_feature(image_loc, number_box)
shape_np = np.repeat(np.array(\
[[float(image_w), float(image_h), float(image_w), float(image_h)]]), number_box, axis=0)
boxes_np = image_loc / shape_np
area = (boxes_np[:, 3] - boxes_np[:, 1]) * (boxes_np[:, 2] - boxes_np[:, 0])
image_loc = np.concatenate((boxes_np, np.expand_dims(area, 1)), axis = 1)
loc_cls = np.array([[0.0, 0.0, 1.0, 1.0, 1.0]], dtype = "float32")
image_loc = np.concatenate([loc_cls, image_loc], 0)
try:
image_embeddings = decode_feature(image_embeddings, number_box)
image_embeddings_cls = np.mean(image_embeddings, axis = 0, keepdims = True)
image_embeddings = np.concatenate([image_embeddings_cls, image_embeddings], 0)
self.default_image_emb = image_embeddings
except:
print("error data occur, a random default image emb will be assin to this one")
print("the wrong line occur")
image_embeddings = self.default_image_emb
weight_labels = self.get_weight_label(self.num_class, labels, scores)
sample_json = {
"question_id": question_id,
"token_ids": token_ids,
"sent_ids": sent_ids,
"pos_ids": pos_ids,
"weight_labels": weight_labels,
"image_loc": image_loc,
"image_embeddings": image_embeddings,
}
return sample_json
def get_weight_label(self, num_class, labels, scores):
"""assign the corresponding score for the labels
Input: labels (Indefinite length list, like [1, 2, 3])
scores (Indefinite length list, like [0.1, 0.2, 0.3])
Output: weight_score (list, length equals num_class)
"""
assert len(labels) == len(scores), \
"unequals length with labels has %d number(s) while scores has %d number(s)!" % (len(labels), len(scores))
weight_score = [0] * num_class
for i in range(len(labels)):
weight_score[labels[i]] = scores[i]
return weight_score
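# Example with hypothetical values: num_class=4, labels=[1, 3], scores=[0.9, 0.3]
# produces the soft target vector [0, 0.9, 0, 0.3].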
def parse_line(self, line, max_seq_len=512, task_index=None):
""" parse one line to token_ids, sentence_ids, pos_ids, label """
line = line.strip().split("\t")
sample_json = self.process_vl(line, max_seq_len)
return sample_json
def read_file(self, file, task_index):
""" read line data from a file """
with open(file, "rb") as f:
lines = f.readlines()
if not self.is_test:
np.random.shuffle(lines)
for line in lines:
yield line
def convert_to_unicode(self, text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(self, vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
fin = open(vocab_file)
for num, line in enumerate(fin):
items = self.convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
index = items[1] if len(items) == 2 else num
token = token.strip()
vocab[token] = int(index)
return vocab
def data_generator(self):
""" data_generator """
filelist_key = "train_filelist"
if self.is_test:
if self.split == "val":
filelist_key = "val_filelist"
elif self.split == "test_dev":
filelist_key = "test_dev_filelist"
elif self.split == "test_std":
filelist_key = "test_std_filelist"
else:
print("*************no split named as :", self.split, "********************")
return None
all_files = []
task_probs = []
sum = 0.0
for task in self.task_group:
all_files.append(open(task[filelist_key]).readlines())
task_probs.append(task["prob"])
sum += task["prob"]
for i in xrange(len(task_probs)):
task_probs[i] = task_probs[i] / sum
task_probs = np.array(task_probs).ravel()
def wrapper():
"""
wrapper
"""
def reader(task_index):
"""
reader
"""
files = all_files[task_index]
global_rng = np.random.RandomState(0)
for epoch in range(self.epoch):
if epoch < self.vg_init_epochs:
files = open(task["vg_train_filelist"]).readlines() + all_files[task_index]
if self.shuffle_files:
global_rng.shuffle(files)
for index, file in enumerate(files):
file = file.strip()
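# Shard the file list across distributed trainers: each trainer only reads the
# files whose index matches its PADDLE_TRAINER_ID modulo PADDLE_TRAINERS_NUM.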
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
try:
trainers_num = int(os.getenv("PADDLE_TRAINERS_NUM"))
except:
print("can not get env PADDLE_TRAINERS_NUM, set trainer_nums to 1")
trainers_num = 1
if index % trainers_num != trainer_id:
continue
sample_generator = paddle.reader.xmap_readers(self.parse_line, \
functools.partial(self.read_file, file=file, task_index=task_index), 4, 200)
for sample in sample_generator():
self.current_epoch = epoch + 1
self.current_file_index = index + 1
self.current_file = file
self.total_file = len(files)
if sample is None:
continue
yield sample
def batch_reader(reader, batch_size):
"""
Batch data reader
"""
batch, total_token_num, max_len = [], 0, 0
cur_size = 0
dev_count = 1
buff = []
readers = []
for i in xrange(len(task_probs)):
buff.append(None)
readers.append(reader(i))
task_indices = range(len(task_probs))
end_times = 0
while end_times < 50:
task_index = np.random.choice(task_indices, p=task_probs)
dev_num = 0
cur_reader = readers[task_index]
while dev_num < dev_count:
if buff[task_index] is not None:
cur_len = len(buff[task_index]["token_ids"])
max_len = max(max_len, cur_len)
batch.append(buff[task_index])
total_token_num += cur_len
buff[task_index] = None
cur_size += 1
parsed_line = next(cur_reader, None)
if parsed_line is None:
end_times += 1
dev_num += 1
if len(batch) > 0:
yield batch, total_token_num, task_index
batch, total_token_num, max_len = [], 0, 0
continue
end_times = 0
cur_len = len(parsed_line["token_ids"])
max_len = max(max_len, cur_len)
if cur_size >= batch_size:
yield batch, total_token_num, task_index
batch, total_token_num, max_len = [], 0, 0
cur_size = 0
dev_num += 1
buff[task_index] = parsed_line
else:
batch.append(parsed_line)
cur_size += 1
total_token_num += cur_len
for batch_data, total_token_num, task_index in batch_reader(reader, self.batch_size):
yield prepare_vqa_batch_data(
batch_data,
total_token_num,
task_index,
len(self.task_group),
voc_size=self.voc_size,
pad_id=self.pad_id,
cls_id=self.cls_id,
sep_id=self.sep_id,
mask_id=self.mask_id,
return_input_mask=True,
return_max_len=False,
return_num_token=False)
return wrapper
if __name__ == "__main__":
pass
...@@ -13,6 +13,8 @@ source $CONF_FILE ...@@ -13,6 +13,8 @@ source $CONF_FILE
#configure your cuda and cudnn #configure your cuda and cudnn
#configure nccl #configure nccl
#export LD_LIBRARY_PATH=/home/work/cuda-9.0/lib64:/home/work/cudnn/cudnn_v7/cuda/lib64:$LD_LIBRARY_PATH
#export LD_LIBRARY_PATH=./nccl_2.3.5/lib/:$LD_LIBRARY_PATH
export FLAGS_fast_eager_deletion_mode=1 export FLAGS_fast_eager_deletion_mode=1
export FLAGS_eager_delete_tensor_gb=0.0 export FLAGS_eager_delete_tensor_gb=0.0
...@@ -44,6 +46,10 @@ python finetune.py --use_cuda "True" \ ...@@ -44,6 +46,10 @@ python finetune.py --use_cuda "True" \
--lr_scheduler ${lr_scheduler} \ --lr_scheduler ${lr_scheduler} \
--decay_steps ${decay_steps-""} \ --decay_steps ${decay_steps-""} \
--lr_decay_ratio ${lr_decay_ratio-0.1} \ --lr_decay_ratio ${lr_decay_ratio-0.1} \
--layer_decay_rate ${layer_decay_rate-0.0} \
--text_init_layers ${text_init_layers-18} \
--n_layers ${n_layers-30} \
--margin ${margin-0.3} \
--num_train_steps ${num_train_steps} \ --num_train_steps ${num_train_steps} \
--checkpoints $output_model_path \ --checkpoints $output_model_path \
--save_steps ${SAVE_STEPS} \ --save_steps ${SAVE_STEPS} \
...@@ -53,7 +59,6 @@ python finetune.py --use_cuda "True" \ ...@@ -53,7 +59,6 @@ python finetune.py --use_cuda "True" \
--warmup_steps ${WARMUP_STEPS} \ --warmup_steps ${WARMUP_STEPS} \
--weight_decay ${WEIGHT_DECAY:-0} \ --weight_decay ${WEIGHT_DECAY:-0} \
--max_seq_len ${MAX_LEN} \ --max_seq_len ${MAX_LEN} \
--validation_steps ${VALID_STEPS} \
--skip_steps 10 --skip_steps 10
...@@ -13,6 +13,8 @@ RES_FILE=$8 ...@@ -13,6 +13,8 @@ RES_FILE=$8
source $CONF_FILE source $CONF_FILE
#export LD_LIBRARY_PATH=/home/work/cuda-9.0/lib64:/home/work/cudnn/cudnn_v7/cuda/lib64:$LD_LIBRARY_PATH
#export LD_LIBRARY_PATH=./nccl_2.3.5/lib/:$LD_LIBRARY_PATH
#configure your cuda and cudnn #configure your cuda and cudnn
#configure nccl #configure nccl
......
from __future__ import print_function
import sys
ans_dict = {}
text_ans_dict = {}
filename = './data/flickr/flickr.dev.data'
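# Ground-truth file: the first two tab-separated fields of each line are image_id
# and sent_id; any remaining fields are ignored here.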
with open(filename) as f:
for line in f:
line = line.strip().split('\t')
image_id, sent_id = line[0], line[1]
ans_dict[sent_id.strip(' ')] = image_id.strip(' ')
text_ans_dict.setdefault(image_id.strip(' '), [])
text_ans_dict[image_id.strip(' ')].append(sent_id.strip(' '))
if len(sys.argv) > 1:
res_file = sys.argv[1]
else:
res_file = "./result"
print ('=============== IMAGE RETRIEVAL ==================')
with open(res_file) as f:
r1, r5, r10 = 0, 0, 0
cnt = 0
res_dict = {}
text_res_dict = {}
idx_all = 0.0
for line in f:
line = line.strip().split('\t')
if len(line) != 3:
break
score, image_id, sent_id = float(line[0]), line[1], line[2]
res_dict.setdefault(sent_id, [])
res_dict[sent_id].append((score, image_id))
text_res_dict.setdefault(image_id, [])
text_res_dict[image_id].append((score, sent_id))
if len(res_dict[sent_id]) == 1000:
res_list = res_dict[sent_id]
res_list = sorted(res_list, reverse = True)
ans = ans_dict[sent_id]
image_id_sort = [item[1] for item in res_list]
ans_idx = image_id_sort.index(ans.strip())
if ans_idx < 1:
r1 += 1.0
if ans_idx < 5:
r5 += 1.0
if ans_idx < 10:
r10 += 1.0
idx_all += (ans_idx + 1)
cnt += 1
if cnt % 100 == 0:
print(cnt, round(r1/cnt, 4), round(r5/cnt, 4), round(r10/cnt, 4), round(idx_all/cnt, 4))
print('-----------------------------')
print("instance %d r1:%.4f, r5:%.4f, r10:%.4f, avg_rank:%.4f" % (cnt, r1/cnt, r5/cnt, r10/cnt, idx_all/cnt))
print ('\n=============== TEXT RETRIEVAL ==================')
cnt = 0
r1, r5, r10 = 0, 0, 0
idx_all = 0.0
for image_id in text_res_dict:
res_list = text_res_dict[image_id]
res_list = sorted(res_list, reverse = True)
ans = text_ans_dict[image_id]
text_id_sort = [item[1] for item in res_list]
ans_idx_all = []
for item in ans:
ans_idx_all.append(text_id_sort.index(item.strip()))
ans_idx = min(ans_idx_all)
if ans_idx < 1:
r1 += 1.0
if ans_idx < 5:
r5 += 1.0
if ans_idx < 10:
r10 += 1.0
idx_all += (ans_idx + 1)
cnt += 1
if cnt % 500 == 0:
print(cnt, round(r1/cnt, 4), round(r5/cnt, 4), round(r10/cnt, 4), round(idx_all/cnt, 4))
print('-----------------------------')
print("instance %d r1:%.4f, r5:%.4f, r10:%.4f, avg_rank:%.4f" % (cnt, r1/cnt, r5/cnt, r10/cnt, idx_all/cnt))
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""parameters init function implementations"""
from __future__ import print_function
import os
import six
import numpy as np
import paddle.fluid as fluid
def circle_loss(sp, sn, m, scale_circle):
"""
sp: scores of positive samples, shape [B, L]
sn: scores of negative samples, shape [B, K]
m: relaxation factor (margin) in the circle loss
scale_circle: scale factor in the circle loss
return: circle loss value, shape [1]
"""
op = 1. + m
on = 0. - m
delta_p = 1 - m
delta_n = m
ap = fluid.layers.relu(op - sp)
ap.stop_gradient = True
an = fluid.layers.relu(sn - on)
an.stop_gradient = True
logit_p = ap * (sp - delta_p)
logit_p = -1. * scale_circle * logit_p
logit_p = fluid.layers.cast(x=logit_p, dtype=np.float64)
loss_p = fluid.layers.reduce_sum(fluid.layers.exp(logit_p), dim=1, keep_dim=False)
logit_n = an * (sn - delta_n)
logit_n = scale_circle * logit_n
logit_n = fluid.layers.cast(x=logit_n, dtype=np.float64)
loss_n = fluid.layers.reduce_sum(fluid.layers.exp(logit_n), dim=1, keep_dim=False)
circle_loss = fluid.layers.log(1 + loss_n * loss_p)
circle_loss = fluid.layers.cast(x=circle_loss, dtype=np.float32)
return fluid.layers.mean(circle_loss)
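# Minimal usage sketch (illustrative only; tensor names and hyper-parameter values
# below are assumptions, not the repository's actual settings): given similarity
# scores `pos_scores` of shape [B, L] and `neg_scores` of shape [B, K] computed
# elsewhere in the fluid program, a retrieval loss could be formed as
#     loss = circle_loss(pos_scores, neg_scores, m=0.2, scale_circle=32.0)
# where m and scale_circle are the relaxation margin and scale factor; common
# Circle Loss choices are m around 0.2-0.3 and a scale of a few tens.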