English | [简体中文](./README_zh.md)

## _ERNIE-ViL_: Knowledge Enhanced Vision-Language Representations Through Scene Graph

- [Framework](#framework)
- [Pre-trained models](#pre-trained-models)
- [Downstream tasks](#downstream-tasks)
  * [VCR](#VCR)
- [Usage](#usage)
  * [Install PaddlePaddle](#install-paddlepaddle)
  * [Fine-tuning on ERNIE-ViL](#fine-tuning-on-ernie-vil)
  * [Inference](#inference)
- [Citation](#citation)

For a technical description of the algorithm, please see our paper:

>[_**ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph**_](https://arxiv.org/abs/2006.16934)
>
>Fei Yu\*, Jiji Tang\*, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\*: equal contribution)
>
>Preprint June 2020
>

![ERNIE-ViL](https://img.shields.io/badge/Pretraining-vision_and_language_joint_representions-green) ![VQA](https://img.shields.io/badge/VQA-Visual_Question_Answering-yellow) ![VCR](https://img.shields.io/badge/VCR-Visual_Commensense_Reasoning-blue) ![RefCOCO+](https://img.shields.io/badge/RefCOCO+-Region_to_Phrase_Grounding-green) ![IRTR](https://img.shields.io/badge/IR_&TR-Image_Retrieval&_Text_Retrieval-yellowgreen)

**[ERNIE-ViL](https://arxiv.org/abs/2006.16934) provides knowledge-enhanced joint representations for vision-language tasks** and is the first work to **introduce structured knowledge to enhance vision-language pre-training**. Utilizing structured knowledge obtained from scene graphs, ERNIE-ViL constructs three **Scene Graph Prediction tasks**, i.e., **Object Prediction**, **Attribute Prediction** and **Relationship Prediction**. Thus, ERNIE-ViL can learn better joint vision-language representations that characterize the alignments of detailed semantics across vision and language.

## Framework

Based on the scene graph parsed from the text by a Scene Graph Parser, we construct the Object Prediction, Attribute Prediction and Relationship Prediction tasks:

- **Object Prediction:** We randomly select a set of the objects in the scene graph, then mask and predict the corresponding words in the sentence.
- **Attribute Prediction:** For the object-attribute pairs in the scene graph, we randomly select a part of them, then mask and predict the words related to the attribute nodes in the sentence.
- **Relationship Prediction:** For the object-relationship-object triplets in the scene graph, we randomly select a part of the relationship nodes, then mask and predict them.

![ernie_vil_struct](.meta/ernie_vil_struct.png)

Model Architecture of ERNIE-ViL
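To make the three masking strategies concrete, below is a minimal Python sketch. It is not the repository's actual data pipeline: the `build_sgp_targets` helper, the `scene_graph` layout and the selection rate are illustrative assumptions. It selects a subset of object, attribute and relationship nodes and masks their words in the caption, yielding the positions the model is asked to predict.

```python
import random

MASK_TOKEN = "[MASK]"

def build_sgp_targets(tokens, scene_graph, select_rate=0.3, seed=None):
    """Turn scene-graph nodes into masked tokens plus prediction labels.

    `scene_graph` maps a node type ("objects", "attributes", "relationships")
    to a list of nodes, each node given as the indices of its words in `tokens`.
    Returns the masked token list and a {position: original word} dict.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = {}
    for node_type in ("objects", "attributes", "relationships"):
        for positions in scene_graph.get(node_type, []):
            if rng.random() < select_rate:  # randomly select a part of the nodes
                for pos in positions:       # mask every word of the selected node
                    labels[pos] = masked[pos]
                    masked[pos] = MASK_TOKEN
    return masked, labels

# Caption: "a little girl is eating white rice"
tokens = ["a", "little", "girl", "is", "eating", "white", "rice"]
scene_graph = {
    "objects": [[2], [6]],      # girl, rice                -> Object Prediction
    "attributes": [[1], [5]],   # little(girl), white(rice) -> Attribute Prediction
    "relationships": [[4]],     # eating(girl, rice)        -> Relationship Prediction
}
masked_tokens, labels = build_sgp_targets(tokens, scene_graph, seed=0)
print(masked_tokens)
print(labels)
```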
## Pre-trained Models

ERNIE-ViL adopts large-scale image-text aligned datasets as the pre-training data. We provide ERNIE-ViL models in two scale settings, pre-trained on [**Conceptual Captions**](https://www.aclweb.org/anthology/P18-1238.pdf) and [**SBU Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio).

- [**ERNIE-ViL _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-en.1.tar.gz) (_lowercased | 12-text-stream-layer, 6-visual-stream-layer_)
- [**ERNIE-ViL _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)

## Downstream tasks

We fine-tune ERNIE-ViL on five vision-language downstream tasks, i.e., Visual Commonsense Reasoning ([**VCR**](https://openaccess.thecvf.com/content_CVPR_2019/papers/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.pdf)), Visual Question Answering ([**VQA**](https://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf)), Cross-modal Image Retrieval ([**IR**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166)), Cross-modal Text Retrieval ([**TR**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166)) and Region-to-Phrase Grounding ([**RefCOCO+**](https://www.aclweb.org/anthology/D14-1086.pdf)).

_Code and pre-trained models for the VCR task are public now; those for more downstream tasks are planned to be released._

### VCR

* Datasets
  * The training, validation and test data of the VCR task are provided by the [**VCR website**](https://visualcommonsense.com/download/).
  * The organization of visual features is modified from [**ViLBERT**](https://github.com/jiasenlu/vilbert_beta); we directly use its data, which can be downloaded [here](https://github.com/jiasenlu/vilbert_beta/tree/master/data).
  * Put all downloaded files under the directory `data/vcr`.
* Task pre-training: We perform task pre-training (also known as task-specific pre-training) on the VCR task. The trained models are:
  * [**ERNIE-ViL-VCR-task-pretrain _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-VCR-task-pre-en.1.tar.gz)
  * [**ERNIE-ViL-VCR-task-pretrain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-VCR-task-pre-en.1.tar.gz)
* Performance: Results of ERNIE-ViL on the VCR task, compared with the previous state-of-the-art pre-trained model ([**VILLA**](https://arxiv.org/pdf/2006.06195.pdf)).

| Models                            | Q->A        | QA->R       | Q->AR       |
| :-------------------------------- | :---------: | :---------: | :---------: |
| VILLA (task-pretrain) _base_      | 75.54(76.4) | 78.78(79.1) | 59.75(60.6) |
| ERNIE-ViL (task-pretrain) _base_  | 76.37(77.0) | 79.65(80.3) | 61.24(62.1) |
| VILLA (task-pretrain) _large_     | 78.45(78.9) | 82.57(82.8) | 65.18(65.7) |
| ERNIE-ViL (task-pretrain) _large_ | 78.52(79.2) | 83.37(83.5) | 65.81(66.3) |

_Numbers outside and inside parentheses are the dev and test performance on VCR, respectively. Test results are obtained from the [**VCR leaderboard**](https://visualcommonsense.com/leaderboard/)._

## Usage

### Install PaddlePaddle

This code has been tested with Paddle Fluid 1.8 and Python 2.7. Other dependencies of ERNIE-ViL are listed in `requirements.txt`; you can install them with

```script
pip install -r requirements.txt
```
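Before moving on, you can optionally verify that PaddlePaddle was installed with GPU support. This is a minimal sanity-check sketch assuming a standard Paddle Fluid 1.8 installation; adapt it to your environment.

```python
# Optional check of the PaddlePaddle installation and GPU visibility.
from __future__ import print_function  # keeps the snippet Python 2.7 compatible

import paddle
import paddle.fluid as fluid

print("Paddle version:", paddle.__version__)
print("Compiled with CUDA:", fluid.is_compiled_with_cuda())
if fluid.is_compiled_with_cuda():
    # Number of CUDA devices visible to Paddle (respects CUDA_VISIBLE_DEVICES).
    print("Visible GPUs:", len(fluid.cuda_places()))
```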
### Fine-tuning on ERNIE-ViL

Please update `LD_LIBRARY_PATH` with the paths of CUDA, cuDNN and NCCL2 before fine-tuning. You can easily run fine-tuning through the configuration files. For example, you can fine-tune the ERNIE-ViL model on the VCR task by

```script
sh run_finetuning.sh vcr conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $pretrain_models
```

The files needed for fine-tuning can be found in the download links given above, including the vocabulary dictionary, configuration files and pre-trained parameters. Note that our fine-tuning experiments on VCR are carried out on 4 NVIDIA V100 (32GB) GPUs. If your GPU memory is not enough, you can reduce the batch size in the corresponding configuration file, e.g., `conf/vcr/model_conf_vcr`.

### Inference

You can use the following commands to run inference with fine-tuned models. For example, you can run inference on the VCR sub-tasks as follows:

**Task Q->A**

```script
sh run_inference.sh vcr qa $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```

**Task QA->R**

```script
sh run_inference.sh vcr qar $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
```

## Citation

You can cite the paper as below:

```
@article{yu2020ernie,
  title={ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph},
  author={Yu, Fei and Tang, Jiji and Yin, Weichong and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2006.16934},
  year={2020}
}
```