# UNIMO
Code for the ACL 2021 main-conference long paper [UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning](https://arxiv.org/pdf/2012.15409.pdf).

## Abstract

Existing pre-training methods focus on either single-modal or multi-modal tasks, and cannot effectively adapt to the other kind. They can only utilize single-modal data (i.e., text or image) or limited multi-modal data (i.e., image-text pairs). In this work, we propose a UNIfied-MOdal pre-training architecture, namely `UNIMO`, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large-scale free text corpora and image collections are utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information into a unified semantic space over a corpus of image-text pairs augmented with related images and texts. With the help of rich non-paired single-modal data, our model is able to learn more generalizable representations by allowing textual knowledge and visual knowledge to enhance each other in the unified semantic space. The experimental results show that `UNIMO` greatly improves the performance of several single-modal and multi-modal downstream tasks.

![UNIMO](images/framework.png#pic_center)

## Performance

Results on multi-modal understanding and generation tasks:

![UNIMO](images/multiple.png#pic_center)

Results on single-modal understanding and generation tasks:

![UNIMO](images/single.png#pic_center)

---

## TODOs

- [ ] Add VQA tasks

## Dependencies

- python 3.7.4
- paddlepaddle-gpu==1.8.4.post107
- pyrouge==0.1.3
- regex==2020.7.14

## Pre-trained Models

`UNIMO` adopts large-scale text corpora, image collections and image-text aligned datasets as its pre-training data. We provide the following pre-trained `UNIMO` models:

- [UNIMO base](https://unimo.bj.bcebos.com/model/unimo_base_en.tar.gz) (lowercased | 12 layers)
- [UNIMO-mnli base](https://unimo.bj.bcebos.com/model/unimo_mnli_base_en.tar.gz) (lowercased | 12 layers)
- [UNIMO large](https://unimo.bj.bcebos.com/model/unimo_large_en.tar.gz) (lowercased | 24 layers)
- [UNIMO-mnli large](https://unimo.bj.bcebos.com/model/unimo_mnli_large_en.tar.gz) (lowercased | 24 layers)

```
MODEL_SIZE=base  # base | mnli_base | large | mnli_large
cd /path/to/model_files
wget --no-check-certificate -q https://unimo.bj.bcebos.com/model/unimo_${MODEL_SIZE}_en.tar.gz
tar -zxf unimo_${MODEL_SIZE}_en.tar.gz
```

## Experiments

Our fine-tuning experiments are carried out on V100 GPUs. The table below lists the start command and basic settings for each downstream task:
| Task Type | Dataset | Pre-trained Model | Start Command | V100 GPU Cards | Running Time |
|---|---|---|---|---|---|
| Text Understanding | SST-2 | UNIMO base | `sh ./script/classification/SST-2/run.sh` | 8 | 9h |
| Text Understanding | SST-2 | UNIMO large | `sh ./script/classification/SST-2_large/run.sh` | 8 | 14h |
| Text Understanding | CoLA | UNIMO base | `sh ./script/classification/CoLA/run.sh` | 4 | 2h |
| Text Understanding | CoLA | UNIMO large | `sh ./script/classification/CoLA_large/run.sh` | 4 | 4h |
| Text Understanding | MNLI-AX | UNIMO base | `sh ./script/classification/MNLI-AX/run.sh` | 8 | 1d20h |
| Text Understanding | MNLI-AX | UNIMO large | `sh ./script/classification/MNLI-AX_large/run.sh` | 8 | 2d13h |
| Text Understanding | STS-B | UNIMO-mnli base | `sh ./script/regression/STS-B/run.sh` | 8 | 2h |
| Text Understanding | STS-B | UNIMO-mnli large | `sh ./script/regression/STS-B_large/run.sh` | 8 | 4h |
| Text Generation | CNN/DailyMail | UNIMO base | `sh ./script/seq2seq/cnndm/run.sh` | 4 | 1d8h |
| Text Generation | CNN/DailyMail | UNIMO large | `sh ./script/seq2seq/cnndm_large/run.sh` | 4 | 3d18h |
| Text Generation | Gigaword | UNIMO base | `sh ./script/seq2seq/gigaword/run.sh` | 4 | 1d3h |
| Text Generation | Gigaword | UNIMO large | `sh ./script/seq2seq/gigaword_large/run.sh` | 4 | 2d3h |
| Text Generation | CoQA | UNIMO base | `sh ./script/seq2seq/coqa/run.sh` | 4 | 7h |
| Text Generation | CoQA | UNIMO large | `sh ./script/seq2seq/coqa_large/run.sh` | 4 | 22h |
| Text Generation | Squad_QG | UNIMO base | `sh ./script/seq2seq/squad_qg/run.sh` | 4 | 4h |
| Text Generation | Squad_QG | UNIMO large | `sh ./script/seq2seq/squad_qg_large/run.sh` | 4 | 8h |
| Multi-Modal Understanding | Flickr30k | UNIMO base | `sh ./script/retrieval/Flickr30k/run.sh` | 16 | 3d |
| Multi-Modal Understanding | Flickr30k | UNIMO large | `sh ./script/retrieval/Flickr30k_large/run.sh` | 16 | 3d |
| Multi-Modal Understanding | SNLI-VE | UNIMO base | `sh ./script/visual_entailment/SNLI-VE/run.sh` | 16 | 16h |
| Multi-Modal Understanding | SNLI-VE | UNIMO large | `sh ./script/visual_entailment/SNLI-VE_large/run.sh` | 16 | 2d |
| Multi-Modal Understanding | VQA | UNIMO base | - | - | - |
| Multi-Modal Understanding | VQA | UNIMO large | - | - | - |
| Multi-Modal Generation | COCO Caption | UNIMO base | `sh ./script/img2txt/coco/run.sh` | 16 | 3d |
| Multi-Modal Generation | COCO Caption | UNIMO large | `sh ./script/img2txt/coco_large/run.sh` | 16 | 4d |
---

## Text Understanding Tasks

### (1) Sentiment Classification

#### Download SST-2 dataset:

```
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/SST-2.tar.gz
tar -zxf SST-2.tar.gz
```

#### Run the following commands to train and evaluate on the SST-2 dataset:

For the base model:

```
bash ./script/classification/SST-2/run.sh
```

For the large model:

```
bash ./script/classification/SST-2_large/run.sh
```

#### Evaluation Results:
| Model | Acc |
|---|---|
| UNIMO-base | 95.1 |
| UNIMO-large | 96.8 |
### (2) Natural Language Inference

#### Download MNLI-AX dataset:

```
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/MNLI-AX.tar.gz
tar -zxf MNLI-AX.tar.gz
```

#### Run the following commands to train and evaluate on the MNLI-AX dataset:

For the base model:

```
bash ./script/classification/MNLI-AX/run.sh
```

For the large model:

```
bash ./script/classification/MNLI-AX_large/run.sh
```

#### Evaluation Results:
| Model | Acc (m/mm) |
|---|---|
| UNIMO-base | 86.8/86.7 |
| UNIMO-large | 89.8/89.5 |
### (3) Similarity Tasks

#### Download STS-B dataset:

```
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/STS-B.tar.gz
tar -zxf STS-B.tar.gz
```

#### Run the following commands to train and evaluate on the STS-B dataset:

For the base model:

```
bash ./script/regression/STS-B/run.sh
```

For the large model:

```
bash ./script/regression/STS-B_large/run.sh
```

#### Evaluation Results:
| Model | Pearson correlation |
|---|---|
| UNIMO-base | 91.0 |
| UNIMO-large | 92.6 |
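For reference, the Pearson correlation above measures linear agreement between the model's predicted similarity scores and the gold scores. A minimal pure-Python sketch of the metric (illustrative only; the actual evaluation is handled inside the run scripts):

```
import math

def pearson(xs, ys):
    """Pearson correlation between predicted scores xs and gold scores ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy usage with hypothetical scores on a 0-5 similarity scale:
print(pearson([4.5, 2.0, 3.8], [5.0, 1.5, 4.0]))
```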
### (4) Linguistic Acceptability Judgments

#### Download CoLA dataset:

```
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/CoLA.tar.gz
tar -zxf CoLA.tar.gz
```

#### Run the following commands to train and evaluate on the CoLA dataset:

For the base model:

```
bash ./script/classification/CoLA/run.sh
```

For the large model:

```
bash ./script/classification/CoLA_large/run.sh
```

#### Evaluation Results:
| Model | Matthews correlation |
|---|---|
| UNIMO-base | 65.4 |
| UNIMO-large | 68.5 |
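The Matthews correlation above summarizes binary classification quality from all four confusion-matrix counts, which makes it robust to CoLA's class imbalance. A minimal sketch of the metric (illustrative only; not the repo's evaluation code):

```
import math

def matthews_corrcoef(preds, labels):
    """Matthews correlation coefficient for binary 0/1 predictions."""
    tp = sum(p == 1 and l == 1 for p, l in zip(preds, labels))
    tn = sum(p == 0 and l == 0 for p, l in zip(preds, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(preds, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```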
## Text Generation Tasks

### (1) Document Summarization

#### Download CNN/DailyMail dataset:

```
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/cnndm.tar.gz
tar -zxf cnndm.tar.gz
```

#### Download evaluation script:

```
cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/cnndm.tar.gz
tar -zxf cnndm.tar.gz
```

#### Run the following commands to train and evaluate on the CNN/DailyMail dataset:

For the base model:

```
bash ./script/seq2seq/cnndm/run.sh
```

For the large model:

```
bash ./script/seq2seq/cnndm_large/run.sh
```

#### Evaluation Results:
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| UNIMO-base | 42.42 | 20.12 | 39.61 |
| UNIMO-large | 43.51 | 20.65 | 40.63 |
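The ROUGE scores above can also be checked offline with the `pyrouge` package listed under Dependencies (it wraps the ROUGE-1.5.5 perl toolkit, which must be installed separately). A minimal sketch, assuming a hypothetical layout with one decoded summary and one reference per file:

```
from pyrouge import Rouge155

r = Rouge155()
# Hypothetical directories: decoded/0.decoded, ... and reference/0.reference, ...
r.system_dir = 'decoded'
r.model_dir = 'reference'
r.system_filename_pattern = r'(\d+)\.decoded'
r.model_filename_pattern = '#ID#.reference'

scores = r.output_to_dict(r.convert_and_evaluate())
print(scores['rouge_1_f_score'], scores['rouge_2_f_score'], scores['rouge_l_f_score'])
```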
### (2) Sentence Compression

#### Download Gigaword dataset:

```
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/gigaword.tar.gz
tar -zxf gigaword.tar.gz
```

#### Download evaluation script:

```
cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/gigaword.tar.gz
tar -zxf gigaword.tar.gz
```

#### Run the following commands to train and evaluate on the Gigaword dataset:

For the base model:

```
bash ./script/seq2seq/gigaword/run.sh
```

For the large model:

```
bash ./script/seq2seq/gigaword_large/run.sh
```

#### Evaluation Results:
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| UNIMO-base | 38.80 | 19.99 | 36.27 |
| UNIMO-large | 39.71 | 20.37 | 36.88 |
### (3) Question Generation

#### Download Squad dataset:

```
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/squad_qg.tar.gz
tar -zxf squad_qg.tar.gz
```

#### Download evaluation script:

```
cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/squad_qg.tar.gz
tar -zxf squad_qg.tar.gz
```

#### Run the following commands to train and evaluate on the Squad dataset:

For the base model:

```
bash ./script/seq2seq/squad_qg/run.sh
```

For the large model:

```
bash ./script/seq2seq/squad_qg_large/run.sh
```

#### Evaluation Results:
| Model | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|
| UNIMO-base | 22.78 | 25.24 | 51.34 |
| UNIMO-large | 24.59 | 26.39 | 52.47 |
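BLEU-4 rewards up-to-4-gram overlap with the reference question, scaled by a brevity penalty for short outputs. The official numbers come from the evaluation script downloaded above; the following is only a simplified single-reference sketch of the metric:

```
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Simplified sentence-level BLEU-4 against a single tokenized reference."""
    log_prec = 0.0
    for n in range(1, 5):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        overlap = sum((cand & ref).values())             # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        log_prec += math.log(max(overlap, 1e-9) / total) / 4
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(log_prec)
```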
### (4) Conversational Question Answering

#### Download CoQA dataset:

```
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/coqa.tar.gz
tar -zxf coqa.tar.gz
```

#### Download evaluation script:

```
cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/coqa.tar.gz
tar -zxf coqa.tar.gz
```

#### Run the following commands to train and evaluate on the CoQA dataset:

For the base model:

```
bash ./script/seq2seq/coqa/run.sh
```

For the large model:

```
bash ./script/seq2seq/coqa_large/run.sh
```

#### Evaluation Results:
| Model | Acc |
|---|---|
| UNIMO-base | 80.2 |
| UNIMO-large | 84.9 |
## Multi-Modal Understanding Tasks

### (1) Image-Text Retrieval

#### Download Flickr30k dataset:

##### Note: Visual features are extracted by [bottom-up-attention](https://github.com/peteanderson80/bottom-up-attention)

```
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/Flickr30k.tar.gz  # occupies about 37G of disk space
tar -zxf Flickr30k.tar.gz
```

#### Run the following commands to train and evaluate on the Flickr30k dataset:

For the base model:

```
bash ./script/retrieval/Flickr30k/run.sh
```

For the large model:

```
bash ./script/retrieval/Flickr30k_large/run.sh
```

#### Evaluation Results:

Results of the Image Retrieval task on the Flickr30k dataset:
| Model | R@1 | R@5 | R@10 |
|---|---|---|---|
| UNIMO-base | 74.66 | 93.40 | 96.08 |
| UNIMO-large | 78.04 | 94.24 | 97.12 |
Results of the Text Retrieval task on the Flickr30k dataset:

| Model | R@1 | R@5 | R@10 |
|---|---|---|---|
| UNIMO-base | 89.70 | 98.40 | 99.10 |
| UNIMO-large | 89.40 | 98.90 | 99.80 |
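In both tables, R@K is the percentage of queries whose ground-truth match appears among the top K retrieved items. A minimal sketch over hypothetical ranked result lists:

```
def recall_at_k(ranked_lists, gold_ids, k):
    """Percentage of queries whose gold item is among the top-k results."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_lists, gold_ids))
    return 100.0 * hits / len(gold_ids)

# Toy usage: two text queries with gold images 'img3' and 'img7'.
print(recall_at_k([['img3', 'img1'], ['img2', 'img9']], ['img3', 'img7'], k=1))  # 50.0
```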
### (2) Visual Entailment

#### Download SNLI-VE dataset:

##### Note: Visual features are extracted by [bottom-up-attention](https://github.com/peteanderson80/bottom-up-attention)

```
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/SNLI-VE.tar.gz
tar -zxf SNLI-VE.tar.gz
```

#### Run the following commands to train and evaluate on the SNLI-VE dataset:

For the base model:

```
bash ./script/visual_entailment/SNLI-VE/run.sh
```

For the large model:

```
bash ./script/visual_entailment/SNLI-VE_large/run.sh
```

#### Evaluation Results:

Results of the Visual Entailment task on the SNLI-VE dataset:
| Model | dev | test |
|---|---|---|
| UNIMO-base | 80.00 | 79.10 |
| UNIMO-large | 81.11 | 80.63 |
## Multi-Modal Generation Tasks

### (1) Image Caption Generation

#### Download COCO Caption dataset:

##### Note: Visual features are extracted by [bottom-up-attention](https://github.com/peteanderson80/bottom-up-attention)

```
cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/coco.tar.gz
tar -zxf coco.tar.gz
```

#### Download evaluation script:

```
cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/coco.tar.gz
tar -zxf coco.tar.gz
```

#### Run the following commands to train and evaluate on the COCO Caption dataset:

For the base model:

```
bash ./script/img2txt/coco/run.sh
```

For the large model:

```
bash ./script/img2txt/coco_large/run.sh
```

#### Evaluation Results:
| Model | BLEU-4 | CIDEr |
|---|---|---|
| UNIMO-base | 38.8 | 124.4 |
| UNIMO-large | 39.6 | 127.7 |
---

## Citation

If you find our paper and code useful, please cite the following paper:

```
@article{li2020unimo,
  title={UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning},
  author={Li, Wei and Gao, Can and Niu, Guocheng and Xiao, Xinyan and Liu, Hao and Liu, Jiachen and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2012.15409},
  year={2020}
}
```

## Contact Information

For help or issues using `UNIMO`, please submit a GitHub issue. For personal communication related to `UNIMO`, please contact Wei Li (liwei85@baidu.com), Guocheng Niu (niuguocheng@baidu.com), or Can Gao (gaocan01@baidu.com).