In recent years, cross-modal models based on large-scale pre-training have achieved remarkable results. Dual-tower pre-training frameworks based on **contrastive learning** can make full use of large-scale image-text alignment data and deliver substantial improvements on tasks such as cross-modal retrieval; they have also received wide attention for their high computational efficiency, with [CLIP](https://arxiv.org/pdf/2103.00020.pdf) and [ALIGN](https://arxiv.org/pdf/2102.05918.pdf) as representative examples. However, traditional vision-language pre-training relies on single-view contrastive learning and cannot learn the correlations both across and within modalities.
**ERNIE-ViL 2.0** proposes a pre-training framework based on multi-view contrastive learning. By constructing diverse visual and textual views, it simultaneously learns multiple correlations across and within modalities, yielding more robust cross-modal alignment and achieving industry-leading results on tasks such as cross-modal retrieval.
# ERNIE-ViL 2.0: Multi-View Contrastive Learning for Image-Text Pre-training
>[_**ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training**_](https://arxiv.org/pdf/2209.15270.pdf)
* Data preparation: we provide a built-in [COCO-CN](https://github.com/li-xirong/coco-cn) test set ([download](http://bj.bcebos.com/wenxin-models/test.coco_cn.data)). The data format (UTF-8 encoding by default) is three columns separated by `\t`: the first column is the text, the second column is the COCO image ID, and the third column is the Base64-encoded image.
* First, set up the environment: install [paddle>=2.1.3](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.HTML) and the dependencies in [requirements.txt](requirements.txt).
* Then, configure `./packages/configs/ernie_vil_base.yaml` as needed; for details, refer to the comments in the configuration file (including input/output paths and the model parameter path).
* Finally, run the following command to obtain the cross-modal image and text embeddings:
...
...
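Each line of the prepared data file described above can be parsed with a minimal sketch like the following (the function name and sample payload are illustrative, not part of the ERNIE-ViL 2.0 codebase):

```python
import base64

def parse_line(line):
    """Split one UTF-8 line of the COCO-CN-style test file into its three
    tab-separated fields: caption text, COCO image ID, Base64 image payload."""
    text, image_id, image_b64 = line.rstrip("\n").split("\t")
    image_bytes = base64.b64decode(image_b64)  # raw image bytes (e.g. JPEG/PNG)
    return text, image_id, image_bytes

# Illustrative line: real files carry a full Base64-encoded image in column 3.
sample = "a dog on the grass\t12345\t" + base64.b64encode(b"\x89PNG...").decode("ascii")
text, image_id, image_bytes = parse_line(sample)
```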
The location of the output results is defined in `/packages/configs/ernie_vil_base.yaml`.
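Once the image and text embeddings have been exported, cross-modal retrieval in the dual-tower setting reduces to ranking by similarity. A minimal sketch with NumPy (the embeddings here are random stand-ins, not actual model outputs):

```python
import numpy as np

def retrieve(text_emb, image_embs, top_k=3):
    """Rank images for one text query by cosine similarity,
    as in dual-tower (two-tower) cross-modal retrieval."""
    t = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ t  # cosine similarity of each image to the query
    return np.argsort(-scores)[:top_k]

rng = np.random.default_rng(0)
image_embs = rng.normal(size=(10, 8))               # 10 fake image embeddings
text_emb = image_embs[4] + 0.01 * rng.normal(size=8)  # query near image 4
ranking = retrieve(text_emb, image_embs)
```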