Simplified [Chinese](./readme.md) | English
# ERNIE-ViL 2.0
In recent years, cross-modal models pre-trained on large-scale data have achieved remarkable results. Dual-tower pre-training frameworks based on **contrastive learning**, such as [CLIP](https://arxiv.org/pdf/2103.00020.pdf) and [ALIGN](https://arxiv.org/pdf/2102.05918.pdf), can make full use of large-scale image-text aligned data, bring clear improvements on tasks such as cross-modal retrieval, and have received wide attention for their high computational efficiency. However, traditional vision-language pre-training relies on single-view contrastive learning and cannot learn the correlations both across and within modalities.
**ERNIE-ViL 2.0** proposes a pre-training framework based on multi-view contrastive learning. By building rich visual/textual views, it simultaneously learns multiple correlations across and within modalities, obtains more robust cross-modal alignment, and achieves industry-leading results on tasks such as cross-modal retrieval.
# ERNIE-ViL 2.0: Multi-View Contrastive Learning for Image-Text Pre-training
>[_**ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training**_](https://arxiv.org/pdf/2209.15270.pdf)
>
>Bin Shan, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
>
>
## Methods
ERNIE-ViL 2.0's multi-view contrastive learning includes:
- Cross-modal contrastive learning: image-text
- Intra-modal contrastive learning: image-image, text-text
![ERNIE-ViL2.0](./packages/src/framework.png)
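
As a rough illustration of the idea only (this is not the repository's actual training code; the batch size, embedding dimension, temperature, and view names below are made up), each pair of views can be aligned with a symmetric contrastive loss and the per-view losses summed:

```python
# Minimal sketch of multi-view contrastive learning with a two-tower model:
# every pair of views is aligned with a symmetric InfoNCE-style loss.
import paddle
import paddle.nn.functional as F


def info_nce(a, b, temperature=0.05):
    """Symmetric contrastive loss between two batches of paired embeddings."""
    a = F.normalize(a, axis=-1)
    b = F.normalize(b, axis=-1)
    logits = paddle.matmul(a, b, transpose_y=True) / temperature  # [N, N]
    labels = paddle.arange(a.shape[0])  # the i-th row of a matches the i-th row of b
    loss_a2b = F.cross_entropy(logits, labels)
    loss_b2a = F.cross_entropy(paddle.transpose(logits, [1, 0]), labels)
    return (loss_a2b + loss_b2a) / 2


# Made-up embeddings for a batch of 8 pairs; in the real model these would come
# from the vision tower (ViT) and the text tower (ERNIE 3.0).
batch_size, dim = 8, 256
image_emb = paddle.randn([batch_size, dim])      # visual view
image_aug_emb = paddle.randn([batch_size, dim])  # second visual view (e.g. augmented image)
text_emb = paddle.randn([batch_size, dim])       # textual view (e.g. caption)
text_aug_emb = paddle.randn([batch_size, dim])   # second textual view

loss = (
    info_nce(image_emb, text_emb)          # cross-modal: image-text
    + info_nce(image_emb, image_aug_emb)   # intra-modal: image-image
    + info_nce(text_emb, text_aug_emb)     # intra-modal: text-text
)
print(float(loss))
```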
## Cross-modal retrieval performance (Zero-Shot)
The following are the zero-shot results of the Chinese models on COCO-CN and AIC-ICC; see the paper for more details.
* **ERNIE-ViL 2.0 base (ViT)**: ViT-B-16 (visual backbone) + ERNIE 3.0 base (text backbone)

|Dataset | T2I R@1 | I2T R@1 | meanRecall |
|------------|-------|--------|----|
| [COCO-CN]( https://arxiv.org/pdf/1805.08661.pdf ) | 66.00 | 65.90 | 84.28 |
| [AIC-ICC]( https://arxiv.org/pdf/1711.06475.pdf ) | 17.93 | 30.41 | 38.57 |

* **ERNIE-ViL 2.0 large (ViT)**: ViT-L-14 (visual backbone) + ERNIE 3.0 large (text backbone)

|Dataset | T2I R@1 | I2T R@1 | meanRecall |
|------------|-------|--------|----|
| [COCO-CN]( https://arxiv.org/pdf/1805.08661.pdf ) | 70.30 | 68.80| 86.32 |
| [AIC-ICC]( https://arxiv.org/pdf/1711.06475.pdf ) | 20.17 | 32.29 | 41.08 |
* Here, the AIC-ICC results are evaluated on the first 10,000 lines of the validation set.
## Examples
Here, the open-source ERNIE-ViL 2.0 base (ViT) Chinese model is used as an example to perform the zero-shot image-text retrieval task on COCO-CN:
* Model Download:
[ERNIE-ViL 2.0 Base(ViT)]( http://bj.bcebos.com/wenxin-models/ERNIE_VIL2_BASE_ViT.pdparams)
* Data preparation: a [COCO-CN](https://github.com/li-xirong/coco-cn) test set is built in and can be downloaded [here](http://bj.bcebos.com/wenxin-models/test.coco_cn.data). The data format (UTF-8 encoded by default) is three columns separated by \t: the first column is the text, the second column is the image ID in COCO, and the third column is the Base64-encoded image (a minimal parsing sketch is shown after the evaluation command below).
* First, set up the environment: install [paddle>=2.1.3](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.HTML) and the dependencies in [requirements.txt](requirements.txt).
* Then, configure ./packages/configs/ernie_vil_base.yaml as needed. For details, please refer to the comments in the configuration file (including the input/output paths and the model parameter path).
* Finally, run the corresponding inference command to obtain the cross-modal image and text embeddings. The location of the output results is defined in ./packages/configs/ernie_vil_base.yaml; after that, run the following command to evaluate the retrieval results:
```shell
# Usage: python $0 output-embedding-path
$ python eval_retri.py test_out/cross_modal_embeddings.out
```
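
For reference, the snippet below is a minimal sketch of how one line of the built-in test file mentioned above could be parsed (text, COCO image ID, Base64-encoded image); the local file name and the use of Pillow are assumptions for illustration, not part of the repository's code.

```python
# Minimal sketch of reading the tab-separated test file (assumed local file
# name; this is not the repository's data loader and it requires Pillow).
import base64
import io

from PIL import Image


def parse_line(line):
    """Each line holds the text, the COCO image ID, and a Base64 image, separated by tabs."""
    text, image_id, image_b64 = line.rstrip("\n").split("\t")
    image = Image.open(io.BytesIO(base64.b64decode(image_b64))).convert("RGB")
    return text, image_id, image


with open("test.coco_cn.data", encoding="utf-8") as f:
    for line in f:
        text, image_id, image = parse_line(line)
        print(image_id, image.size, text[:30])
        break  # only inspect the first sample
```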
The following are the results of the ERNIE-ViL 2.0 base model on COCO-CN; see the paper for detailed results.

| Name | R@1 | R@5 | R@10 | meanRecall |
|------------|-------|-------|--------|--------------|
| Text2Image | 66.00 | 90.00 | 96.10 | 84.03 |
| Image2Text | 65.90 | 91.40 | 96.30 | 84.53 |
| MeanRecall | 65.95 | 90.70 | 96.20 | 84.28 |
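
Here, R@K is the fraction of queries whose matching item appears among the top-K retrieved candidates, and meanRecall averages R@1/R@5/R@10 (the last row further averages the two retrieval directions). As a rough illustration only (this is not the actual eval_retri.py, and it assumes one matching image per text), such an evaluation can be sketched as:

```python
# Rough sketch of recall@K evaluation from paired text/image embeddings.
import numpy as np


def recall_at_k(query_emb, gallery_emb, ks=(1, 5, 10)):
    """query_emb[i] is assumed to match gallery_emb[i]."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                               # cosine similarities [num_q, num_g]
    ranking = np.argsort(-sims, axis=1)          # gallery indices, best match first
    gt_rank = np.argmax(ranking == np.arange(len(q))[:, None], axis=1)
    return {f"R@{k}": 100.0 * np.mean(gt_rank < k) for k in ks}


# Made-up embeddings; in practice they would be parsed from
# test_out/cross_modal_embeddings.out.
text_emb = np.random.randn(100, 256)
image_emb = np.random.randn(100, 256)

t2i = recall_at_k(text_emb, image_emb)   # Text2Image retrieval
i2t = recall_at_k(image_emb, text_emb)   # Image2Text retrieval
mean_recall = (sum(t2i.values()) + sum(i2t.values())) / 6
print(t2i, i2t, round(mean_recall, 2))
```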