Simplified Chinese | [English](./readme_en.md)
# ERNIE-ViL 2.0
Cross-modal pretraining is an important research direction in artificial intelligence. Enabling machines to understand and reason like humans requires integrating multimodal information such as language, speech, and vision.
In recent years, significant progress has been made in single-modal semantic understanding for vision, language, and speech. However, most real-world AI scenarios involve information from multiple modalities at the same time. For example, an ideal AI assistant needs to interact with humans based on multimodal signals such as language, voice, and actions, which requires machines to have multimodal semantic understanding capabilities.
Cross-modal pre-training models based on cross-encoders (such as ViLBERT and ERNIE-ViL) have achieved strong results on many cross-modal tasks, with especially large gains on complex tasks such as visual commonsense reasoning. However, the cross-modal attention computed between modalities is expensive, which makes such models hard to deploy in online systems such as large-scale cross-modal retrieval. Recently, dual-tower (dual-encoder) pre-training frameworks based on contrastive learning, such as [CLIP](https://arxiv.org/pdf/2103.00020.pdf) and [ALIGN](https://arxiv.org/pdf/2102.05918.pdf), have been able to make full use of large-scale image-text alignment data; thanks to their high computational efficiency, they have shown large improvements on tasks such as cross-modal retrieval and have attracted wide attention.
Traditional vision-language pre-training is based on single-view contrastive learning and cannot capture correlations both across and within modalities. ERNIE-ViL 2.0 proposes a pre-training framework based on multi-view contrastive learning: by constructing rich visual and textual views, it learns multiple cross-modal and intra-modal correlations simultaneously, yielding more robust cross-modal alignment and achieving state-of-the-art results on tasks such as cross-modal retrieval.
## Method
ERNIE-ViL 2.0's multi-view contrastive learning includes (a schematic sketch of the combined loss follows the figure below):
- Cross-modal contrastive learning: image-caption, image-objects
- Intra-modal contrastive learning: image-image, text-text
![ERNIE-ViL2.0](./packages/src/framework.png)
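As a rough illustration of how these four terms can be combined, below is a minimal, hypothetical sketch of a symmetric InfoNCE-style multi-view loss written with PaddlePaddle. It is not the released ERNIE-ViL 2.0 training code; the function names, view names, and equal weighting of the terms are assumptions for illustration only.
```python
# Hypothetical sketch of a multi-view contrastive loss in PaddlePaddle.
# This is NOT the released ERNIE-ViL 2.0 training code; names and weights
# are illustrative assumptions.
import paddle
import paddle.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings (matched by row)."""
    a = F.normalize(a, axis=-1)
    b = F.normalize(b, axis=-1)
    logits = paddle.matmul(a, b, transpose_y=True) / temperature
    labels = paddle.arange(a.shape[0])          # positives sit on the diagonal
    loss_a2b = F.cross_entropy(logits, labels)
    loss_b2a = F.cross_entropy(paddle.transpose(logits, [1, 0]), labels)
    return (loss_a2b + loss_b2a) / 2

def multi_view_loss(img_view1, img_view2, txt_caption, txt_objects):
    """Combine the four contrastive terms listed above.

    img_view1 / img_view2: embeddings of two visual views of the same image.
    txt_caption / txt_objects: embeddings of the caption and of an
    object-tag text view (hypothetical naming).
    """
    loss = info_nce(img_view1, txt_caption)     # image-caption (cross-modal)
    loss += info_nce(img_view1, txt_objects)    # image-objects (cross-modal)
    loss += info_nce(img_view1, img_view2)      # image-image  (intra-modal)
    loss += info_nce(txt_caption, txt_objects)  # text-text    (intra-modal)
    return loss / 4
```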
## Cross-modal retrieval performance (Zero-Shot)
Chinese models:
**[AIC-ICC](https://arxiv.org/pdf/1711.06475.pdf) dataset**
| Model | Structure (visual + text backbone) | T2I R@1 | I2T R@1 | meanRecall |
|------------|---------|-------|--------|----|
| ERNIE-ViL 2.0 Base (ViT) | ViT-B-16 + ERNIE 3.0 Base | 17.93 | 30.41 | 38.57 |
| ERNIE-ViL 2.0 Base (CNN) | EfficientNet-B5 + ERNIE 2.0 Base | 14.77 | 26.05 | 34.47 |
| ERNIE-ViL 2.0 Large (ViT) | ViT-L-14 + ERNIE 3.0 Large | **20.17** | 32.29 | **41.08** |
| ERNIE-ViL 2.0 Large (CNN) | EfficientNet-L2 + ERNIE 2.0 Large | 19.01 | **33.65** | 40.58 |

**[COCO-CN](https://arxiv.org/pdf/1805.08661.pdf) dataset**
| Model | Structure (visual + text backbone) | T2I R@1 | I2T R@1 | meanRecall |
|------------|---------|-------|--------|----|
| ERNIE-ViL 2.0 Base (ViT) | ViT-B-16 + ERNIE 3.0 Base | 66.00 | 65.90 | 84.28 |
| ERNIE-ViL 2.0 Base (CNN) | EfficientNet-B5 + ERNIE 2.0 Base | 62.70 | 65.30 | 83.17 |
| ERNIE-ViL 2.0 Large (ViT) | ViT-L-14 + ERNIE 3.0 Large | **70.30** | 68.80 | **86.32** |
| ERNIE-ViL 2.0 Large (CNN) | EfficientNet-L2 + ERNIE 2.0 Large | 69.80 | **69.50** | 86.28 |

* The AIC-ICC results are computed on the first 10,000 lines of the validation set.
## Examples
Here, the open-source ERNIE-ViL 2.0 Base (ViT) model is used as an example to run the zero-shot image-text retrieval task on COCO-CN:
* Model Download:
[ERNIE-ViL 2.0 Base (ViT)](http://bj.bcebos.com/wenxin-models/ERNIE_VIL2_BASE_ViT.pdparams)
* Data preparation: a [COCO-CN](http://bj.bcebos.com/wenxin-models/test.coco_cn.data) test set is built in. The data format (UTF-8 encoding by default) is three columns separated by \t: the first column is the text, the second column is the image ID in COCO, and the third column is the Base64-encoded image.
* First, set up the environment: install [paddle>=2.1.3](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html) and the packages in [requirements.txt](requirements.txt).
* Then, configure ./packages/configs/ernie_vil_base.yaml. For details, please refer to the comments in the configuration file (including the input/output paths and the model parameter path).
* Finally, run the following command to get the cross-modal image and text embeddings:
```bash
# Usage: bash $0 gpu-card-index config-path
$ bash run_infer.sh 2 ./packages/configs/ernie_vil_base.yaml
```
Then, using the output location defined in ./packages/configs/ernie_vil_base.yaml, evaluate with the following script:
```bash
# Usage: python $0 output-embedding-path
$ python eval_retri.py test_out/cross_modal_embeddings.out
```
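For reference, retrieval metrics of this kind can be computed from the two embedding matrices roughly as in the sketch below (in the table that follows, meanRecall is the average of R@1, R@5, and R@10). This is not the repository's eval_retri.py; the function name and the assumption that row i of the text and image matrices form a matched pair are illustrative only.
```python
# Illustrative sketch only (not the repository's eval_retri.py): compute
# text-to-image Recall@K from two embedding matrices, assuming row i of
# text_emb and image_emb form a matched pair.
import numpy as np

def recall_at_k(text_emb, image_emb, ks=(1, 5, 10)):
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = text_emb @ image_emb.T                 # cosine similarity matrix
    ranking = np.argsort(-sims, axis=1)           # most similar image first
    gt = np.arange(len(text_emb)).reshape(-1, 1)
    ranks = np.argmax(ranking == gt, axis=1)      # position of the true image
    return {f"R@{k}": 100.0 * float(np.mean(ranks < k)) for k in ks}
```
Image-to-text recall can be computed symmetrically by swapping the two matrices.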
The following are the results of the ERNIE-ViL 2.0 Base model on COCO-CN:
| Name | R@1 | R@5 | R@10 | meanRecall |
|------------|-------|-------|--------|------------|
| Text2Image | 66.00 | 90.00 | 96.10 | 84.03 |
| Image2Text | 65.90 | 91.40 | 96.30 | 84.53 |
| MeanRecall | 65.95 | 90.70 | 96.20 | 84.28 |
## Other
ERNIE-ViL stores image data in [base64](https://www.base64decode.org/) format.
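As a small illustration of this format and of the three-column test-data layout described in the Example section, the following sketch shows one way to base64-encode an image and write a \t-separated line. The file names, caption, and image ID are hypothetical placeholders.
```python
# Hypothetical example: write one line of \t-separated test data
# (text, image ID, base64-encoded image) as described above.
import base64

def make_line(caption: str, image_id: int, image_path: str) -> str:
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("utf-8")
    return "\t".join([caption, str(image_id), img_b64])

if __name__ == "__main__":
    # "example.jpg", the caption, and the ID are placeholders.
    with open("my_test.data", "w", encoding="utf-8") as out:
        out.write(make_line("a dog running on the grass", 12345, "example.jpg") + "\n")
```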