Simplified [Chinese](./readme.md) | English
# ERNIE-ViL 2.0: Multi-View Contrastive Learning for Image-Text Pre-training
In recent years, cross-modal models pre-trained on large-scale data have achieved remarkable results. Dual-tower pre-training frameworks based on **contrastive learning** can make full use of large-scale image-text aligned data, bring significant improvements to tasks such as cross-modal retrieval, and have attracted wide attention for their high computational efficiency; representative examples include [CLIP](https://arxiv.org/pdf/2103.00020.pdf) and [ALIGN](https://arxiv.org/pdf/2102.05918.pdf). However, traditional vision-language pre-training is based on single-view contrastive learning and cannot capture correlations both across and within modalities.

**ERNIE-ViL 2.0** proposes a pre-training framework based on multi-view contrastive learning. By constructing rich visual/textual views, it learns multiple inter-modal and intra-modal correlations simultaneously, yielding more robust cross-modal alignment and achieving industry-leading results on tasks such as cross-modal retrieval.

>[_**ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training**_](https://arxiv.org/pdf/2209.15270.pdf)
>
>Bin Shan, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
>
## Methods
ERNIE-ViL 2.0's multi-view contrastive learning includes:
- Cross-modal contrastive learning: image-text
- Intra-modal contrastive learning: image-image, text-text
![ERNIE-ViL2.0](./packages/src/framework.png)
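For intuition, below is a minimal sketch of the symmetric InfoNCE-style contrastive loss that such a dual-tower framework optimizes for a single view pair; multi-view pre-training applies it across the image-image, text-text, and image-text view pairs. This is an illustration, not the repository's training code, and the names (`info_nce`, `img_aug`, `txt_aug`) are hypothetical.

```python
import paddle
import paddle.nn.functional as F

def info_nce(a, b, temperature=0.05):
    """Symmetric InfoNCE loss for one view pair (e.g. image-text).

    a, b: [batch, dim] embeddings from the two towers; row i of a and b
    is a positive pair, every other row in the batch is a negative.
    """
    a = F.normalize(a, axis=-1)
    b = F.normalize(b, axis=-1)
    logits = paddle.matmul(a, b, transpose_y=True) / temperature  # [batch, batch]
    labels = paddle.arange(a.shape[0])
    # Contrast in both directions (a -> b and b -> a) and average.
    loss_ab = F.cross_entropy(logits, labels)
    loss_ba = F.cross_entropy(paddle.transpose(logits, [1, 0]), labels)
    return (loss_ab + loss_ba) / 2

# Multi-view training sums this loss over view pairs, e.g.:
# loss = info_nce(img, txt) + info_nce(img, img_aug) + info_nce(txt, txt_aug)
```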
## Cross-modal retrieval performance (zero-shot)
The following are the zero-shot results of the English and Chinese models on Flickr30K and COCO-CN. See the paper for further details.
* **ERNIE-ViL 2.0 (English) on Flickr30K**:

| Name | R@1 | R@5 | R@10 |
|------------|-------|-------|--------|
| Text2Image | 85.0 | 97.0 | 98.3 |
| Image2Text | 96.1 | 99.9 | 100.0 |

* **ERNIE-ViL 2.0 (Chinese) on COCO-CN**:

| Name | R@1 | R@5 | R@10 |
|------------|-------|-------|--------|
| Text2Image | 69.6 | 91.2 | 96.9 |
| Image2Text | 69.1 | 92.9 | 97.1 |
## Examples
Here, the open-source ERNIE-ViL 2.0 base (ViT) Chinese model is used as an example to perform the zero-shot text retrieval task on COCO-CN:
* Model download: [ERNIE-ViL 2.0 Base(ViT)](http://bj.bcebos.com/wenxin-models/ERNIE_VIL2_BASE_ViT.pdparams)
* Data preparation: we have built in a [COCO-CN](https://github.com/li-xirong/coco-cn) test set. The data format (UTF-8 encoding by default) is three columns separated by `\t`: the first column is the text, the second is the image ID in COCO, and the third is the Base64-encoded image (a parsing sketch is given after the commands below).
* First, set up the environment: install [paddle>=2.1.3](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.HTML) and the dependencies in [requirements.txt](requirements.txt).
* Then, configure ./packages/configs/ernie_vil_base.yaml as needed. For details, refer to the comments in the configuration file (including input/output paths and the model parameter path).
* Finally, run the inference script to obtain the cross-modal image and text embeddings. The location of the output results is defined in ./packages/configs/ernie_vil_base.yaml.
Retrieval metrics can then be computed from the output embeddings (a sketch of the recall computation is given at the end of this section):
```
# Usage: python $0 output-embedding-path
$ python eval_retri.py test_out/cross_modal_embeddings.out
```
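As a side note on the data preparation step above, each line of the built-in test file can be decoded roughly as follows. This is a minimal sketch: the file name and the PIL-based decoding are assumptions, and only the three-column, tab-separated, Base64 layout comes from this README.

```python
import base64
import io

from PIL import Image

def parse_line(line):
    """Parse one line of the built-in COCO-CN test set:
    text \t coco-image-id \t base64-encoded-image (UTF-8)."""
    text, image_id, img_b64 = line.rstrip("\n").split("\t")
    image = Image.open(io.BytesIO(base64.b64decode(img_b64)))
    return text, image_id, image

# Hypothetical file name; use the path configured in ernie_vil_base.yaml.
with open("test.coco_cn.data", encoding="utf-8") as f:
    text, image_id, image = parse_line(next(f))
```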
The following are the results of the ERNIE-ViL 2.0 base model on COCO-CN. See the paper for detailed results.

| Name | R@1 | R@5 | R@10 | meanRecall |
|------------|-------|-------|--------|--------------|
| Text2Image | 65.9 | 90.1 | 96.1 | 84.0 |
| Image2Text | 66.5 | 91.6 | 96.2 | 84.8 |
| MeanRecall | 66.2 | 90.9 | 96.2 | 84.4 |
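For reference, the recall metrics reported above can be computed from paired text/image embeddings along the following lines. This is a minimal sketch of the metric assuming one ground-truth image per text, not the actual eval_retri.py implementation.

```python
import numpy as np

def recall_at_k(text_emb, image_emb, ks=(1, 5, 10)):
    """Text2Image recall@K; row i of each matrix forms a positive pair."""
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sim = text_emb @ image_emb.T              # cosine similarity, [N, N]
    order = np.argsort(-sim, axis=1)          # images ranked per text query
    gt = np.arange(sim.shape[0])[:, None]     # ground-truth image index
    ranks = np.argmax(order == gt, axis=1)    # rank of the true image
    return {f"R@{k}": float((ranks < k).mean()) for k in ks}

# Image2Text recall is obtained by swapping the two arguments.
```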