{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. ERNIE-M Introduction\n", "\n", "ERNIE-M, proposed by Baidu, is a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance. The insight is to integrate back-translation into the pre-training process by generating pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignments between different languages, thereby enhancing the semantic modeling of cross-lingual models. Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results in various cross-lingual downstream tasks.\n", "\n", "This project is a PaddlePaddle dynamic graph implementation of ERNIE-M, including model training, model validation, etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Model Effects and Application Scenarios\n", "\n", "### 2.1 Natural Language Inference\n", "\n", "#### 2.1.1 Dataset\n", "\n", "XNLI is a subset of MNLI and has been translated into 14 different kinds of languages including some low-resource languages. The goal of the task is to predict testual entailment (whether sentence A implies / contradicts / neither sentence B).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. How to Use the Model\n", "\n", "## 3.1 Model Training\n", "\n", "* prepare module\n", "\n", "```shell\n", "pip install --upgrade paddlenlp\n", "pip install datasets\n", "```\n", "\n", "* download script file.\n", "\n", "```shell\n", "# download from gitee\n", "wget https://gitee.com/paddlepaddle/PaddleNLP/blob/develop/model_zoo/ernie-m/run_classifier.py\n", "```\n", "\n", "* training with single gpu\n", "\n", "```shell\n", "python run_classifier.py \\\n", " --task_type cross-lingual-transfer \\\n", " --batch_size 16 \\\n", " --model_name_or_path ernie-m-base \\\n", " --save_steps 12272 \\\n", " --output_dir output\n", "```\n", "\n", "* training with multiple gpu\n", "\n", "```shell\n", "python -m paddle.distributed.launch --gpus 0,1 --log_dir output run_classifier.py \\\n", " --task_type cross-lingual-transfer \\\n", " --batch_size 16 \\\n", " --model_name_or_path ernie-m-base \\\n", " --save_steps 12272 \\\n", " --output_dir output\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.2 Parameter Description\n", "\n", "- `task_type` the type of Natural Language Inference. Supporting types include \"cross-lingual-transfer\", \"translate-train-all\", imply fine-tune the model with an English training set and evaluate the foreign language XNLI test and ine-tune the model on the concatenation of all other languages and evaluate on each language test set, respectively.\n", "- `model_name_or_path` Path to pre-trained model. Supporting \"ernie-m-base\", \"ernie-m-large\" or the directory to the model saved on the local device, e.g. \"./checkpoint/model_xx/\"。\n", "- `output_dir` The output directory where the model predictions and checkpoints will be written.\n", "- `max_seq_length` The maximum total input sequence length after tokenization. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Parameter Description\n", "\n", "- `task_type` The type of natural language inference task. Supported values are \"cross-lingual-transfer\" and \"translate-train-all\": the former fine-tunes the model on the English training set and evaluates on the XNLI test set of each foreign language, while the latter fine-tunes the model on the concatenation of the training sets of all languages and evaluates on the test set of each language.\n", "- `model_name_or_path` Path to the pre-trained model. Supported values are \"ernie-m-base\", \"ernie-m-large\", or the directory of a model saved on the local device, e.g. \"./checkpoint/model_xx/\".\n", "- `output_dir` The output directory where the model predictions and checkpoints will be written.\n", "- `max_seq_length` The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, and sequences shorter will be padded.\n", "- `learning_rate` The initial learning rate for AdamW.\n", "- `num_train_epochs` Total number of training epochs to perform.\n", "- `logging_steps` Log every X update steps.\n", "- `save_steps` Save a checkpoint every X update steps.\n", "- `batch_size` Batch size per GPU/CPU/XPU for training.\n", "- `weight_decay` Weight decay ratio for AdamW.\n", "- `layerwise_decay` Layerwise learning rate decay ratio.\n", "- `warmup_proportion` Proportion of total training steps used for linear learning-rate warmup.\n", "- `max_steps` If > 0, the total number of training steps to perform, overriding num_train_epochs.\n", "- `seed` Random seed for initialization.\n", "- `device` The device to use for training; must be one of cpu, gpu, or xpu.\n", "- `use_amp` Whether to enable automatic mixed precision training.\n", "- `scale_loss` The loss scaling value for fp16 mixed precision training." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Model Principles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ERNIE-M proposes two novel methods to align the representations of multiple languages:\n", "\n", "- **Cross-Attention Masked Language Modeling (CAMLM)**: CAMLM learns multilingual semantic representations by restoring the masked tokens in the input sentences.\n", "- **Back-Translation Masked Language Modeling (BTMLM)**: BTMLM trains the model to generate pseudo-parallel sentences from monolingual sentences. The generated pairs are then used as input to the model to further align cross-lingual semantics, thus enhancing the multilingual representation.\n", "\n", "![](https://foruda.gitee.com/images/1668157409546826003/f78cb949_5218658.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Related Papers and Citations\n", "\n", "```bibtex\n", "@article{Ouyang2021ERNIEMEM,\n", " title={ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora},\n", " author={Xuan Ouyang and Shuohuan Wang and Chao Pang and Yu Sun and Hao Tian and Hua Wu and Haifeng Wang},\n", " journal={ArXiv},\n", " year={2021},\n", " volume={abs/2012.15674}\n", "}\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.7.13 ('model_center')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "de1ffcbce2b3061b5001e2c22f3a27594f323d4a49b789ebdbef6534581834bd" } } }, "nbformat": 4, "nbformat_minor": 2 }