update the readme for the word_embedding (#5050)

33d65b31 · wawltor · GitHub · ec17d938
Showing 2 changed files with 16 additions and 10 deletions

PaddleNLP/examples/word_embedding/README.md +9 -8

PaddleNLP/paddlenlp/embeddings/token_embedding.py +7 -2

--- a/PaddleNLP/examples/word_embedding/README.md
+++ b/PaddleNLP/examples/word_embedding/README.md
@@ -35,14 +35,14 @@ wget https://paddlenlp.bj.bcebos.com/data/dict.txt

 ### Start training

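-We use the public Chinese sentiment classification dataset ChnSentiCorp as the example dataset. Run the commands below to train the model on the training set (train.tsv) and validate it on the dev set (dev.tsv). The experiment logs are saved to use_token_embedding.txt and use_normal_embedding.txt.
+We use the public Chinese sentiment classification dataset ChnSentiCorp as the example dataset. Run the commands below to train the model on the training set (train.tsv) and validate it on the dev set (dev.tsv). The experiment logs are saved to use_token_embedding.txt and use_normal_embedding.txt. With the PaddlePaddle framework's Embedding, the model overfits very easily on ChnSentiCorp, so its learning rate has been lowered.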

 Run on CPU:

 ```
 nohup python train.py --vocab_path='./dict.txt' --use_gpu=False --lr=5e-4 --batch_size=64 --epochs=20 --use_token_embedding=True --vdl_dir='./vdl_dir' >use_token_embedding.txt 2>&1 &

-nohup python train.py --vocab_path='./dict.txt' --use_gpu=False --lr=5e-4 --batch_size=64 --epochs=20 --use_token_embedding=False --vdl_dir='./vdl_dir'>use_normal_embedding.txt 2>&1 &
+nohup python train.py --vocab_path='./dict.txt' --use_gpu=False --lr=1e-4 --batch_size=64 --epochs=20 --use_token_embedding=False --vdl_dir='./vdl_dir'>use_normal_embedding.txt 2>&1 &
 ```

 Run on GPU:
@@ -52,7 +52,7 @@ export CUDA_VISIBLE_DEVICES=0
 nohup python train.py --vocab_path='./dict.txt' --use_gpu=True --lr=5e-4 --batch_size=64 --epochs=20 --use_token_embedding=True --vdl_dir='./vdl_dir' > use_token_embedding.txt 2>&1 &

 # If GPU memory is insufficient, wait for the first training run to finish before starting this one
-nohup python train.py --vocab_path='./dict.txt' --use_gpu=True --lr=5e-4 --batch_size=64 --epochs=20 --use_token_embedding=False --vdl_dir='./vdl_dir' > use_normal_embedding.txt 2>&1 &
+nohup python train.py --vocab_path='./dict.txt' --use_gpu=True --lr=1e-4 --batch_size=64 --epochs=20 --use_token_embedding=False --vdl_dir='./vdl_dir' > use_normal_embedding.txt 2>&1 &
 ```

 The parameters above mean:
@@ -83,12 +83,13 @@ nohup visualdl --logdir ./vdl_dir --port 8888 --host 0.0.0.0 &

 In Chrome, open `ip:8888` (where ip is the IP address of the machine running VisualDL).

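-The figure below compares the results of the example experiments: blue is the run using `paddle.embeddings.TokenEmbedding`, and green is the run using an Embedding with no pretrained model loaded. With `paddle.embeddings.TokenEmbedding`, validation accuracy trends upward and converges at about 0.90, whereas without `paddle.embeddings.TokenEmbedding` it trends downward and converges at about 0.86. The example experiments show that using `paddle.embedding.TokenEmbedding` improves training results.
+The figure below compares the results of the example experiments: blue is the run using `paddle.embeddings.TokenEmbedding`, and green is the run using an Embedding with no pretrained model loaded. With `paddle.embeddings.TokenEmbedding`, validation accuracy trends upward and converges at about 0.90; after converging it stays relatively stable and does not overfit easily. Without `paddle.embeddings.TokenEmbedding`, validation accuracy trends downward and converges at about 0.86. The example experiments show that using `paddle.embedding.TokenEmbedding` improves training results.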

 Eval Acc:

-![eval acc](https://user-images.githubusercontent.com/10826371/102055579-0a743780-3e26-11eb-9025-99ffd06ecb68.png)
+![eval acc](https://user-images.githubusercontent.com/16698950/102076935-79ac5480-3e43-11eb-81f8-6e509c394fbf.png)

-Eval Loss:
-
-![eval loss](https://user-images.githubusercontent.com/10826371/102055669-28da3300-3e26-11eb-8034-ee902931b7cf.png)
+|                                     |    Best Acc    |
+| ------------------------------------| -------------  |
+| paddle.nn.Embedding                 |    0.8965      |
+| paddelnlp.embeddings.TokenEmbedding |    0.9082      |
--- a/PaddleNLP/paddlenlp/embeddings/token_embedding.py
+++ b/PaddleNLP/paddlenlp/embeddings/token_embedding.py
@@ -121,8 +121,13 @@ class TokenEmbedding(nn.Embedding):
        # update idx_to_word
        self._idx_to_word = extend_vocab_list
        self._word_to_idx = self._construct_word_to_idx(self._idx_to_word)
-        embedding_table = np.random.normal(
-            scale=0.02,
+
+        # use the Xavier init the embedding
+        xavier_scale = np.sqrt(
+            6.0 / float(len(self._idx_to_word) + self.embedding_dim))
+        embedding_table = np.random.uniform(
+            low=-1.0 * xavier_scale,
+            high=xavier_scale,
            size=(len(self._idx_to_word),
                  self.embedding_dim)).astype(paddle.get_default_dtype())
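
The hunk above replaces a normal draw with scale 0.02 by a uniform draw on [-a, a], where a = sqrt(6 / (vocab_size + embedding_dim)). The following is a minimal standalone sketch of that computation in NumPy, not the code from `token_embedding.py`. The function name `xavier_uniform_table`, its parameters, and the example sizes are hypothetical, and it uses `np.random.default_rng` rather than the `np.random.uniform` call in the patch so that the draw can be seeded.

```python
import numpy as np


def xavier_uniform_table(vocab_size, embedding_dim, dtype="float32", seed=None):
    """Sample a (vocab_size, embedding_dim) table from U(-a, a).

    a = sqrt(6 / (vocab_size + embedding_dim)), the same bound the patch
    computes as `xavier_scale`. Function and argument names are illustrative.
    """
    rng = np.random.default_rng(seed)
    bound = np.sqrt(6.0 / float(vocab_size + embedding_dim))
    table = rng.uniform(low=-bound, high=bound, size=(vocab_size, embedding_dim))
    return table.astype(dtype), bound


# Hypothetical sizes, chosen only for illustration.
table, bound = xavier_uniform_table(vocab_size=1000, embedding_dim=300, seed=0)

# A uniform distribution on [-a, a] has standard deviation a / sqrt(3).
expected_std = bound / np.sqrt(3.0)
print(table.shape, float(bound), float(table.std()), float(expected_std))
```

Unlike the fixed scale of 0.02 it replaces, this bound shrinks as the vocabulary or the embedding dimension grows, so the spread of the initial values depends on the table's shape.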