Commit e693156e authored by: S smallv0221

Merge branch 'develop' of https://github.com/PaddlePaddle/models into yxp1216

@@ -31,14 +31,14 @@ PaddleNLP provides multiple open-source pretrained Embedding models; users need only, when using `p
| Co-occurrence type | Target word vector | Context word vector |
| --------------------------- | ------ | ---- |
| Word → Word | w2v.baidu_encyclopedia.target.word-word.dim300 | w2v.baidu_encyclopedia.context.word-word.dim300 |
-| Word → Ngram (1-2) | w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300 | Not available yet |
+| Word → Ngram (1-2) | w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300 | w2v.baidu_encyclopedia.context.word-ngram.1-2.dim300 |
-| Word → Ngram (1-3) | Not available yet | Not available yet |
+| Word → Ngram (1-3) | w2v.baidu_encyclopedia.target.word-ngram.1-3.dim300 | w2v.baidu_encyclopedia.context.word-ngram.1-3.dim300 |
-| Ngram (1-2) → Ngram (1-2) | Not available yet | Not available yet |
+| Ngram (1-2) → Ngram (1-2) | w2v.baidu_encyclopedia.target.word-ngram.2-2.dim300 | w2v.baidu_encyclopedia.context.word-ngram.2-2.dim300 |
| Word → Character (1) | w2v.baidu_encyclopedia.target.word-character.char1-1.dim300 | w2v.baidu_encyclopedia.context.word-character.char1-1.dim300 |
| Word → Character (1-2) | w2v.baidu_encyclopedia.target.word-character.char1-2.dim300 | w2v.baidu_encyclopedia.context.word-character.char1-2.dim300 |
| Word → Character (1-4) | w2v.baidu_encyclopedia.target.word-character.char1-4.dim300 | w2v.baidu_encyclopedia.context.word-character.char1-4.dim300 |
-| Word → Word (left/right) | Not available yet | Not available yet |
+| Word → Word (left/right) | w2v.baidu_encyclopedia.target.word-wordLR.dim300 | w2v.baidu_encyclopedia.context.word-wordLR.dim300 |
-| Word → Word (distance) | Not available yet | Not available yet |
+| Word → Word (distance) | w2v.baidu_encyclopedia.target.word-wordPosition.dim300 | w2v.baidu_encyclopedia.context.word-wordPosition.dim300 |
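Each name in the table can be passed as `embedding_name` to `paddlenlp.embeddings.TokenEmbedding`, as the usage snippet later in this diff shows. A minimal sketch, picking one of the newly added context models:

```python
from paddlenlp.embeddings import TokenEmbedding

# Any cell from the table above can be substituted for embedding_name;
# here we load one of the newly added context embeddings.
token_embedding = TokenEmbedding(
    embedding_name="w2v.baidu_encyclopedia.context.word-ngram.1-2.dim300")
print(token_embedding)  # prints unknown/padding tokens and the weight tensor
```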
## English Word Vectors
...
@@ -117,6 +117,7 @@ python predict.py \
--max_grad_norm 5.0 \
--dataset yahoo \
--use_gpu True \
+--infer_output_file infer_output.txt \
--init_from_ckpt yahoo_model/49 \
```
...
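The new `--infer_output_file` flag above points the prediction script at a plain-text output file. A hypothetical post-processing sketch (the one-sample-per-line layout is an assumption, not confirmed by this diff):

```python
# Hypothetical: assumes infer_output.txt holds one generated sample per line.
with open("infer_output.txt", encoding="utf-8") as f:
    samples = [line.rstrip("\n") for line in f]
print(f"{len(samples)} samples read")
```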
@@ -58,7 +58,7 @@ nohup python train.py --vocab_path='./dict.txt' --use_gpu=True --lr=1e-4 --batch
The parameters above are:
* `vocab_path`: Path to the vocabulary file.
-* `use_gpu`: Whether to use the GPU for training; defaults to `False`.
+* `use_gpu`: Whether to use the GPU for training; defaults to `True`.
* `lr`: Learning rate; defaults to 5e-4.
* `batch_size`: Size of one batch; defaults to 64.
* `epochs`: Number of training epochs; defaults to 5.
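A rough sketch of how these flags and defaults could be declared with `argparse` (illustrative only; the actual parser in train.py may differ):

```python
import argparse

parser = argparse.ArgumentParser(description="Train word embeddings.")
parser.add_argument("--vocab_path", type=str, default="./dict.txt",
                    help="Path to the vocabulary file.")
# bool('False') is truthy, so parse the flag's string value explicitly.
parser.add_argument("--use_gpu", type=lambda v: v.lower() in ("true", "1"),
                    default=True, help="Whether to use the GPU for training.")
parser.add_argument("--lr", type=float, default=5e-4, help="Learning rate.")
parser.add_argument("--batch_size", type=int, default=64, help="Batch size.")
parser.add_argument("--epochs", type=int, default=5, help="Training epochs.")
args = parser.parse_args()
```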
@@ -96,3 +96,7 @@ Eval Acc:
## Acknowledgements
- Thanks to [Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors) for providing the source of the Chinese Word2Vec embeddings.
+## References
+- Li, Shen, et al. "Analogical reasoning on Chinese morphological and semantic relations." arXiv preprint arXiv:1805.06504 (2018).
+- Qiu, Yuanyuan, et al. "Revisiting correlations between intrinsic and extrinsic evaluations of word embeddings." Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018. 209-221.
@@ -25,20 +25,21 @@ token_embedding = TokenEmbedding(embedding_name="w2v.baidu_encyclopedia.target.w
# Inspect the token_embedding details
print(token_embedding)
-Object type: <paddlenlp.embeddings.token_embedding.TokenEmbedding object at 0x7f67fd192e30>
+Object type: <paddlenlp.embeddings.token_embedding.TokenEmbedding object at 0x7fda7eb5f290>
-Unknown index: 1
+Unknown index: 635963
Unknown token: [UNK]
-Padding index: 0
+Padding index: 635964
Padding token: [PAD]
Parameter containing:
-Tensor(shape=[636015, 300], dtype=float32, place=CPUPlace, stop_gradient=False,
-       [[ 0.        ,  0.        ,  0.        , ...,  0.        ,  0.        ,  0.        ],
-        [ 0.00372404,  0.01534354,  0.01341010, ..., -0.00605236, -0.02150303,  0.02372430],
-        [-0.24200200,  0.13931701,  0.07378800, ...,  0.14103900,  0.05592300, -0.08004800],
-        ...,
-        [ 0.01615800, -0.00266300, -0.00628300, ...,  0.01484100,  0.00196600, -0.01032000],
-        [ 0.01705700,  0.00040400, -0.01222000, ...,  0.02837200,  0.02402500, -0.00814800],
-        [ 0.02628800, -0.00008300, -0.00393500, ...,  0.00654000,  0.00024600, -0.00662600]])
+Tensor(shape=[635965, 300], dtype=float32, place=CPUPlace, stop_gradient=False,
+       [[-0.24200200,  0.13931701,  0.07378800, ...,  0.14103900,  0.05592300, -0.08004800],
+        [-0.08671700,  0.07770800,  0.09515300, ...,  0.11196400,  0.03082200, -0.12893000],
+        [-0.11436500,  0.12201900,  0.02833000, ...,  0.11068700,  0.03607300, -0.13763499],
+        ...,
+        [ 0.02628800, -0.00008300, -0.00393500, ...,  0.00654000,  0.00024600, -0.00662600],
+        [-0.00924490,  0.00652097,  0.01049327, ..., -0.01796000,  0.03498908, -0.02209341],
+        [ 0.        ,  0.        ,  0.        , ...,  0.        ,  0.        ,  0.        ]])
```
## Query Embedding Results
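A minimal query sketch, assuming the `search` and `cosine_sim` methods that `TokenEmbedding` exposes in recent PaddleNLP releases:

```python
from paddlenlp.embeddings import TokenEmbedding

token_embedding = TokenEmbedding(
    embedding_name="w2v.baidu_encyclopedia.target.word-word.dim300")

# Look up the 300-dim vectors for a batch of words.
vectors = token_embedding.search(["中国", "人民"])
print(vectors.shape)  # (2, 300)

# Cosine similarity between two words' vectors.
print(token_embedding.cosine_sim("中国", "人民"))
```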
@@ -93,5 +94,5 @@ words = tokenizer.cut("中国人民")
print(words) # ['中国人', '民']
tokens = tokenizer.encode("中国人民")
-print(tokens) # [12532, 1336]
+print(tokens) # [12530, 1334]
```
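The `tokenizer` above is presumably built from the embedding's vocabulary; a sketch assuming PaddleNLP's `JiebaTokenizer`:

```python
from paddlenlp.data import JiebaTokenizer
from paddlenlp.embeddings import TokenEmbedding

token_embedding = TokenEmbedding(
    embedding_name="w2v.baidu_encyclopedia.target.word-word.dim300")
# JiebaTokenizer segments text with jieba and maps words through the vocab.
tokenizer = JiebaTokenizer(vocab=token_embedding.vocab)

print(tokenizer.cut("中国人民"))     # word segmentation
print(tokenizer.encode("中国人民"))  # vocab ids for the segments
```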
@@ -22,17 +22,27 @@ PAD_TOKEN = '[PAD]'
UNK_TOKEN = '[UNK]'
EMBEDDING_NAME_LIST = [
+    # Word2Vec
    # baidu_encyclopedia
    "w2v.baidu_encyclopedia.target.word-word.dim300",
    "w2v.baidu_encyclopedia.target.word-character.char1-1.dim300",
    "w2v.baidu_encyclopedia.target.word-character.char1-2.dim300",
    "w2v.baidu_encyclopedia.target.word-character.char1-4.dim300",
    "w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300",
+    "w2v.baidu_encyclopedia.target.word-ngram.1-3.dim300",
+    "w2v.baidu_encyclopedia.target.word-ngram.2-2.dim300",
+    "w2v.baidu_encyclopedia.target.word-wordLR.dim300",
+    "w2v.baidu_encyclopedia.target.word-wordPosition.dim300",
    "w2v.baidu_encyclopedia.target.bigram-char.dim300",
    "w2v.baidu_encyclopedia.context.word-word.dim300",
    "w2v.baidu_encyclopedia.context.word-character.char1-1.dim300",
    "w2v.baidu_encyclopedia.context.word-character.char1-2.dim300",
    "w2v.baidu_encyclopedia.context.word-character.char1-4.dim300",
+    "w2v.baidu_encyclopedia.context.word-ngram.1-2.dim300",
+    "w2v.baidu_encyclopedia.context.word-ngram.1-3.dim300",
+    "w2v.baidu_encyclopedia.context.word-ngram.2-2.dim300",
+    "w2v.baidu_encyclopedia.context.word-wordLR.dim300",
+    "w2v.baidu_encyclopedia.context.word-wordPosition.dim300",
    # wikipedia
    "w2v.wiki.target.bigram-char.dim300",
    "w2v.wiki.target.word-char.dim300",
...
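As an illustrative sanity check of the pairing this commit completes (using only the `EMBEDDING_NAME_LIST` constant shown above, which is truncated here):

```python
# Strip the .target./.context. segment so paired model names compare equal.
targets = {n.replace(".target.", ".", 1) for n in EMBEDDING_NAME_LIST
           if ".target." in n}
contexts = {n.replace(".context.", ".", 1) for n in EMBEDDING_NAME_LIST
            if ".context." in n}
print(sorted(targets - contexts))  # target-only models (e.g. bigram-char)
print(sorted(contexts - targets))  # context-only models (expected: empty)
```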