Commit 3ab94d5b authored by qiaolongfei

update markdown, use numpy.savetxt and numpy.loadtxt to save/load the embedding table

Parent 66f866f1
...@@ -207,6 +207,26 @@ hiddensize = 256 # hidden layer dimension
N = 5 # train 5-gram
```
Functions used to save and load word_dict and the embedding table:
```python
def wordemb(inlayer):
wordemb = paddle.layer.table_projection(
input=inlayer,
size=embsize,
param_attr=paddle.attr.Param(
name="_proj", initial_std=0.001, learning_rate=1, l2_rate=0))
return wordemb
# save and load word dict and embedding table
def save_dict_and_embedding(word_dict, embeddings):
with open("word_dict", "w") as f:
for key in word_dict:
f.write(key + " " + str(word_dict[key]) + "\n")
with open("embedding_table", "w") as f:
numpy.savetxt(f, embeddings, delimiter=',', newline='\n')
```
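The matching loader (a sketch of the `load_dict_and_embedding` counterpart defined in the training script; the file names `word_dict` and `embedding_table` follow the save function above, and casting the stored index back to `int` is a convenience not in the original, which keeps it as a string):

```python
import numpy


def load_dict_and_embedding():
    # rebuild the word -> index mapping written by save_dict_and_embedding
    word_dict = dict()
    with open("word_dict", "r") as f:
        for line in f:
            key, value = line.strip().split(" ")
            word_dict[key] = int(value)  # stored as text; cast back to int

    # numpy.loadtxt parses the comma-separated rows written by numpy.savetxt
    embeddings = numpy.loadtxt("embedding_table", delimiter=",")
    return word_dict, embeddings
```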
Next, define the network structure:
- Map the $n-1$ words $w_{t-n+1},...w_{t-1}$ before $w_t$ to D-dimensional word vectors through a $|V|\times D$ matrix (D=32 in this example).
...@@ -333,6 +353,16 @@ Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Te
After 30 passes, we get an average error rate of classification_error_evaluator=0.735611.
## Save word dict and embedding table
After training, we can save the word dict and embedding table separately so they can be used directly later.
```python
# save word dict and embedding table
embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
save_dict_and_embedding(word_dict, embeddings)
```
## Model Application
After training, we can load the model parameters and use the learned word vectors to initialize other models, or inspect the parameters for downstream applications.
...
...@@ -224,6 +224,27 @@ hiddensize = 256 # hidden layer dimension
N = 5 # train 5-gram
```
- Functions used to save and load the word dict and embedding table
```python
def wordemb(inlayer):
wordemb = paddle.layer.table_projection(
input=inlayer,
size=embsize,
param_attr=paddle.attr.Param(
name="_proj", initial_std=0.001, learning_rate=1, l2_rate=0))
return wordemb
# save and load word dict and embedding table
def save_dict_and_embedding(word_dict, embeddings):
with open("word_dict", "w") as f:
for key in word_dict:
f.write(key + " " + str(word_dict[key]) + "\n")
with open("embedding_table", "w") as f:
numpy.savetxt(f, embeddings, delimiter=',', newline='\n')
```
- Map the $n-1$ words $w_{t-n+1},...w_{t-1}$ before $w_t$ to a D-dimensional vector through a matrix of dimension $|V|\times D$ (D=32 in this example).
```python
...@@ -343,6 +364,16 @@ Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Te
After 30 passes, we get an average error rate of around 0.735611.
## Save word dict and embedding table
After training, we can save the word dict and embedding table for future use.
```python
# save word dict and embedding table
embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
save_dict_and_embedding(word_dict, embeddings)
```
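As a sanity check, the text file produced by `numpy.savetxt` round-trips cleanly through `numpy.loadtxt`; a minimal sketch, where the 3×4 matrix stands in for the real `len(word_dict) × embsize` embedding returned by `parameters.get("_proj")`:

```python
import numpy

# stand-in for parameters.get("_proj").reshape(len(word_dict), embsize)
embeddings = numpy.arange(12, dtype=numpy.float64).reshape(3, 4)

with open("embedding_table", "w") as f:
    numpy.savetxt(f, embeddings, delimiter=',', newline='\n')

# shape and values survive the text round trip
restored = numpy.loadtxt("embedding_table", delimiter=",")
print(restored.shape)  # (3, 4)
```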
## Model Application
...
-import math, os
+import math
+import os
 import numpy
 import paddle.v2 as paddle

 with_gpu = os.getenv('WITH_GPU', '0') != '0'
...@@ -25,22 +26,17 @@ def save_dict_and_embedding(word_dict, embeddings):
         for key in word_dict:
             f.write(key + " " + str(word_dict[key]) + "\n")
     with open("embedding_table", "w") as f:
-        for line in embeddings:
-            f.write(",".join([str(x) for x in line]) + "\n")
+        numpy.savetxt(f, embeddings, delimiter=',', newline='\n')


 def load_dict_and_embedding():
     word_dict = dict()
-    embeddings = []
     with open("word_dict", "r") as f:
         for line in f:
             key, value = line.strip().split(" ")
             word_dict[key] = value

-    with open("embedding_table", "r") as f:
-        for line in f:
-            embeddings.append(
-                numpy.array([float(x) for x in line.strip().split(',')]))
+    embeddings = numpy.loadtxt("embedding_table", delimiter=",")

     return word_dict, embeddings
...@@ -102,7 +98,7 @@ def main():
     trainer = paddle.trainer.SGD(cost, parameters, adagrad)
     trainer.train(
         paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
-        num_passes=1,
+        num_passes=100,
         event_handler=event_handler)

     # save word dict and embedding table
...
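The diff above swaps a hand-rolled CSV loop for a single `numpy.savetxt` call; a quick sketch (with an illustrative 2×2 matrix and hypothetical file names) showing that both serializations parse back to the same array via `numpy.loadtxt`:

```python
import numpy

embeddings = numpy.array([[1.5, -2.0], [0.25, 3.0]])

# old approach removed by the commit: manual comma-joined rows
with open("embedding_table_old", "w") as f:
    for line in embeddings:
        f.write(",".join([str(x) for x in line]) + "\n")

# new approach: one numpy.savetxt call
with open("embedding_table_new", "w") as f:
    numpy.savetxt(f, embeddings, delimiter=',', newline='\n')

# both files load back to the same matrix
old = numpy.loadtxt("embedding_table_old", delimiter=",")
new = numpy.loadtxt("embedding_table_new", delimiter=",")
assert numpy.allclose(old, new)
```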