Commit 517a228a authored by liaogang

Merge branch 'develop' of https://github.com/PaddlePaddle/book into rerere

```yaml
- repo: https://github.com/Lucas-C/pre-commit-hooks.git
  sha: c25201a00e6b0514370501050cf2a8538ac12270
  hooks:
    - id: remove-crlf
- repo: https://github.com/reyoung/mirrors-yapf.git
  sha: v0.13.2
  hooks:
    - id: yapf
      files: (.*\.(py|bzl)|BUILD|.*\.BUILD|WORKSPACE)$  # Bazel BUILD files follow Python syntax.
- repo: https://github.com/pre-commit/pre-commit-hooks
  sha: v0.7.1
  hooks:
    - id: check-merge-conflict
    - id: check-symlinks
    - id: detect-private-key
    - id: end-of-file-fixer
    - id: trailing-whitespace
- repo: git://github.com/Lucas-C/pre-commit-hooks
  sha: v1.0.1
  hooks:
    - id: forbid-crlf
    - id: remove-crlf
    - id: forbid-tabs
    - id: remove-tabs
```
@@ -59,9 +59,8 @@ import paddle.v2.dataset.uci_housing as uci_housing

The uci_housing module wraps up:

1. Downloading the data. The downloaded file is saved to ~/.cache/paddle/dataset/uci_housing/housing.data.
2. The [data preprocessing](#数据预处理) step (a quick peek at the resulting reader is sketched right after this list).
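As a quick sanity check (not part of the original tutorial), one can pull a couple of samples straight from the reader returned by `uci_housing.train()`; each sample is assumed to be a pair of a 13-dimensional feature vector and the house price:

```python
# Peek at the first two training samples. uci_housing.train() returns a reader
# function; calling it yields a generator of (features, price) pairs.
import itertools
for features, price in itertools.islice(uci_housing.train()(), 2):
    print len(features), price
```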
### About the Dataset

@@ -105,25 +104,23 @@ import paddle.v2.dataset.uci_housing as uci_housing

We split the dataset into two parts: one part is used to fit the model's parameters, i.e. to train the model, and the model's error on this set is called the **training error**; the other part is held out for testing, and the model's error on it is called the **test error**. Since the goal of training is to find regularities in the training data in order to predict new, unseen data, the test error is the more telling measure of model quality. The split ratio must balance two factors: more training data lowers the variance of the parameter estimates, giving a more trustworthy model, while more test data lowers the variance of the test error, giving a more trustworthy test error. In this example we use an $8:2$ split.
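For intuition, the split itself is just index arithmetic. The following sketch is not part of the tutorial and uses a stand-in list for the 506 Boston housing records (uci_housing already ships pre-split `train()`/`test()` readers):

```python
# Illustrative only: an 8:2 split of a stand-in sample list.
samples = range(506)                      # stand-in for the 506 housing records
split_at = int(len(samples) * 0.8)        # 8:2 split point
train_set, test_set = samples[:split_at], samples[split_at:]
print len(train_set), len(test_set)       # 404 102
```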
When training more complex models, we often use one more dataset: the validation set. Complex models usually have additional hyperparameters ([Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization)) to tune, so we train several models with different hyperparameter combinations, compare their performance on the validation set to pick the best combination, and only then evaluate the test error of the model trained with that combination on the test set. Since the model in this chapter is fairly simple, we skip this step for now.
## Training

`fit_a_line/trainer.py` demonstrates the overall training workflow.

### Initializing PaddlePaddle

```python
paddle.init(use_gpu=False, trainer_count=1)
```
### Model Configuration

A linear regression model is simply a fully-connected layer (`fc_layer`) with a linear activation (`LinearActivation`):

```python
# Input data: the 13-dimensional house features.
x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
y_predict = paddle.layer.fc(input=x,
                            size=1,
                            act=paddle.activation.Linear())
y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1))
cost = paddle.layer.regression_cost(input=y_predict, label=y)
```
### Creating Parameters

```python
parameters = paddle.parameters.create(cost)
```
### Creating the Trainer

```python
optimizer = paddle.optimizer.Momentum(momentum=0)

trainer = paddle.trainer.SGD(cost=cost,
                             parameters=parameters,
                             update_equation=optimizer)
```
### Reading Data and Printing Training Progress

PaddlePaddle provides a [reader mechanism](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader) for reading data. The data a reader returns can contain multiple columns, so we need a Python dict that maps column indices to the corresponding data layers of the network.

```python
feeding={'x': 0, 'y': 1}
```
We can also provide an event handler to report training progress:

```python
# event_handler to print training and testing info
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "Pass %d, Batch %d, Cost %f" % (
                event.pass_id, event.batch_id, event.cost)

    if isinstance(event, paddle.event.EndPass):
        result = trainer.test(
            reader=paddle.batch(uci_housing.test(), batch_size=2),
            feeding=feeding)
        print "Test %d, Cost %f" % (event.pass_id, result.cost)
```
### Start Training

```python
trainer.train(
    reader=paddle.batch(
        paddle.reader.shuffle(
            uci_housing.train(), buf_size=500),
        batch_size=2),
    feeding=feeding,
    event_handler=event_handler,
    num_passes=30)
```
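Once training finishes, one way to use the learned parameters is to run inference on a few held-out samples. The following is a sketch rather than part of the tutorial: it assumes this PaddlePaddle v2 release exposes `paddle.infer` with an `output_layer` argument, which may differ by version.

```python
# Sketch only: feed two test samples through the trained network.
import itertools
test_samples = [(x_i,) for x_i, _ in itertools.islice(uci_housing.test()(), 2)]
predictions = paddle.infer(output_layer=y_predict,   # assumed argument name
                           parameters=parameters,
                           input=test_samples)
print predictions  # predicted prices for the two samples
```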
## Running the Training Script from Bash

**Make sure the path to the installed PaddlePaddle package is set correctly.**

```bash
python train.py
```
## Summary

In this chapter, we used the Boston housing-price dataset to introduce the basic concepts of linear regression and showed how to train and test such a model with PaddlePaddle. Many models and techniques evolve from the simple linear regression model, so understanding the principles and limitations of linear models is very important.
@@ -22,34 +22,20 @@ Standard SRL system mostly builds on top of Syntactic Analysis and contains five
<div align="center">
<img src="image/dependency_parsing_en.png" width = "80%" align=center /><br>
Fig 1. Syntactic parse tree
</div>
However, complete syntactic analysis requires identifying the relations among all constituents, and the performance of SRL is sensitive to the precision of syntactic analysis, which makes SRL a very challenging task. To reduce the complexity while still obtaining some syntactic structure information, we often use shallow syntactic analysis, also called partial parsing or chunking. Unlike complete syntactic analysis, which requires constructing the full parse tree, shallow syntactic analysis only needs to identify some independent constituents with relatively simple structure, such as verb phrases (chunks). To avoid the difficulty of building a high-accuracy syntactic tree, some work \[[1](#Reference)\] proposed semantic-chunking-based SRL methods, which convert SRL into a sequence tagging problem. Sequence tagging labels syntactic chunks with the BIO representation: for a chunk of type A, the first word receives the tag B-A (Begin), the remaining words in the chunk receive the tag I-A (Inside), and words outside any chunk receive the tag O.

The BIO representation of the above example is shown in Fig. 2.
<div align="center">
<img src="image/bio_example_en.png" width = "90%" align=center /><br>
Fig 2. BIO representation
</div>
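To make the tagging scheme concrete, here is a tiny, made-up illustration; the tokens and role labels are invented for demonstration and are not taken from the dataset loader:

```python
# Hypothetical BIO tagging of a short chunked sentence.
tokens = ["A", "record", "date", "has", "not", "been", "set", "."]
tags   = ["B-A1", "I-A1", "I-A1", "O", "B-AM-NEG", "O", "B-V", "O"]
for token, tag in zip(tokens, tags):
    print "%-8s %s" % (token, tag)
```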
This example illustrates the simplicity of sequence tagging: (1) shallow syntactic analysis lowers the precision requirement on syntactic analysis; (2) the pruning of candidate arguments is no longer needed; (3) argument identification and labeling are done at the same time. Such unified methods simplify the procedure, reduce the risk of accumulating errors, and further boost performance.

In this tutorial, our SRL system is built as an end-to-end system with a neural network. We take only text sequences as input, without using any syntactic parsing results or complex hand-designed features. We use the public dataset of the [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) as an example: given a sentence with marked predicates, identify the corresponding arguments and their semantic roles by sequence tagging.
@@ -71,13 +57,10 @@ The operation of a single LSTM cell contain 3 parts: (1) input-to-hidden: map in
Fig. 3 illustrates the final stacked recurrent neural network.

<p align="center">
<img src="./image/stacked_lstm_en.png" width = "40%" align=center><br>
Fig 3. Stacked Recurrent Neural Networks
</p>
### Bidirectional Recurrent Neural Network

LSTMs can summarize the history of the inputs seen so far, but they cannot see the future. In most NLP (natural language processing) tasks, however, the entire sentence is available, so sequence learning can be much more effective if the future is encoded as well as the history.
@@ -86,15 +69,10 @@ To address the above drawbacks, we can design bidirectional recurrent neural net
<p align="center">
<img src="./image/bidirectional_stacked_lstm_en.png" width = "60%" align=center><br>
Fig 4. Bidirectional LSTMs
</p>
Note that this bidirectional RNN is different from the one proposed by Bengio et al. for machine translation tasks \[[3](#Reference), [4](#Reference)\]. We will introduce the other bidirectional RNN in the chapter on [machine translation](https://github.com/PaddlePaddle/book/blob/develop/machine_translation/README.md).

### Conditional Random Field
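The detailed derivation sits in the collapsed part of this diff; for reference, the standard linear-chain CRF over an input sequence $X$ and a label sequence $Y$ of length $L$ is usually written as below. This is the generic textbook form and may differ slightly from the book's own notation:

$$P(Y|X) = \frac{1}{Z(X)} \exp\left(\sum_{i=1}^{L} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, X, i)\right), \qquad Z(X) = \sum_{Y'} \exp\left(\sum_{i=1}^{L} \sum_{j} \lambda_j f_j(y'_{i-1}, y'_i, X, i)\right)$$

where the $f_j$ are feature functions defined on adjacent labels and the input, and the $\lambda_j$ are their learned weights.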
@@ -156,18 +134,10 @@ After modification, the model is as follows:
<div align="center">
<img src="image/db_lstm_en.png" width = "60%" align=center /><br>
Fig 6. DB-LSTM for SRL tasks
</div>
## Data Preparation

In this tutorial, we use the [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL task open dataset as an example. Note that the training and development sets of the CoNLL 2005 SRL task are not free to download after the competition; currently only the test set can be obtained, including 23 sections of the Wall Street Journal and three sections of the Brown corpus. In this tutorial we use the WSJ corpus as the training data to explain the model. However, since this set is small, if you want to train a usable neural-network SRL system, consider paying for the full corpus.
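As a quick way to see what this data looks like, the following sketch loads the pre-packaged test split through `paddle.v2.dataset.conll05`. It assumes this module and its `get_dict()`/`test()` helpers are present in the installed PaddlePaddle version; treat the names as assumptions if your version differs:

```python
# Sketch: inspect the CoNLL-2005 test data shipped with paddle.v2.dataset.
import paddle.v2.dataset.conll05 as conll05

word_dict, verb_dict, label_dict = conll05.get_dict()
print len(word_dict), len(verb_dict), len(label_dict)

# Each sample is a tuple of input columns plus the label sequence.
sample = next(conll05.test()())
print len(sample)
```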
Image changed: label_semantic_roles/image/bio_example.png (379.1 KB → 31.7 KB)

Image changed: label_semantic_roles/image/dependency_parsing.png (186.0 KB → 58.2 KB)
```python
import paddle.v2 as paddle


def seqToseq_net(source_dict_dim, target_dict_dim):
    # Network Architecture
    word_vector_dim = 512  # dimension of word vector
    decoder_size = 512  # dimension of hidden unit in GRU Decoder network
    encoder_size = 512  # dimension of hidden unit in GRU Encoder network

    # Encoder: bidirectional GRU over the source word embeddings.
    src_word_id = paddle.layer.data(
        name='source_language_word',
        type=paddle.data_type.integer_value_sequence(source_dict_dim))
    src_embedding = paddle.layer.embedding(
        input=src_word_id,
        size=word_vector_dim,
        param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
    src_forward = paddle.networks.simple_gru(
        input=src_embedding, size=encoder_size)
    src_backward = paddle.networks.simple_gru(
        input=src_embedding, size=encoder_size, reverse=True)
    encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])

    # Decoder: attention-equipped GRU bootstrapped from the backward encoder state.
    with paddle.layer.mixed(size=decoder_size) as encoded_proj:
        encoded_proj += paddle.layer.full_matrix_projection(
            input=encoded_vector)

    backward_first = paddle.layer.first_seq(input=src_backward)

    with paddle.layer.mixed(
            size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
        decoder_boot += paddle.layer.full_matrix_projection(
            input=backward_first)

    def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
        decoder_mem = paddle.layer.memory(
            name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)

        context = paddle.networks.simple_attention(
            encoded_sequence=enc_vec,
            encoded_proj=enc_proj,
            decoder_state=decoder_mem)

        with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
            decoder_inputs += paddle.layer.full_matrix_projection(input=context)
            decoder_inputs += paddle.layer.full_matrix_projection(
                input=current_word)

        gru_step = paddle.layer.gru_step(
            name='gru_decoder',
            input=decoder_inputs,
            output_mem=decoder_mem,
            size=decoder_size)

        with paddle.layer.mixed(
                size=target_dict_dim,
                bias_attr=True,
                act=paddle.activation.Softmax()) as out:
            out += paddle.layer.full_matrix_projection(input=gru_step)
        return out

    decoder_group_name = "decoder_group"
    group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
    group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
    group_inputs = [group_input1, group_input2]

    trg_embedding = paddle.layer.embedding(
        input=paddle.layer.data(
            name='target_language_word',
            type=paddle.data_type.integer_value_sequence(target_dict_dim)),
        size=word_vector_dim,
        param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
    group_inputs.append(trg_embedding)

    # For a decoder equipped with an attention mechanism, in training,
    # the target embedding (the ground truth) is the data input,
    # while the encoded source sequence is accessed as an unbounded memory.
    # Here, the StaticInput defines a read-only memory
    # for the recurrent_group.
    decoder = paddle.layer.recurrent_group(
        name=decoder_group_name,
        step=gru_decoder_with_attention,
        input=group_inputs)

    lbl = paddle.layer.data(
        name='target_language_next_word',
        type=paddle.data_type.integer_value_sequence(target_dict_dim))
    cost = paddle.layer.classification_cost(input=decoder, label=lbl)
    return cost


def main():
    paddle.init(use_gpu=False, trainer_count=1)

    # source and target dict dim.
    dict_size = 30000
    source_dict_dim = target_dict_dim = dict_size

    # define network topology
    cost = seqToseq_net(source_dict_dim, target_dict_dim)
    parameters = paddle.parameters.create(cost)

    # define optimize method and trainer
    optimizer = paddle.optimizer.Adam(learning_rate=1e-4)
    trainer = paddle.trainer.SGD(cost=cost,
                                 parameters=parameters,
                                 update_equation=optimizer)

    # define data reader
    feeding = {
        'source_language_word': 0,
        'target_language_word': 1,
        'target_language_next_word': 2
    }

    wmt14_reader = paddle.batch(
        paddle.reader.shuffle(
            paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
        batch_size=5)

    # define event_handler callback
    def event_handler(event):
        if isinstance(event, paddle.event.EndIteration):
            if event.batch_id % 10 == 0:
                print "Pass %d, Batch %d, Cost %f, %s" % (
                    event.pass_id, event.batch_id, event.cost, event.metrics)

    # start to train
    trainer.train(
        reader=wmt14_reader,
        event_handler=event_handler,
        num_passes=10000,
        feeding=feeding)


if __name__ == '__main__':
    main()
```
.idea
.ipynb_checkpoints
# Word Vectors

The source code for this tutorial is in [book/word2vec](https://github.com/PaddlePaddle/book/tree/develop/word2vec). For first-time users, please refer to the PaddlePaddle [installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
@@ -225,38 +226,38 @@ def wordemb(inlayer):

- Define the names and data types accepted by the input layers.
```python
paddle.init(use_gpu=False, trainer_count=3)  # initialize PaddlePaddle
word_dict = paddle.dataset.imikolov.build_dict()
dict_size = len(word_dict)
# Every input layer accepts integer values in the range [0, dict_size).
firstword = paddle.layer.data(
    name="firstw", type=paddle.data_type.integer_value(dict_size))
secondword = paddle.layer.data(
    name="secondw", type=paddle.data_type.integer_value(dict_size))
thirdword = paddle.layer.data(
    name="thirdw", type=paddle.data_type.integer_value(dict_size))
fourthword = paddle.layer.data(
    name="fourthw", type=paddle.data_type.integer_value(dict_size))
nextword = paddle.layer.data(
    name="fifthw", type=paddle.data_type.integer_value(dict_size))

Efirst = wordemb(firstword)
Esecond = wordemb(secondword)
Ethird = wordemb(thirdword)
Efourth = wordemb(fourthword)
```
- Concatenate these n-1 word embeddings with concat_layer into one large vector, which serves as the feature of the history context.

```python
contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
```
- Pass the history feature through a fully-connected layer to obtain a hidden representation of the text.

```python
hidden1 = paddle.layer.fc(input=contextemb,
                          size=hiddensize,
                          act=paddle.activation.Sigmoid(),
                          layer_attr=paddle.attr.Extra(drop_rate=0.5),
```

@@ -269,7 +270,7 @@ def main():
- Map the hidden text feature through another fully-connected layer to a $|V|$-dimensional vector, normalized with softmax to give the generation probability of each of the `|V|` words.

```python
predictword = paddle.layer.fc(input=hidden1,
                              size=dict_size,
                              bias_attr=paddle.attr.Param(learning_rate=2),
                              act=paddle.activation.Softmax())
```

@@ -288,11 +289,11 @@ cost = paddle.layer.classification_cost(input=predictword, label=nextword)
- Regularization: a way to prevent the network from overfitting; here we use L2 regularization.

```python
parameters = paddle.parameters.create(cost)
adam_optimizer = paddle.optimizer.Adam(
    learning_rate=3e-3,
    regularization=paddle.optimizer.L2Regularization(8e-4))
trainer = paddle.trainer.SGD(cost, parameters, adam_optimizer)
```
Next we start the training process. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` provide the training and test datasets respectively. Each of them returns a reader: in PaddlePaddle, a reader is a Python function that returns a Python generator each time it is called.

@@ -300,112 +301,94 @@ cost = paddle.layer.classification_cost(input=predictword, label=nextword)

`paddle.batch` takes a reader as input and returns a batched reader: in PaddlePaddle, a reader yields one training sample at a time, while a batched reader yields one minibatch at a time.
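To make the reader contract concrete, here is a small sketch that is not part of the tutorial: a hand-written reader yielding fake (context, next-word) samples, wrapped by `paddle.batch`. The fake data is purely for illustration:

```python
# A toy reader: a function returning a generator that yields one sample per step.
def toy_reader():
    def reader():
        for i in range(8):
            # (four context word ids, id of the word to predict)
            yield i, i + 1, i + 2, i + 3, i + 4
    return reader

batched = paddle.batch(toy_reader(), batch_size=4)
for minibatch in batched():
    print minibatch
```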
```python
import gzip

def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            result = trainer.test(
                paddle.batch(
                    paddle.dataset.imikolov.test(word_dict, N), 32))
            print "Pass %d, Testing metrics %s" % (event.pass_id, result.metrics)
            with gzip.open("model_%d.tar.gz"%event.pass_id, 'w') as f:
                parameters.to_tar(f)

trainer.train(
    paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
    num_passes=100,
    event_handler=event_handler)
```
The training process is fully automatic; the log printed by event_handler looks like the following:

    Pass 0, Batch 25000, Cost 4.251861, {'classification_error_evaluator': 0.84375}
    Pass 0, Batch 25100, Cost 4.847692, {'classification_error_evaluator': 0.8125}
    Pass 0, Testing metrics {'classification_error_evaluator': 0.7417652606964111}
After 30 passes, we obtain an average error rate of classification_error_evaluator=0.735611.
## Applying the Model

After training, we can load the model parameters, use the learned word embeddings to initialize other models, or inspect the parameters directly for downstream use.
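Because the event handler above writes a `model_<pass>.tar.gz` checkpoint, a saved model can later be restored before inspection. The following is a sketch, not part of the tutorial: it assumes a file such as `model_5.tar.gz` exists and that this PaddlePaddle version provides `paddle.parameters.Parameters.from_tar`:

```python
# Sketch: restore parameters saved by the event handler above.
import gzip
with gzip.open("model_5.tar.gz") as f:   # hypothetical checkpoint name
    parameters = paddle.parameters.Parameters.from_tar(f)
```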
### Inspecting the Word Embeddings

Parameters trained by PaddlePaddle can be fetched directly with `parameters.get()`. For example, to inspect the embedding of the word `word`:

```python
embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)

print embeddings[word_dict['word']]
```

    [-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
    -0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
    0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
    -0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
    0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
    0.19072419 -0.24286366]

### Modifying the Word Embeddings

The embeddings we obtain form a standard numpy array. We can modify the array and then assign it back to the parameters.

```python
def modify_embedding(emb):
    # Add your modification here.
    pass

modify_embedding(embeddings)
parameters.set("_proj", embeddings)
```
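As one concrete, purely illustrative modification (not from the tutorial), each word vector could be L2-normalized before being written back:

```python
import numpy as np

# Example modification: L2-normalize every row of the embedding table.
norms = np.sqrt((embeddings ** 2).sum(axis=1, keepdims=True))
parameters.set("_proj", embeddings / norms)
```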
### Computing the Cosine Distance Between Words

The distance between two vectors can be measured by the cosine value, which lies in $[-1,1]$: the larger the cosine between two vectors, the closer they are. `calculate_dis.py` implements this distance measure between different words.

Usage:

```python
from scipy import spatial

emb_1 = embeddings[word_dict['world']]
emb_2 = embeddings[word_dict['would']]

print spatial.distance.cosine(emb_1, emb_2)
```

    0.99375076448
## Summary

In this chapter we introduced word vectors, the relationship between language models and word vectors, and how to obtain word vectors by training neural network models. In information retrieval, the cosine between vectors can be used to judge the relevance between a query and the keywords of a document. In syntactic and semantic analysis, trained word vectors can be used to initialize models for better results. In document classification, word vectors can be used to cluster synonyms. We hope that after reading this chapter you can apply word vectors in your own research.