README.md 32.0 KB
Newer Older
C
choijulie 已提交
1
# Semantic Role Labeling
C
caoying03 已提交
2

C
choijulie 已提交
3
The source code of this chapter is live on [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles).
L
Luo Tao 已提交
4

L
Luo Tao 已提交
5
For instructions on getting started with PaddlePaddle, see [PaddlePaddle installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book).
C
caoying03 已提交
6

C
choijulie 已提交
7
## Background
8

C
choijulie 已提交
9
Natural language analysis techniques consist of lexical, syntactic, and semantic analysis. **Semantic Role Labeling (SRL)** is an instance of **Shallow Semantic Analysis**.
C
caoying03 已提交
10

C
choijulie 已提交
11
In a sentence, a **predicate** states a property or a characterization of a *subject*, such as what it does and what it is like. The predicate represents the core of an event, whereas the words accompanying the predicate are **arguments**. A **semantic role** refers to the abstract role an argument of a predicate take on in the event, including *agent*, *patient*, *theme*, *experiencer*, *beneficiary*, *instrument*, *location*, *goal*, and *source*.
C
caoying03 已提交
12

C
choijulie 已提交
13
In the following example of a Chinese sentence, "to encounter" is the predicate (*pred*); "Ming" is the *agent*; "Hong" is the *patient*; "yesterday" and "evening" are the *time*; finally, "the park" is the *location*.
14

C
choijulie 已提交
15 16 17 18 19 20 21 22 23 24 25
$$\mbox{[小明 Ming]}_{\mbox{Agent}}\mbox{[昨天 yesterday]}_{\mbox{Time}}\mbox{[晚上 evening]}_\mbox{Time}\mbox{在[公园 a park]}_{\mbox{Location}}\mbox{[遇到 to encounter]}_{\mbox{Predicate}}\mbox{了[小红 Hong]}_{\mbox{Patient}}\mbox{。}$$

Instead of analyzing the semantic information, **Semantic Role Labeling** (**SRL**) identifies the relation between the predicate and the other constituents surrounding it. The predicate-argument structures are labeled as specific semantic roles. A wide range of natural language understanding tasks, including *information extraction*, *discourse analysis*, and *deepQA*. Research usually assumes a predicate of a sentence to be specified; the only task is to identify its arguments and their semantic roles.

Conventional SRL systems mostly build on top of syntactic analysis, usually consisting of five steps:

1. Construct a syntax tree, as shown in Fig. 1
2. Identity the candidate arguments of the given predicate on the tree.
3. Prune the most unlikely candidate arguments.
4. Identify the real arguments, often by a binary classifier.
5. Multi-classify on results from step 4 to label the semantic roles. Steps 2 and 3 usually introduce hand-designed features based on syntactic analysis (step 1).
C
caoying03 已提交
26

27

C
caoying03 已提交
28
<div  align="center">
C
choijulie 已提交
29 30
<img src="image/dependency_parsing_en.png" width = "80%" align=center /><br>
Fig 1. Syntax tree
C
caoying03 已提交
31
</div>
C
caoying03 已提交
32 33


C
choijulie 已提交
34 35 36
However, a complete syntactic analysis requires identifying the relation among all constituents. Thus, the accuracy of SRL is sensitive to the preciseness of the syntactic analysis, making SRL challenging. To reduce its complexity and obtain some information on the syntactic structures, we often use *shallow syntactic analysis* a.k.a. partial parsing or chunking. Unlike complete syntactic analysis, which requires the construction of the complete parsing tree, *Shallow Syntactic Analysis* only requires identifying some independent constituents with relatively simple structures, such as verb phrases (chunk). To avoid difficulties in constructing a syntax tree with high accuracy, some work\[[1](#Reference)\] proposed semantic chunking-based SRL methods, which reduces SRL into a sequence tagging problem. Sequence tagging tasks classify syntactic chunks using **BIO representation**. For syntactic chunks forming role A, its first chunk receives the B-A tag (Begin) and the remaining ones receive the tag I-A (Inside); in the end, the chunks left out receive the tag O.

The BIO representation of above example is shown in Fig.1.
C
caoying03 已提交
37

C
caoying03 已提交
38
<div  align="center">
C
choijulie 已提交
39 40
<img src="image/bio_example_en.png" width = "90%"  align=center /><br>
Fig 2. BIO representation
C
caoying03 已提交
41
</div>
C
caoying03 已提交
42

C
choijulie 已提交
43 44 45 46 47 48 49
This example illustrates the simplicity of sequence tagging, since

1. It only relies on shallow syntactic analysis, reduces the precision requirement of syntactic analysis;
2. Pruning the candidate arguments is no longer necessary;
3. Arguments are identified and tagged at the same time. Simplifying the workflow reduces the risk of accumulating errors; oftentimes, methods that unify multiple steps boost performance.

In this tutorial, our SRL system is built as an end-to-end system via a neural network. The system takes only text sequences as input, without using any syntactic parsing results or complex hand-designed features. The public dataset [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) is used for the following task: given a sentence with predicates marked, identify the corresponding arguments and their semantic roles through sequence tagging.
C
caoying03 已提交
50

C
choijulie 已提交
51
## Model
C
caoying03 已提交
52

C
choijulie 已提交
53
**Recurrent Neural Networks** (*RNN*) are important tools for sequence modeling and have been successfully used in some natural language processing tasks. Unlike feed-forward neural networks, RNNs can model the dependencies between elements of sequences. As a variant of RNNs', LSTMs aim model long-term dependency in long sequences. We have introduced this in [understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/05.understand_sentiment). In this chapter, we continue to use LSTMs to solve SRL problems.
C
caoying03 已提交
54

C
choijulie 已提交
55
### Stacked Recurrent Neural Network
C
caoying03 已提交
56

C
choijulie 已提交
57
*Deep Neural Networks* can extract hierarchical representations. The higher layers can form relatively abstract/complex representations, based on primitive features discovered through the lower layers. Unfolding LSTMs through time results in a deep feed-forward neural network. This is because any computational path between the input at time $k < t$ to the output at time $t$ crosses several nonlinear layers. On the other hand, due to parameter sharing over time, LSTMs are also *shallow*; that is, the computation carried out at each time-step is just a linear transformation. Deep LSTM networks are typically constructed by stacking multiple LSTM layers on top of each other and taking the output from lower LSTM layer at time $t$ as the input of upper LSTM layer at time $t$. Deep, hierarchical neural networks can be efficient at representing some functions and modeling varying-length dependencies\[[2](#Reference)\].
C
caoying03 已提交
58

59

C
choijulie 已提交
60
However, in a deep LSTM network, any gradient propagated back in depth needs to traverse a large number of nonlinear steps. As a result, while LSTMs of 4 layers can be trained properly, those with 4-8 have much worse performance. Conventional LSTMs prevent back-propagated errors from vanishing or exploding by introducing shortcut connections to skip the intermediate nonlinear layers. Therefore, deep LSTMs can consider shortcut connections in depth as well.
C
caoying03 已提交
61

62

C
choijulie 已提交
63 64 65 66 67 68 69 70 71
A single LSTM cell has three operations:

1. input-to-hidden: map input $x$ to the input of the forget gates, input gates, memory cells and output gates by linear transformation (i.e., matrix mapping);
2. hidden-to-hidden: calculate forget gates, input gates, output gates and update memory cell, this is the main part of LSTMs;
3. hidden-to-output: this part typically involves an activation operation on hidden states.

Based on the stacked LSTMs, we add shortcut connections: take the input-to-hidden from the previous layer as a new input and learn another linear transformation.

Fig.3 illustrates the final stacked recurrent neural networks.
72

73
<p align="center">  
C
choijulie 已提交
74 75
<img src="./image/stacked_lstm_en.png" width = "40%"  align=center><br>
Fig 3. Stacked Recurrent Neural Networks
76
</p>
C
caoying03 已提交
77

C
choijulie 已提交
78
### Bidirectional Recurrent Neural Network
C
caoying03 已提交
79

C
choijulie 已提交
80 81 82
While LSTMs can summarize the history -- all the previous input seen up until now -- they can not see the future. Because most NLP (natural language processing) tasks provide the entirety of sentences, sequential learning can benefit from having the future encoded as well as the history.

To address, we can design a bidirectional recurrent neural network by making a minor modification. A higher LSTM layer can process the sequence in reversed direction with regards to its immediate lower LSTM layer, i.e., deep LSTM layers take turns to train on input sequences from left-to-right and right-to-left. Therefore, LSTM layers at time-step $t$ can see both histories and the future, starting from the second layer. Fig. 4 illustrates the bidirectional recurrent neural networks.
C
caoying03 已提交
83

84

85
<p align="center">  
C
choijulie 已提交
86 87
<img src="./image/bidirectional_stacked_lstm_en.png" width = "60%" align=center><br>
Fig 4. Bidirectional LSTMs
88
</p>
C
caoying03 已提交
89

90
Note that, this bidirectional RNNs is different with the one proposed by Bengio et al. in machine translation tasks \[[3](#Reference), [4](#Reference)\]. We will introduce another bidirectional RNNs in the following tasks [machine translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md)
C
choijulie 已提交
91 92

### Conditional Random Field (CRF)
C
caoying03 已提交
93

C
choijulie 已提交
94
Typically, a neural network's lower layers learn representations while its very top layer learns the final task. These principles can guide our problem-solving approaches. In SRL tasks, a **Conditional Random Field** (*CRF*) is built on top of the network in order to perform the final prediction to tag sequences. It takes as input the representations provided by the last LSTM layer.
C
caoying03 已提交
95

C
caoying03 已提交
96

C
choijulie 已提交
97
The CRF is an undirected probabilistic graph with nodes denoting random variables and edges denoting dependencies between these variables. In essence, CRFs learn the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ are sequences of input and $Y = (y_1, y_2, ... , y_n)$ are label sequences; to decode, simply search through $Y$ for a sequence that maximizes the conditional probability $P(Y|X)$, i.e., $Y^* = \mbox{arg max}_{Y} P(Y | X)$。
C
caoying03 已提交
98

C
choijulie 已提交
99
Sequence tagging tasks do not assume a lot of conditional independence, because they are only concerned with the input and the output being linear sequences. Thus, the graph model of sequence tagging tasks is usually a simple chain or line, which results in a **Linear-Chain Conditional Random Field**, shown in Fig.5.
C
caoying03 已提交
100

101
<p align="center">  
Y
Yu Yang 已提交
102
<img src="./image/linear_chain_crf.png" width = "35%" align=center><br>
C
choijulie 已提交
103
Fig 5. Linear Chain Conditional Random Field used in SRL tasks
C
caoying03 已提交
104
</p>
C
caoying03 已提交
105

C
choijulie 已提交
106
By the fundamental theorem of random fields \[[5](#Reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form:
C
caoying03 已提交
107

C
caoying03 已提交
108
$$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$
C
caoying03 已提交
109

C
choijulie 已提交
110 111

where, $Z(X)$ is normalization constant, ${t_j}$ represents the feature functions defined on edges called the *transition feature*, which denotes the transition probabilities from $y_{i-1}$ to $y_i$ given input sequence $X$. ${s_k}$ represents the feature function defined on nodes, called the state feature, denoting the probability of $y_i$ given input sequence $X$. In addition, $\lambda_j$ and $\mu_k$ are weights corresponding to $t_j$ and $s_k$. Alternatively, $t$ and $s$ can be written in the same form that depends on $y_{i - 1}$, $y_i$, $X$, and $i$. Taking its summation over all nodes $i$, we have: $f_{k}(Y, X) = \sum_{i=1}^{n}f_k({y_{i - 1}, y_i, X, i})$, which defines the *feature function* $f$. Thus, $P(Y|X)$ can be written as:
C
caoying03 已提交
112 113

$$p(Y|X, W) = \frac{1}{Z(X)}\text{exp}\sum_{k}\omega_{k}f_{k}(Y, X)$$
C
caoying03 已提交
114

C
choijulie 已提交
115 116
where $\omega$ are the weights to the feature function that the CRF learns. While training, given input sequences and label sequences $D = \left[(X_1,  Y_1), (X_2 , Y_2) , ... , (X_N, Y_N)\right]$, by maximum likelihood estimation (**MLE**), we construct the following objective function:

C
caoying03 已提交
117

M
Mimee 已提交
118
$$\DeclareMathOperator*{\argmax}{arg\,max} L(\lambda, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W)\right) + C \frac{1}{2}\lVert W\rVert^{2}$$
C
caoying03 已提交
119 120


121
This objective function can be solved via back-propagation in an end-to-end manner. While decoding, given input sequences $X$, search for sequence $\bar{Y}$ to maximize the conditional probability $\bar{P}(Y|X)$ via decoding methods (such as *Viterbi*, or [Beam Search Algorithm](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md#beam-search-algorithm)).
C
choijulie 已提交
122 123 124 125

### Deep Bidirectional LSTM (DB-LSTM) SRL model

Given predicates and a sentence, SRL tasks aim to identify arguments of the given predicate and their semantic roles. If a sequence has $n$ predicates, we will process this sequence $n$ times. Here is the breakdown of a straight-forward model:
C
update.  
caoying03 已提交
126

C
choijulie 已提交
127 128 129 130 131 132
1. Construct inputs;
 - input 1: predicate, input 2: sentence
 - expand input 1 into a sequence of the same length with input 2's sentence, using one-hot representation;
2. Convert the one-hot sequences from step 1 to vector sequences via a word embedding's lookup table;
3. Learn the representation of input sequences by taking vector sequences from step 2 as inputs;
4. Take the representation from step 3 as input, label sequence as supervisory signal, and realize sequence tagging tasks.
C
update.  
caoying03 已提交
133

C
choijulie 已提交
134
Here, we propose some improvements by introducing two simple but effective features:
135

C
choijulie 已提交
136
- predicate context (**ctx-p**): A single predicate word may not describe all the predicate information, especially when the same words appear multiple times in a sentence. With the expanded context, the ambiguity can be largely eliminated. Thus, we extract $n$ words before and after predicate to construct a window chunk.
C
update.  
caoying03 已提交
137

C
choijulie 已提交
138
- region mark ($m_r$): The binary marker on a word, $m_r$, takes the value of $1$ when the word is in the predicate context region, and $0$ if not.
C
update.  
caoying03 已提交
139

C
choijulie 已提交
140 141 142 143 144 145 146 147
After these modifications, the model is as follows, as illustrated in Figure 6:

1. Construct inputs
 - Input 1: word sequence. Input 2: predicate. Input 3: predicate context, extract $n$ words before and after predicate. Input 4: region mark sequence, where an entry is 1 if word is located in the predicate context region, 0 otherwise.
 - expand input 2~3 into sequences with the same length with input 1
2. Convert input 1~4 to vector sequences via word embedding lookup tables; While input 1 and 3 shares the same lookup table, input 2 and 4 have separate lookup tables.
3. Take the four vector sequences from step 2 as inputs to bidirectional LSTMs; Train the LSTMs to update representations.
4. Take the representation from step 3 as input to CRF, label sequence as supervisory signal, and complete sequence tagging tasks.
C
update.  
caoying03 已提交
148 149


150
<div  align="center">  
C
choijulie 已提交
151 152
<img src="image/db_lstm_network_en.png" width = "60%"  align=center /><br>
Fig 6. DB-LSTM for SRL tasks
C
caoying03 已提交
153
</div>
C
caoying03 已提交
154

C
choijulie 已提交
155
## Data Preparation
C
caoying03 已提交
156

C
choijulie 已提交
157
In the tutorial, we use [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL task open dataset as an example. Note that the training set and development set of the CoNLL 2005 SRL task are not free to download after the competition. Currently, only the test set can be obtained, including 23 sections of the Wall Street Journal and three sections of the Brown corpus. In this tutorial, we use the WSJ corpus as the training dataset to explain the model. However, since the training set is small, for a usable neural network SRL system, please consider paying for the full corpus.
C
caoying03 已提交
158

C
choijulie 已提交
159
The original data includes a variety of information such as POS tagging, naming entity recognition, syntax tree, etc. In this tutorial, we only use the data under `test.wsj/words/` (text sequence) and `test.wsj/props/` (label results). The data directory used in this tutorial is as follows:
C
caoying03 已提交
160 161 162 163

```text
conll05st-release/
└── test.wsj
C
caoying03 已提交
164
    ├── props  # 标注结果
C
caoying03 已提交
165
    └── words  # 输入文本序列
C
caoying03 已提交
166
```
C
caoying03 已提交
167

C
choijulie 已提交
168
The annotation information is derived from the results of Penn TreeBank\[[7](#references)\] and PropBank \[[8](# references)\]. The labeling of the PropBank is different from the labeling methods mentioned before, but shares with it the same underlying principle. For descriptions of the labeling, please refer to the paper \[[9](#references)\].
C
caoying03 已提交
169

C
choijulie 已提交
170
The raw data needs to be preprocessed into formats that PaddlePaddle can handle. The preprocessing consists of the following steps:
D
dangqingqing 已提交
171

C
choijulie 已提交
172 173 174 175 176
1. Merge the text sequence and the tag sequence into the same record;
2. If a sentence contains $n$ predicates, the sentence will be processed $n$ times into $n$ separate training samples, each sample with a different predicate;
3. Extract the predicate context and construct the predicate context region marker;
4. Construct the markings in BIO format;
5. Obtain the integer index corresponding to the word according to the dictionary.
D
dangqingqing 已提交
177 178 179

```python
# import paddle.v2.dataset.conll05 as conll05
C
choijulie 已提交
180 181 182
# conll05.corpus_reader does step 1 and 2 as mentioned above.
# conll05.reader_creator does step 3 to 5.
# conll05.test gets preprocessed training instances.
D
dangqingqing 已提交
183
```
C
caoying03 已提交
184

C
choijulie 已提交
185
After preprocessing, a training sample contains nine features, namely: word sequence, predicate, predicate context (5 columns), region mark sequence, label sequence. The following table is an example of a training sample.
C
caoying03 已提交
186

C
choijulie 已提交
187
| word sequence | predicate | predicate context(5 columns) | region mark sequence | label sequence|
C
caoying03 已提交
188 189 190 191 192 193 194 195 196
|---|---|---|---|---|
| A | set | n't been set . × | 0 | B-A1 |
| record | set | n't been set . × | 0 | I-A1 |
| date | set | n't been set . × | 0 | I-A1 |
| has | set | n't been set . × | 0 | O |
| n't | set | n't been set . × | 1 | B-AM-NEG |
| been | set | n't been set . × | 1 | O |
| set | set | n't been set . × | 1 | B-V |
| . | set | n't been set . × | 1 | O |
C
caoying03 已提交
197

C
choijulie 已提交
198
In addition to the data, we provide following resources:
C
update.  
caoying03 已提交
199

C
choijulie 已提交
200
| filename | explanation |
D
dangqingqing 已提交
201
|---|---|
C
choijulie 已提交
202 203 204 205
| word_dict | dictionary of input sentences, total 44068 words |
| label_dict | dictionary of labels, total 106 labels |
| predicate_dict | predicate dictionary, total 3162 predicates |
| emb | a pre-trained word vector lookup table, 32-dimentional |
C
caoying03 已提交
206

C
choijulie 已提交
207
We trained a language model on the English Wikipedia to get a word vector lookup table used to initialize the SRL model. While training the SRL model, the word vector lookup table is no longer updated. To learn more about the language model and the word vector lookup table, please refer to the tutorial [word vector](https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/README.md). There are 995,000,000 tokens in the training corpus, and the dictionary size is 4900,000 words. In the CoNLL 2005 training corpus, 5% of the words are not in the 4900,000 words, and we see them all as unknown words, represented by `<unk>`.
C
caoying03 已提交
208

C
choijulie 已提交
209
Here we fetch the dictionary, and print its size:
C
caoying03 已提交
210

C
caoying03 已提交
211
```python
D
dangqingqing 已提交
212 213
import math
import numpy as np
214
import gzip
D
dangqingqing 已提交
215 216
import paddle.v2 as paddle
import paddle.v2.dataset.conll05 as conll05
217
import paddle.v2.evaluator as evaluator
C
update.  
caoying03 已提交
218

D
dangqingqing 已提交
219 220
paddle.init(use_gpu=False, trainer_count=1)

D
dangqingqing 已提交
221 222 223 224
word_dict, verb_dict, label_dict = conll05.get_dict()
word_dict_len = len(word_dict)
label_dict_len = len(label_dict)
pred_len = len(verb_dict)
C
caoying03 已提交
225

D
dangqingqing 已提交
226 227 228
print word_dict_len
print label_dict_len
print pred_len
C
caoying03 已提交
229 230
```

C
choijulie 已提交
231
## Model Configuration
C
update.  
caoying03 已提交
232

C
choijulie 已提交
233
- Define input data dimensions and model hyperparameters.
C
update.  
caoying03 已提交
234

D
dangqingqing 已提交
235
```python
C
choijulie 已提交
236 237 238 239 240 241 242 243 244 245 246 247
mark_dict_len = 2    # value range of region mark. Region mark is either 0 or 1, so range is 2
word_dim = 32        # word vector dimension
mark_dim = 5         # adjacent dimension
hidden_dim = 512     # the dimension of LSTM hidden layer vector is 128 (512/4)
depth = 8            # depth of stacked LSTM

# There are 9 features per sample, so we will define 9 data layers.
# They type for each layer is integer_value_sequence.
def d_type(value_range):
    return paddle.data_type.integer_value_sequence(value_range)

# word sequence
D
dangqingqing 已提交
248
word = paddle.layer.data(name='word_data', type=d_type(word_dict_len))
C
choijulie 已提交
249
# predicate
250
predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len))
D
dangqingqing 已提交
251

C
choijulie 已提交
252
# 5 features for predicate context
253
ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len))
D
dangqingqing 已提交
254 255 256 257 258
ctx_n1 = paddle.layer.data(name='ctx_n1_data', type=d_type(word_dict_len))
ctx_0 = paddle.layer.data(name='ctx_0_data', type=d_type(word_dict_len))
ctx_p1 = paddle.layer.data(name='ctx_p1_data', type=d_type(word_dict_len))
ctx_p2 = paddle.layer.data(name='ctx_p2_data', type=d_type(word_dict_len))

C
choijulie 已提交
259
# region marker sequence
D
dangqingqing 已提交
260 261
mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len))

C
choijulie 已提交
262
# label sequence
D
dangqingqing 已提交
263 264
target = paddle.layer.data(name='target', type=d_type(label_dict_len))
```
265

C
choijulie 已提交
266
Note that `hidden_dim = 512` means a LSTM hidden vector of 128 dimension (512/4). Please refer to PaddlePaddle's official documentation for detail: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory)
C
update.  
caoying03 已提交
267

C
choijulie 已提交
268
- Transform the word sequence itself, the predicate, the predicate context, and the region mark sequence into embedded vector sequences.
C
update.  
caoying03 已提交
269

270
```python  
D
dangqingqing 已提交
271

C
choijulie 已提交
272 273
# Since word vectorlookup table is pre-trained, we won't update it this time.
# is_static being True prevents updating the lookup table during training.
D
dangqingqing 已提交
274
emb_para = paddle.attr.Param(name='emb', initial_std=0., is_static=True)
C
choijulie 已提交
275
# hyperparameter configurations
D
dangqingqing 已提交
276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295
default_std = 1 / math.sqrt(hidden_dim) / 3.0
std_default = paddle.attr.Param(initial_std=default_std)
std_0 = paddle.attr.Param(initial_std=0.)

predicate_embedding = paddle.layer.embedding(
    size=word_dim,
    input=predicate,
    param_attr=paddle.attr.Param(
        name='vemb', initial_std=default_std))
mark_embedding = paddle.layer.embedding(
    size=mark_dim, input=mark, param_attr=std_0)

word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
emb_layers = [
    paddle.layer.embedding(
        size=word_dim, input=x, param_attr=emb_para) for x in word_input
]
emb_layers.append(predicate_embedding)
emb_layers.append(mark_embedding)
```
C
update.  
caoying03 已提交
296

C
choijulie 已提交
297
- 8 LSTM units are trained through alternating left-to-right / right-to-left order denoted by the variable `reverse`.
C
update.  
caoying03 已提交
298

D
dangqingqing 已提交
299 300
```python  
hidden_0 = paddle.layer.mixed(
301 302 303 304 305 306
    size=hidden_dim,
    bias_attr=std_default,
    input=[
        paddle.layer.full_matrix_projection(
            input=emb, param_attr=std_default) for emb in emb_layers
    ])
D
dangqingqing 已提交
307 308 309 310 311 312 313 314 315 316 317 318 319 320

mix_hidden_lr = 1e-3
lstm_para_attr = paddle.attr.Param(initial_std=0.0, learning_rate=1.0)
hidden_para_attr = paddle.attr.Param(
    initial_std=default_std, learning_rate=mix_hidden_lr)

lstm_0 = paddle.layer.lstmemory(
    input=hidden_0,
    act=paddle.activation.Relu(),
    gate_act=paddle.activation.Sigmoid(),
    state_act=paddle.activation.Sigmoid(),
    bias_attr=std_0,
    param_attr=lstm_para_attr)

C
choijulie 已提交
321
# stack L-LSTM and R-LSTM with direct edges
D
dangqingqing 已提交
322 323 324 325
input_tmp = [hidden_0, lstm_0]

for i in range(1, depth):
    mix_hidden = paddle.layer.mixed(
D
dangqingqing 已提交
326 327 328 329
        size=hidden_dim,
        bias_attr=std_default,
        input=[
            paddle.layer.full_matrix_projection(
D
dangqingqing 已提交
330 331 332
                input=input_tmp[0], param_attr=hidden_para_attr),
            paddle.layer.full_matrix_projection(
                input=input_tmp[1], param_attr=lstm_para_attr)
D
dangqingqing 已提交
333 334
        ])

D
dangqingqing 已提交
335 336
    lstm = paddle.layer.lstmemory(
        input=mix_hidden,
D
dangqingqing 已提交
337 338 339
        act=paddle.activation.Relu(),
        gate_act=paddle.activation.Sigmoid(),
        state_act=paddle.activation.Sigmoid(),
D
dangqingqing 已提交
340
        reverse=((i % 2) == 1),
D
dangqingqing 已提交
341 342 343
        bias_attr=std_0,
        param_attr=lstm_para_attr)

D
dangqingqing 已提交
344 345
    input_tmp = [mix_hidden, lstm]
```
C
update.  
caoying03 已提交
346

C
choijulie 已提交
347
- In PaddlePaddle, state features and transition features of a CRF are implemented by a fully connected layer and a CRF layer seperately. The fully connected layer with linear activation learns the state features, here we use paddle.layer.mixed (paddle.layer.fc can be uesed as well), and the CRF layer in PaddlePaddle: paddle.layer.crf only learns the transition features, which is a cost layer and is the last layer of the network. paddle.layer.crf outputs the log probability of true tag sequence as the cost by given the input sequence and it requires the true tag sequence as target in the learning process.
C
update.  
caoying03 已提交
348

D
dangqingqing 已提交
349
```python
C
caoying03 已提交
350

C
choijulie 已提交
351 352 353
# The output of the top LSTM unit and its input are feed into a fully connected layer,
# size of which equals to size of tag labels.
# The fully connected layer learns the state features
C
caoying03 已提交
354

D
dangqingqing 已提交
355
feature_out = paddle.layer.mixed(
356 357 358 359 360 361
    size=label_dict_len,
    bias_attr=std_default,
    input=[
        paddle.layer.full_matrix_projection(
            input=input_tmp[0], param_attr=hidden_para_attr),
        paddle.layer.full_matrix_projection(
C
choijulie 已提交
362
            input=input_tmp[1], param_attr=lstm_para_attr)], )
C
update.  
caoying03 已提交
363

D
dangqingqing 已提交
364 365 366 367 368 369 370 371 372
crf_cost = paddle.layer.crf(
    size=label_dict_len,
    input=feature_out,
    label=target,
    param_attr=paddle.attr.Param(
        name='crfw',
        initial_std=default_std,
        learning_rate=mix_hidden_lr))
```
C
update.  
caoying03 已提交
373

C
choijulie 已提交
374
- The CRF decoding layer is used for evaluation and inference. It shares weights with CRF layer.  The sharing of parameters among multiple layers is specified by using the same parameter name in these layers. If true tag sequence is provided in training process, `paddle.layer.crf_decoding` calculates labelling error for each input token and `evaluator.sum` sum the error over the entire sequence. Otherwise, `paddle.layer.crf_decoding`  generates the labelling tags.
D
dangqingqing 已提交
375

D
dangqingqing 已提交
376 377 378 379 380 381
```python
crf_dec = paddle.layer.crf_decoding(
   size=label_dict_len,
   input=feature_out,
   label=target,
   param_attr=paddle.attr.Param(name='crfw'))
382
evaluator.sum(input=crf_dec)
D
dangqingqing 已提交
383
```
D
dangqingqing 已提交
384

C
choijulie 已提交
385
## Train model
D
dangqingqing 已提交
386

C
choijulie 已提交
387
### Create Parameters
D
dangqingqing 已提交
388

C
choijulie 已提交
389
All necessary parameters will be traced created given output layers that we need to use.
D
dangqingqing 已提交
390 391

```python
392
parameters = paddle.parameters.create(crf_cost)
C
caoying03 已提交
393
```
C
caoying03 已提交
394

C
choijulie 已提交
395
We can print out parameter name. It will be generated if not specified.
396

D
dangqingqing 已提交
397 398 399
```python
print parameters.keys()
```
C
caoying03 已提交
400

C
choijulie 已提交
401
Now we load the pre-trained word lookup tables from word embeddings trained on the English language Wikipedia.
D
dangqingqing 已提交
402 403 404 405

```python
def load_parameter(file_name, h, w):
    with open(file_name, 'rb') as f:
D
dangqingqing 已提交
406 407
        f.read(16)
        return np.fromfile(f, dtype=np.float32).reshape(h, w)
D
dangqingqing 已提交
408
parameters.set('emb', load_parameter(conll05.get_embedding(), 44068, 32))
C
caoying03 已提交
409 410
```

C
choijulie 已提交
411
### Create Trainer
C
caoying03 已提交
412

C
choijulie 已提交
413
We will create trainer given model topology, parameters, and optimization method. We will use the most basic **SGD** method, which is a momentum optimizer with 0 momentum. Meanwhile, we will set learning rate and regularization.
C
caoying03 已提交
414

D
dangqingqing 已提交
415 416 417
```python
optimizer = paddle.optimizer.Momentum(
    momentum=0,
418
    learning_rate=1e-3,
D
dangqingqing 已提交
419 420 421 422 423 424
    regularization=paddle.optimizer.L2Regularization(rate=8e-4),
    model_average=paddle.optimizer.ModelAverage(
        average_window=0.5, max_average_window=10000), )

trainer = paddle.trainer.SGD(cost=crf_cost,
                             parameters=parameters,
425 426
                             update_equation=optimizer,
                             extra_layers=crf_dec)
D
dangqingqing 已提交
427 428
```

C
choijulie 已提交
429
### Trainer
D
dangqingqing 已提交
430

C
choijulie 已提交
431
As mentioned in data preparation section, we will use CoNLL 2005 test corpus as the training data set. `conll05.test()` outputs one training instance at a time. It is shuffled and batched into mini batches, and used as input.
C
caoying03 已提交
432 433

```python
D
dangqingqing 已提交
434 435
reader = paddle.batch(
    paddle.reader.shuffle(
436
        conll05.test(), buf_size=8192), batch_size=2)
C
caoying03 已提交
437 438
```

C
choijulie 已提交
439
`feeding` is used to specify the correspondence between data instance and data layer. For example, according to following `feeding`, the 0th column of data instance produced by`conll05.test()` is matched to the data layer named `word_data`.
D
dangqingqing 已提交
440 441

```python
D
dangqingqing 已提交
442
feeding = {
D
dangqingqing 已提交
443 444 445 446 447 448 449 450 451 452
    'word_data': 0,
    'ctx_n2_data': 1,
    'ctx_n1_data': 2,
    'ctx_0_data': 3,
    'ctx_p1_data': 4,
    'ctx_p2_data': 5,
    'verb_data': 6,
    'mark_data': 7,
    'target': 8
}
C
caoying03 已提交
453 454
```

C
choijulie 已提交
455
`event_handler` can be used as callback for training events, it will be used as an argument for the `train` method. Following `event_handler` prints cost during training.
C
caoying03 已提交
456

D
dangqingqing 已提交
457 458 459
```python
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
460
        if event.batch_id and event.batch_id % 10 == 0:
461 462
            print "Pass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)
463
        if event.batch_id % 400 == 0:
464 465 466 467 468 469
            result = trainer.test(reader=reader, feeding=feeding)
            print "\nTest with Pass %d, Batch %d, %s" % (event.pass_id, event.batch_id, result.metrics)

    if isinstance(event, paddle.event.EndPass):
        # save parameters
        with gzip.open('params_pass_%d.tar.gz' % event.pass_id, 'w') as f:
470
            parameters.to_tar(f)
471 472 473

        result = trainer.test(reader=reader, feeding=feeding)
        print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
D
dangqingqing 已提交
474 475
```

C
choijulie 已提交
476
`trainer.train` will train the model.
D
dangqingqing 已提交
477 478 479 480 481

```python
trainer.train(
    reader=reader,
    event_handler=event_handler,
C
choijulie 已提交
482
    num_passes=10000,
D
dangqingqing 已提交
483
    feeding=feeding)
C
caoying03 已提交
484 485
```

C
choijulie 已提交
486
### Application
487

C
choijulie 已提交
488
Aftern training is done, we need to select an optimal model based one performance index to do inference. In this task, one can simply select the model with the least number of marks on the test set. The `paddle.layer.crf_decoding` layer is used in the inference, but its inputs does not include the ground truth label.
489 490 491 492 493 494 495 496

```python
predict = paddle.layer.crf_decoding(
    size=label_dict_len,
    input=feature_out,
    param_attr=paddle.attr.Param(name='crfw'))
```

C
choijulie 已提交
497
Here, using one testing sample as an example.
498 499 500 501 502 503 504 505 506 507

```python
test_creator = paddle.dataset.conll05.test()
test_data = []
for item in test_creator():
    test_data.append(item[0:8])
    if len(test_data) == 1:
        break
```

C
choijulie 已提交
508 509
The inference interface `paddle.infer` returns the index of predicting labels. Then printing the tagging results based dictionary `labels_reverse`.

510 511 512 513 514 515 516 517 518 519 520 521

```python
labs = paddle.infer(
    output_layer=predict, parameters=parameters, input=test_data, field='id')
assert len(labs) == len(test_data[0][0])
labels_reverse={}
for (k,v) in label_dict.items():
    labels_reverse[v]=k
pre_lab = [labels_reverse[i] for i in labs]
print pre_lab
```

C
choijulie 已提交
522
## Conclusion
C
update.  
caoying03 已提交
523

C
choijulie 已提交
524
Semantic Role Labeling is an important intermediate step in a wide range of natural language processing tasks. In this tutorial, we use SRL as an example to illustrate using PaddlePaddle to do sequence tagging tasks. The models proposed are from our published paper\[[10](#Reference)\]. We only use test data for illustration since the training data on the CoNLL 2005 dataset is not completely public. This aims to propose an end-to-end neural network model with fewer dependencies on natural language processing tools but is comparable, or even better than traditional models in terms of performance. Please check out our paper for more information and discussions.
C
update.  
caoying03 已提交
525

C
choijulie 已提交
526
## Reference
527 528 529 530 531 532 533 534 535 536
1. Sun W, Sui Z, Wang M, et al. [Chinese semantic role labeling with shallow parsing](http://www.aclweb.org/anthology/D09-1#page=1513)[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3. Association for Computational Linguistics, 2009: 1475-1483.
2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013.
3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014.
4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[J]. arXiv preprint arXiv:1409.0473, 2014.
5. Lafferty J, McCallum A, Pereira F. [Conditional random fields: Probabilistic models for segmenting and labeling sequence data](http://www.jmlr.org/papers/volume15/doppa14a/source/biblio.bib.old)[C]//Proceedings of the eighteenth international conference on machine learning, ICML. 2001, 1: 282-289.
6. 李航. 统计学习方法[J]. 清华大学出版社, 北京, 2012.
7. Marcus M P, Marcinkiewicz M A, Santorini B. [Building a large annotated corpus of English: The Penn Treebank](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports)[J]. Computational linguistics, 1993, 19(2): 313-330.
8. Palmer M, Gildea D, Kingsbury P. [The proposition bank: An annotated corpus of semantic roles](http://www.mitpressjournals.org/doi/pdfplus/10.1162/0891201053630264)[J]. Computational linguistics, 2005, 31(1): 71-106.
9. Carreras X, Màrquez L. [Introduction to the CoNLL-2005 shared task: Semantic role labeling](http://www.cs.upc.edu/~srlconll/st05/papers/intro.pdf)[C]//Proceedings of the Ninth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2005: 152-164.
10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
L
Luo Tao 已提交
537 538

<br/>
L
Luo Tao 已提交
539
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.