Fig 6. DB-LSTM for SRL tasks
(In the figure, 原句 denotes the input sentence and 反向LSTM the reverse-direction LSTM.)
## Data Preparation
### Data Introduction and Download
In this tutorial, we use the open dataset of the [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL task as an example. Running `sh ./get_data.sh` will automatically download the raw data from the official website. Note that the training set and development set of the CoNLL 2005 SRL task are no longer freely available after the competition; currently, only the test set can be obtained, including section 23 of the Wall Street Journal and 3 sections of the Brown corpus. In this tutorial, we use the WSJ data of the test set as our training set to explain the model. However, since this training set is far too small, if you want to train a usable neural network SRL system, please consider paying for the full corpus.
The raw data also includes a variety of other information, such as part-of-speech tags, named entities, and parse trees. In this tutorial, we only use the data under the words folder (text sequences) and the props folder (labeling results) inside the test.wsj folder. The data directory used in this tutorial is as follows:
The annotation information is derived from the results of Penn TreeBank \[[7](#references)\] and PropBank \[[8](#references)\]. The labels of PropBank differ from the labels we used in the example at the beginning of this article, but the principle is the same. For a description of the labels, please refer to the paper \[[9](#references)\].
The raw data needs to be preprocessed before being used by PaddlePaddle. The preprocessing consists of the following steps (a small sketch illustrating steps 2 and 3 follows the list):
1. Merge the text sequence and the tag sequence into the same record;
2. If a sentence contains $n$ predicates, the sentence will be processed $n$ times into $n$ separate training samples, each sample with a different predicate;
3. Extract the predicate context and construct the predicate context region marker;
4. Construct the labels in BIO format;
5. Obtain the integer index corresponding to the word according to the dictionary.
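The following is a minimal, illustrative sketch of steps 2 and 3. It is not the preprocessing script shipped with `get_data.sh`; the function and variable names are hypothetical.

```python
def expand_by_predicate(words, predicate_positions, window=5):
    """Create one sample per predicate (step 2), each with a predicate
    context window and a 0/1 context region marker (step 3)."""
    half = window // 2
    samples = []
    for pos in predicate_positions:
        lo = max(0, pos - half)
        hi = min(len(words), pos + half + 1)
        context = words[lo:hi]                    # predicate context words
        mark = [1 if lo <= i < hi else 0          # 1 inside the context region,
                for i in range(len(words))]       # 0 outside
        samples.append((words, words[pos], context, mark))
    return samples
```

For a sentence containing two predicates, this produces two independent samples, one per predicate.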
After preprocessing, a training sample contains nine features: the word sequence, the predicate, the predicate context (5 columns), the predicate context region marker, and the label sequence. The following table shows an example of one training sample.
| word sequence | predicate | predicate context (5 columns) | predicate context region marker | label sequence |
|---|---|---|---|---|
```python
# Final step of the data provider: convert each label to its integer index
# and yield one sample containing all 9 feature slots described above.
label_slot = [settings.label_dict.get(w) for w in label_list]
yield {
    'word_data': word_slot,
    'ctx_n2_data': ctx_n2_slot,
    'ctx_n1_data': ctx_n1_slot,
    'ctx_0_data': ctx_0_slot,
    'ctx_p1_data': ctx_p1_slot,
    'ctx_p2_data': ctx_p2_slot,
    'verb_data': predicate_slot,
    'mark_data': mark_slot,
    'target': label_slot
}
```
In addition to the data, `get_data.sh` also downloads the following resources:
| filename | explanation |
|---|---|
| word_dict | dictionary of input sentences, total 44068 words |
| label_dict | dictionary of labels, total 106 labels |
| predicate_dict | predicate dictionary, total 3162 predicates |
| emb | a pre-trained word embedding table, 32-dimensional |
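As a hedged illustration of how these resources can be used, the sketch below loads a dictionary file into a token-to-index mapping, assuming one token per line; the file paths and the helper name are assumptions, not part of the downloaded scripts.

```python
def load_dict(path):
    # one token per line; the line number becomes the integer index
    with open(path) as f:
        return {line.strip(): i for i, line in enumerate(f)}

# hypothetical paths; adjust to wherever get_data.sh placed the files
word_dict = load_dict('data/wordDict.txt')        # expected to hold 44068 words
label_dict = load_dict('data/targetDict.txt')     # expected to hold 106 labels
predicate_dict = load_dict('data/verbDict.txt')   # expected to hold 3162 predicates
```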
We trained a language model on English Wikipedia and obtained a pre-trained word embedding table, which is used to initialize the SRL model. During SRL model training, the word embeddings are no longer updated. For more on language models and word embeddings, please refer to the [word vector](https://github.com/PaddlePaddle/book/blob/develop/word2vec/README.md) tutorial. The corpus used to train the language model has 995,000,000 tokens, and the dictionary size is limited to 4,900,000 words. In the CoNLL 2005 training corpus, 5% of the words are not among these 4,900,000 words; we treat them all as unknown words, represented by `<unk>`.
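Below is a hedged sketch, in the v1-style configuration API used later in this section, of how a frozen pre-trained embedding could be expressed. The data layer `word`, the sizes, and the attribute values are illustrative assumptions; the pre-trained vectors are assumed to be loaded into a parameter named `emb`.

```python
from paddle.trainer_config_helpers import *

# 'emb' names the parameter so the downloaded pre-trained table can be loaded
# into it; is_static keeps it fixed (not updated) during SRL training.
emb_para = ParameterAttribute(name='emb', initial_std=0., is_static=True)

word_dim = 32                                      # illustrative embedding size
word = data_layer(name='word_data', size=44068)    # assumed input word sequence
word_embedding = embedding_layer(
    size=word_dim, input=word, param_attr=emb_para)
```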
## Model Configuration

4. We will concatenate the output of the top LSTM unit with its input and project the result into a hidden layer. Then we put a fully connected layer on top of it to obtain the final vector representation.
```python
# Parameters specified with std_0 are initialized from a Gaussian distribution
# with mean 0; std_0 is used to initialize the LSTM biases.
std_0 = ParameterAttribute(initial_std=0.)

hidden_0 = mixed_layer(
    name='hidden0',
    size=hidden_dim,
    bias_attr=std_default,
    input=[
        full_matrix_projection(
            input=emb, param_attr=std_default) for emb in emb_layers
    ])
```
We create the trainer according to the model topology, parameters, and optimization method. We use the most basic SGD method (a special case of the momentum optimizer with momentum set to 0). At the same time, we set the learning rate and regularization.
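A minimal sketch of this step, assuming the `paddle.v2` API used elsewhere in this tutorial; the learning rate and regularization values are illustrative, and `crf_cost` stands for the cost layer assumed to be defined in the model configuration.

```python
import paddle.v2 as paddle

# create the model parameters from the network topology
parameters = paddle.parameters.create(crf_cost)

# momentum=0 degenerates the momentum optimizer to plain SGD
optimizer = paddle.optimizer.Momentum(
    momentum=0,
    learning_rate=2e-2,
    regularization=paddle.optimizer.L2Regularization(rate=8e-4))

trainer = paddle.trainer.SGD(
    cost=crf_cost, parameters=parameters, update_equation=optimizer)
```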
As mentioned in the data preparation section, we use the CoNLL 2005 test corpus as the training data set. `conll05.test()` outputs one training instance at a time; the instances are shuffled and batched into mini-batches that serve as training input.
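A sketch of wrapping the data source, assuming the `paddle.v2` reader utilities and the built-in `conll05` dataset module; the buffer and batch sizes are illustrative.

```python
import paddle.v2 as paddle
import paddle.v2.dataset.conll05 as conll05

# shuffle within a buffer, then group instances into mini-batches
reader = paddle.batch(
    paddle.reader.shuffle(conll05.test(), buf_size=8192),
    batch_size=20)
```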
`reader_dict` is used to specify the correspondence between the columns of a data instance and the data layers. For example, according to the following `reader_dict`, the 0th column of each data instance produced by `conll05.test()` corresponds to the data layer named `word_data`.
```python
reader_dict = {
    'word_data': 0,
    'ctx_n2_data': 1,
    'ctx_n1_data': 2,
    'ctx_0_data': 3,
    'ctx_p1_data': 4,
    'ctx_p2_data': 5,
    'verb_data': 6,
    'mark_data': 7,
    'target': 8
}

# CRF decoding layer from the model configuration, used to decode the label
# sequence at prediction time.
crf_dec_l = crf_decoding_layer(
    name='crf_dec_l',
    size=label_dict_len,
    input=feature_out,
    param_attr=ParameterAttribute(name='crfw'))
```
Run the `python predict.py` script to make predictions with the specified model.
```bash
python predict.py \
    -c db_lstm.py \
    -w output/pass-00100 \
    -l data/targetDict.txt \
    -p data/verbDict.txt \
    -d data/wordDict.txt \
    -i data/feature \
    -o predict.res
```

- `-c`: the configuration file
- `-w`: the path of the model used for prediction
- `-l`: the label dictionary
- `-p`: the predicate dictionary
- `-d`: the dictionary of the input text sequences
- `-i`: the path of the input data
- `-o`: the path of the file to which the labeling results are written
`event_handler` can be used as a callback for training events; it is passed as an argument to `train`. Here we print the cost during training.
```python
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "Pass %d, Batch %d, Cost %f" % (
                event.pass_id, event.batch_id, event.cost)
```
`trainer.train` will train the model.
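A hedged sketch of launching training, assuming the `trainer`, `reader`, `reader_dict`, and `event_handler` defined above; the number of passes is illustrative, and in later PaddlePaddle versions the `reader_dict` argument is named `feeding`.

```python
trainer.train(
    reader=reader,
    event_handler=event_handler,
    num_passes=10000,        # illustrative; adjust to your needs
    reader_dict=reader_dict)
```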
After prediction completes, the file specified by the `-o` option contains the output in the following format: each line is one sample, with two columns separated by "\t"; the first column is the input text, and the second column is the labeling result. The semantic role labels of the arguments can be read directly from the BIO tags.
```text
The interest-only securities were priced at 35 1\/2 to yield 10.72 % . B-A0 I-A0 I-A0 O O O O O O B-V B-A1 I-A1 O