Note that `hidden_dim = 512` actually means an LSTM hidden vector of dimension 128 (512/4). Please refer to PaddlePaddle's official documentation for details: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory).
- Transform the word sequence itself, the predicate, the predicate context, and the region mark sequence into embedded vector sequences (a sketch of this step follows the snippet below).
- Define a parameter loader method to load the pre-trained word lookup table from word embeddings trained on the English Wikipedia.
```python
# Since the word vector lookup table is pre-trained, we will not update it during training.
# Setting is_static to True prevents the lookup table from being updated.
# The parameter is named 'emb' so the pre-trained embeddings can be loaded into it later.
emb_para = paddle.attr.Param(name='emb', initial_std=0., is_static=True)
```
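The embedding step described in the first bullet is configured elsewhere in the chapter. A minimal sketch is given below, assuming the data layers `word`, `ctx_n2`, `ctx_n1`, `ctx_0`, `ctx_p1`, `ctx_p2`, `predicate`, `mark` and the sizes `word_dim`, `mark_dim` have already been defined (these names are assumptions and are not shown in this section):

```python
# The six word-related sequences share the pre-trained, static 'emb' lookup
# table defined above; the predicate and the region mark sequence each get
# their own trainable embedding table.
word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
emb_layers = [
    paddle.layer.embedding(size=word_dim, input=x, param_attr=emb_para)
    for x in word_input
]
emb_layers.append(paddle.layer.embedding(size=word_dim, input=predicate))
emb_layers.append(paddle.layer.embedding(size=mark_dim, input=mark))
```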
- In PaddlePaddle, the state features and transition features of a CRF are implemented by a fully connected layer and a CRF layer, respectively. The fully connected layer with linear activation learns the state features; here we use `paddle.layer.mixed` (`paddle.layer.fc` could be used as well). The CRF layer, `paddle.layer.crf`, learns only the transition features. It is a cost layer and is the last layer of the network: given the input sequence, `paddle.layer.crf` outputs the log probability of the true tag sequence as the cost, and it requires the true tag sequence as its target during learning.
```python
# The output of the top LSTM unit and its input are fed into a fully connected
# layer whose size equals the number of tag labels.
# The fully connected layer learns the state features.
feature_out = paddle.layer.mixed(
    size=label_dict_len,
    bias_attr=std_default,
    input=[
        paddle.layer.full_matrix_projection(
            input=input_tmp[0], param_attr=hidden_para_attr),
        paddle.layer.full_matrix_projection(
            input=input_tmp[1], param_attr=lstm_para_attr)
    ])

crf_cost = paddle.layer.crf(
    size=label_dict_len,
    input=feature_out,
    label=target,
    param_attr=paddle.attr.Param(
        name='crfw',
        initial_std=default_std,
        learning_rate=mix_hidden_lr))
```
- The CRF decoding layer is used for evaluation and inference. It shares weights with the CRF layer; sharing parameters among multiple layers is specified by using the same parameter name in those layers. If the true tag sequence is provided during training, `paddle.layer.crf_decoding` calculates the labeling error for each input token and `evaluator.sum` sums the errors over the entire sequence. Otherwise, `paddle.layer.crf_decoding` generates the output tag sequence.
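A sketch of the decoding layer and the evaluator, reusing `feature_out` and `target` from the snippet above and sharing the `'crfw'` parameter with the CRF cost layer:

```python
# For evaluation: given the true labels, crf_decoding emits a labeling error
# for each token, which paddle.evaluator.sum accumulates over the sequence.
crf_dec = paddle.layer.crf_decoding(
    size=label_dict_len,
    input=feature_out,
    label=target,
    param_attr=paddle.attr.Param(name='crfw'))
paddle.evaluator.sum(input=crf_dec)
```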
- In the `train` method, we create the trainer given the model topology, parameters, and optimization method. We use the most basic **SGD** method, which is a momentum optimizer with 0 momentum. Meanwhile, we set the learning rate and regularization.
All necessary parameters are traced and created automatically, given the output layers that we need to use.
- As mentioned in the data preparation section, we will use the CoNLL 2005 test corpus as the training data set. `conll05.test()` outputs one training instance at a time. Instances are shuffled and batched into mini-batches that serve as input.
```python
parameters = paddle.parameters.create(crf_cost)
```
We can print out the parameter names. A name is generated automatically if not specified.
```python
print parameters.keys()
```
Now we load the pre-trained word lookup tables from word embeddings trained on the English language Wikipedia.
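A sketch of such a loader; the 16-byte header skip and the 44068 x 32 embedding shape are the values commonly used with `conll05.get_embedding()` and should be treated as assumptions:

```python
import numpy as np

def load_parameter(file_name, h, w):
    # The pre-trained embedding file stores a small header followed by a
    # float32 matrix of shape (h, w).
    with open(file_name, 'rb') as f:
        f.read(16)  # skip the header
        return np.fromfile(f, dtype=np.float32).reshape(h, w)

# Overwrite the 'emb' lookup table with the pre-trained word embeddings.
parameters.set('emb', load_parameter(conll05.get_embedding(), 44068, 32))
```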
### Trainer
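As noted in the bullets above, the trainer is created from the model topology, the parameters, and a momentum optimizer with 0 momentum, together with a learning rate and regularization. A minimal sketch, reusing `crf_cost` and `crf_dec` from earlier; the concrete hyper-parameter values here are illustrative assumptions:

```python
# A momentum optimizer with momentum=0 reduces to plain SGD;
# L2 regularization is added on top.
optimizer = paddle.optimizer.Momentum(
    momentum=0,
    learning_rate=1e-3,
    regularization=paddle.optimizer.L2Regularization(rate=8e-4))

trainer = paddle.trainer.SGD(
    cost=crf_cost,
    parameters=parameters,
    update_equation=optimizer,
    extra_layers=crf_dec)
```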
`feeding` is used to specify the correspondence between data instances and data layers. For example, according to the following `feeding`, the 0th column of each data instance produced by `conll05.test()` is matched to the data layer named `word_data`.
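A sketch of the reader and the `feeding` dictionary; only the `word_data` column (index 0) is stated above, so the remaining column indices, the shuffle buffer, and the batch size follow the usual `conll05` layout and are assumptions:

```python
# Shuffle the test corpus and batch it into mini-batches.
reader = paddle.batch(
    paddle.reader.shuffle(conll05.test(), buf_size=8192), batch_size=10)

# Map each column of a data instance to the data layer that consumes it.
feeding = {
    'word_data': 0,
    'ctx_n2_data': 1,
    'ctx_n1_data': 2,
    'ctx_0_data': 3,
    'ctx_p1_data': 4,
    'ctx_p2_data': 5,
    'verb_data': 6,
    'mark_data': 7,
    'target': 8
}
```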
`event_handler` can be used as a callback for training events; it is passed as an argument to the `train` method. The following `event_handler` prints the training cost every 100 batches and reports test metrics at the end of each pass.
print"\nTest with Pass %d, %s"%(event.pass_id,result.metrics)
```
`trainer.train` will train the model.
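A minimal sketch of the call, reusing `reader`, `event_handler`, and `feeding` from above; the number of passes is only illustrative:

```python
trainer.train(
    reader=reader,
    event_handler=event_handler,
    num_passes=1,
    feeding=feeding)
```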
## Application

- When training is completed, we need to select an optimal model based on a performance index to do inference. In this task, one can simply select the model with the fewest labeling errors on the test set. Below we demonstrate inference using the trained model.
```python
# Set up inputs by creating LoDTensors to represent sequences of words.
# Here each word is the basic element of these LoDTensors and the shape of
# each word (base_shape) should be [1] since it is simply an index to
# look up for the corresponding word vector.
# Suppose the length_based level of detail (lod) info is set to [[3, 4, 2]],
# which has only one lod level. Then the created LoDTensors will have only
# one higher level structure (sequence of words, or sentence) than the basic
# element (word). Hence the LoDTensor will hold data for three sentences of
# length 3, 4 and 2, respectively.
# Note that lod info should be a list of lists.
lod = [[3, 4, 2]]
base_shape = [1]
# The range of random integers is [low, high]
word = fluid.create_random_int_lodtensor(
    lod, base_shape, place, low=0, high=word_dict_len - 1)
pred = fluid.create_random_int_lodtensor(
    lod, base_shape, place, low=0, high=pred_dict_len - 1)
ctx_n2 = fluid.create_random_int_lodtensor(
    lod, base_shape, place, low=0, high=word_dict_len - 1)
ctx_n1 = fluid.create_random_int_lodtensor(
    lod, base_shape, place, low=0, high=word_dict_len - 1)
ctx_0 = fluid.create_random_int_lodtensor(
    lod, base_shape, place, low=0, high=word_dict_len - 1)
ctx_p1 = fluid.create_random_int_lodtensor(
    lod, base_shape, place, low=0, high=word_dict_len - 1)
ctx_p2 = fluid.create_random_int_lodtensor(
    lod, base_shape, place, low=0, high=word_dict_len - 1)
mark = fluid.create_random_int_lodtensor(
    lod, base_shape, place, low=0, high=mark_dict_len - 1)
# Construct feed as a dictionary of {feed_target_name: feed_target_data}
# and results will contain a list of data corresponding to fetch_targets.
assert feed_target_names[0] == 'word_data'
assert feed_target_names[1] == 'verb_data'
assert feed_target_names[2] == 'ctx_n2_data'
assert feed_target_names[3] == 'ctx_n1_data'
assert feed_target_names[4] == 'ctx_0_data'
assert feed_target_names[5] == 'ctx_p1_data'
assert feed_target_names[6] == 'ctx_p2_data'
assert feed_target_names[7] == 'mark_data'
results = exe.run(
    inference_program,
    feed={
        feed_target_names[0]: word,
        feed_target_names[1]: pred,
        feed_target_names[2]: ctx_n2,
        feed_target_names[3]: ctx_n1,
        feed_target_names[4]: ctx_0,
        feed_target_names[5]: ctx_p1,
        feed_target_names[6]: ctx_p2,
        feed_target_names[7]: mark
    },
    fetch_list=fetch_targets,
    return_numpy=False)
print(results[0].lod())
np_data = np.array(results[0])
print("Inference Shape: ", np_data.shape)
```
The `paddle.layer.crf_decoding` layer is used during inference, but unlike in training, its inputs do not include the ground truth label.

- The main entrance of the whole program is as below: