How to stack GRU layers in the seqToseq_net.py demo example?
Created by: alvations
In the seqToseq demo, the main code that implements the decoder is:
def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
    decoder_mem = memory(
        name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)

    context = simple_attention(
        encoded_sequence=enc_vec,
        encoded_proj=enc_proj,
        decoder_state=decoder_mem, )

    with mixed_layer(size=decoder_size * 3) as decoder_inputs:
        decoder_inputs += full_matrix_projection(input=context)
        decoder_inputs += full_matrix_projection(input=current_word)

    gru_step = gru_step_layer(
        name='gru_decoder',
        input=decoder_inputs,
        output_mem=decoder_mem,
        size=decoder_size)

    with mixed_layer(
            size=target_dict_dim, bias_attr=True,
            act=SoftmaxActivation()) as out:
        out += full_matrix_projection(input=gru_step)

    return out
How do I add more layers to the decoder, like in the Google NMT paper?

From my understanding of the documentation, the recurrent_group at https://github.com/baidu/Paddle/blob/develop/demo/seqToseq/seqToseq_net.py#L156 is where Paddle steps through the sequence one timestep at a time, so the extra layers should be added inside gru_decoder_with_attention(). Is that right?
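For reference, the step function gets plugged into the recurrent group roughly like this (simplified from the training branch of seqToseq_net.py; variable names may not match the file exactly):

group_inputs = [StaticInput(input=encoded_vector, is_seq=True),
                StaticInput(input=encoded_proj, is_seq=True),
                trg_embedding]

# recurrent_group calls gru_decoder_with_attention once per target timestep
decoder = recurrent_group(name=decoder_group_name,
                          step=gru_decoder_with_attention,
                          input=group_inputs)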
But I'm not sure what gru_step_layer() is doing at https://github.com/baidu/Paddle/blob/develop/demo/seqToseq/seqToseq_net.py#L124 . There isn't much about it in the documentation; from the Python code it looks like a thin wrapper around the gru_step config, which is implemented in GruStepLayer.cpp:
/**
 * @brief GruStepLayer is like GatedRecurrentLayer, but used in a recurrent
 * layer group. GruStepLayer takes 2 input layers.
 * input[0] has size * 3 and is divided into 3 equal parts: (xz_t, xr_t, xi_t).
 * input[1] has size: {prev_out}.
 * parameter and biasParameter are also divided into 3 equal parts:
 * parameter consists of (U_z, U_r, U)
 * biasParameter consists of (bias_z, bias_r, bias_o)
 * \f[
 * update \ gate: z_t = actGate(xz_t + U_z * prev_out + bias_z) \\
 * reset \ gate: r_t = actGate(xr_t + U_r * prev_out + bias_r) \\
 * output \ candidate: {h}_t = actNode(xi_t + U * dot(r_t, prev_out) + bias_o) \\
 * output: h_t = dot((1-z_t), prev_out) + dot(z_t, {h}_t)
 * \f]
 * @note
 * dot denotes "element-wise multiplication".
 * actNode is defined by config active_type.
 * actGate is defined by config active_gate_type.
 * The config file api is gru_step_layer.
 */
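If I read those equations right, gru_step_layer computes a single GRU timestep from the pre-projected input (the decoder_size * 3 mixed_layer output) and the previous state held in output_mem. In plain numpy that update would be roughly the following (sigmoid/tanh here are just the usual gate/node activations, not read from any config):

import numpy as np

def gru_step(x3, prev_out, U_z, U_r, U, b_z, b_r, b_o):
    """One GRU step following the GruStepLayer comment above.

    x3       : concatenation (xz_t, xr_t, xi_t), i.e. the decoder_size * 3
               mixed_layer output that already holds the input projections.
    prev_out : h_{t-1}, the previous decoder state (output_mem).
    """
    size = prev_out.shape[-1]
    xz, xr, xi = x3[:size], x3[size:2 * size], x3[2 * size:]

    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    z = sigmoid(xz + U_z.dot(prev_out) + b_z)         # update gate
    r = sigmoid(xr + U_r.dot(prev_out) + b_r)         # reset gate
    h_cand = np.tanh(xi + U.dot(r * prev_out) + b_o)  # output candidate
    return (1.0 - z) * prev_out + z * h_cand          # new state h_t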
When I tried to stack the layers like this:
def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
    decoder_mem = memory(name='gru_decoder',
                         size=decoder_size,
                         boot_layer=decoder_boot)

    context = simple_attention(encoded_sequence=enc_vec,
                               encoded_proj=enc_proj,
                               decoder_state=decoder_mem, )

    with mixed_layer(size=decoder_size * 3) as decoder_inputs:
        decoder_inputs += full_matrix_projection(input=context)
        decoder_inputs += full_matrix_projection(input=current_word)

    gru_step = gru_step_layer(name='gru_decoder',
                              input=decoder_inputs,
                              output_mem=decoder_mem,
                              size=decoder_size)

    with mixed_layer(size=decoder_size * 3) as decoder_inputs:
        decoder_inputs += full_matrix_projection(input=context)
        decoder_inputs += full_matrix_projection(input=gru_step)

    gru_step = gru_step_layer(name='gru_decoder',
                              input=decoder_inputs,
                              output_mem=decoder_mem,
                              size=decoder_size)

    with mixed_layer(size=decoder_size,
                     bias_attr=True,
                     act=SoftmaxActivation()) as out:
        out += full_matrix_projection(input=gru_step)

    return out
It throws an error complaining that there is no input sequence:
$ bash train.sh
I1117 14:37:45.858877 13026 Util.cpp:155] commandline: /home/ltan/Paddle/binary/bin/../opt/paddle/bin/paddle_trainer --config=train.conf --save_dir=/home/ltan/Paddle/demo/ibot/model-sub --use_gpu=true --num_passes=100 --show_parameter_stats_period=1000 --trainer_count=4 --log_period=10 --dot_period=5
I1117 14:37:51.345690 13026 Util.cpp:130] Calling runInitFunctions
I1117 14:37:51.345918 13026 Util.cpp:143] Call runInitFunctions done.
[WARNING 2016-11-17 14:37:51,529 layers.py:1133] You are getting the first instance for a time series, and it is a normal recurrent layer output. There is no time series information at all. Maybe you want to use last_seq instead.
[WARNING 2016-11-17 14:37:51,533 default_decorators.py:40] please use keyword arguments in paddle config.
[WARNING 2016-11-17 14:37:51,539 default_decorators.py:40] please use keyword arguments in paddle config.
[WARNING 2016-11-17 14:37:51,540 default_decorators.py:40] please use keyword arguments in paddle config.
[WARNING 2016-11-17 14:37:51,540 default_decorators.py:40] please use keyword arguments in paddle config.
[WARNING 2016-11-17 14:37:51,540 default_decorators.py:40] please use keyword arguments in paddle config.
[WARNING 2016-11-17 14:37:51,540 default_decorators.py:40] please use keyword arguments in paddle config.
[INFO 2016-11-17 14:37:51,543 networks.py:1125] The input order is [source_language_word, target_language_word, target_language_next_word]
[INFO 2016-11-17 14:37:51,543 networks.py:1132] The output order is [__cost_0__]
I1117 14:37:51.551156 13026 Trainer.cpp:170] trainer mode: Normal
I1117 14:37:51.552254 13026 MultiGradientMachine.cpp:108] numLogicalDevices=1 numThreads=4 numDevices=4
I1117 14:37:51.656347 13026 PyDataProvider2.cpp:247] loading dataprovider dataprovider::process
[INFO 2016-11-17 14:37:51,656 dataprovider.py:27] src dict len : 10000
[INFO 2016-11-17 14:37:51,656 dataprovider.py:37] trg dict len : 7116
I1117 14:37:51.676383 13026 PyDataProvider2.cpp:247] loading dataprovider dataprovider::process
[INFO 2016-11-17 14:37:51,676 dataprovider.py:27] src dict len : 10000
[INFO 2016-11-17 14:37:51,676 dataprovider.py:37] trg dict len : 7116
I1117 14:37:51.676964 13026 GradientMachine.cpp:134] Initing parameters..
I1117 14:37:52.880277 13026 GradientMachine.cpp:141] Init parameters done.
F1117 14:37:53.244454 13049 GatedRecurrentLayer.cpp:79] Check failed: input.sequenceStartPositions
*** Check failure stack trace: ***
@ 0x7f3acc440daa (unknown)
@ 0x7f3acc440ce4 (unknown)
@ 0x7f3acc4406e6 (unknown)
@ 0x7f3acc443687 (unknown)
@ 0x5a9d7c paddle::GatedRecurrentLayer::forward()
@ 0x66c220 paddle::NeuralNetwork::forward()
@ 0x65cf6f paddle::RecurrentGradientMachine::forward()
@ 0x5ef64a paddle::RecurrentLayerGroup::forward()
@ 0x66c220 paddle::NeuralNetwork::forward()
@ 0x672617 paddle::TrainerThread::forward()
@ 0x674935 paddle::TrainerThread::computeThread()
@ 0x7f3acbfbda60 (unknown)
@ 0x7f3accff9184 start_thread
F1117 14:37:53.255451 13041 GatedRecurrentLayer.cpp:79] Check failed: input.sequenceStartPositions
*** Check failure stack trace: ***
@ 0x7f3acb72537d (unknown)
@ 0x7f3acc440daa (unknown)
@ (nil) (unknown)
/home/ltan/Paddle/binary/bin/paddle: line 81: 13026 Aborted (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}
I'm not sure what is wrong here. **Is it because my inputs to gru_step_layer are wrong?**

How should the GRU layer stacking be done in the seqToseq GRU decoder?
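In case it clarifies what I'm aiming for: my untested guess is that each stacked GRU needs its own memory and its own layer name, rather than reusing name='gru_decoder' and decoder_mem for both gru_step_layer calls. Something like the sketch below, where the 'gru_decoder_2' name and the reuse of decoder_boot as the second boot layer are just my guesses:

# (this would go inside gru_decoder_with_attention, after the first gru_step)
decoder_mem_2 = memory(name='gru_decoder_2',
                       size=decoder_size,
                       boot_layer=decoder_boot)

# feed the first GRU's output into the second GRU
with mixed_layer(size=decoder_size * 3) as decoder_inputs_2:
    decoder_inputs_2 += full_matrix_projection(input=gru_step)

gru_step_2 = gru_step_layer(name='gru_decoder_2',
                            input=decoder_inputs_2,
                            output_mem=decoder_mem_2,
                            size=decoder_size)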