[Question] Several questions about training and prediction
Closed
Created by: pkuyym
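- The demo includes an example of single-machine prediction through the SWIG API. How should prediction be run with the cluster version?
- When running simple_gru2 with a large minibatch size (reducing the size avoids the problem), the following messages appear and then the program exits abnormally.
The messages are as follows (training becomes very slow):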
I0119 10:18:58.880401 27190 Stat.h:221] Stat: [BpBiasTimer] fc_layer_0 [Span:22465ms117us]
I0119 10:19:16.421510 27190 Stat.h:221] Stat: [BackwardTimer] fc_layer_0 [Span:44247ms219us]
The error output at exit is as follows:
Thu Jan 19 10:22:40 2017[1,6]: @ 0x7f87586fa160 (unknown)
Thu Jan 19 10:22:40 2017[1,6]: @ 0xe45da3 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:41 2017[1,19]:./train.sh: line 207: 31795 Floating point exception PYTHONPATH=./paddle:$PYTHONPATH GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_trainer --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --trainer_id=${OMPI_COMM_WORLD_RANK} --pservers=$ipstring --rdma_tcp=${rdma_tcp} --nics=${nics} ${train_arg} --config=conf/trainer_config.conf --save_dir=./${save_dir} ${extern_arg}
Thu Jan 19 10:22:41 2017[1,19]:+ '[' 136 -ne 0 ']'
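Judging from the error, this is a floating-point exception. My guess is that some value became too small, but the problem does not occur with a smaller minibatch. Could it be that the gradient is normalized by the minibatch size, which makes it too small?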
Created by: pkuyym
More detailed information:
Thu Jan 19 10:22:39 2017[1,19]:Thread [140008687789952] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Thu Jan 19 10:22:39 2017[1,19]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,19]:PC: @ 0xe45db7 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,1]:Thread [140303555991424] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Thu Jan 19 10:22:39 2017[1,1]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,2]:Thread [140571655157632] Forwarding __simple_gru2_0___transform,
Thu Jan 19 10:22:39 2017[1,2]:embedding_0, label, bidword_seq,
Thu Jan 19 10:22:39 2017[1,2]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,1]:PC: @ 0xe45da3 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,11]:Thread [139701206407040] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Thu Jan 19 10:22:39 2017[1,11]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,0]:Thread [140096278001536] Forwarding __simple_gru2_0___transform, embedding_0, label,
Thu Jan 19 10:22:39 2017[1,0]:bidword_seq,
Thu Jan 19 10:22:39 2017[1,19]:*** SIGFPE (@0xe45db7) received by PID 31795 (TID 0x7f565019e780) from PID 14966199; stack trace: ***
Thu Jan 19 10:22:39 2017[1,0]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,19]: @ 0x7f564fd78160 (unknown)
Thu Jan 19 10:22:39 2017[1,19]: @ 0xe45db7 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,2]:PC: @ 0xe45db7 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,1]:*** SIGFPE (@0xe45da3) received by PID 21078 (TID 0x7f9af79d9780) from PID 14966179; stack trace: ***
Thu Jan 19 10:22:39 2017[1,1]: @ 0x7f9af75b3160 (unknown)
Thu Jan 19 10:22:39 2017[1,1]: @ 0xe45da3 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,11]:PC: @ 0xe45db7 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,0]:PC: @ 0xe45d87 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,16]:Thread [140201746683776] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Thu Jan 19 10:22:39 2017[1,16]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,14]:Thread [140114723051392] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Thu Jan 19 10:22:39 2017[1,14]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,17]:Thread [140726612678528] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Thu Jan 19 10:22:39 2017[1,17]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,18]:Thread [140137521420160] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Thu Jan 19 10:22:39 2017[1,18]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,2]:*** SIGFPE (@0xe45db7) received by PID 4826 (TID 0x7fd963923780) from PID 14966199; stack trace: ***
Thu Jan 19 10:22:39 2017[1,2]: @ 0x7fd9634fd160 (unknown)
Thu Jan 19 10:22:39 2017[1,2]: @ 0xe45db7 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,11]:*** SIGFPE (@0xe45db7) received by PID 3758 (TID 0x7f0eb8c85780) from PID 14966199; stack trace: ***
Thu Jan 19 10:22:39 2017[1,11]: @ 0x7f0eb885f160 (unknown)
Thu Jan 19 10:22:39 2017[1,11]: @ 0xe45db7 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,16]:PC: @ 0xe45dcf mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,14]:PC: @ 0xe45da3 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,18]:PC: @ 0xe45dc7 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,17]:PC: @ 0xe45d87 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,13]:Thread [140467527620480] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Thu Jan 19 10:22:39 2017[1,13]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,14]:*** SIGFPE (@0xe45da3) received by PID 11229 (TID 0x7f6f004b9780) from PID 14966179; stack trace: ***
Thu Jan 19 10:22:39 2017[1,14]: @ 0x7f6f00093160 (unknown)
Thu Jan 19 10:22:39 2017[1,3]:Thread [140153097013120] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Thu Jan 19 10:22:39 2017[1,3]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,14]: @ 0xe45da3 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,0]:*** SIGFPE (@0xe45d87) received by PID 27190 (TID 0x7f6ab4e27780) from PID 14966151; stack trace: ***
Thu Jan 19 10:22:39 2017[1,0]: @ 0x7f6ab4a01160 (unknown)
Thu Jan 19 10:22:39 2017[1,0]: @ 0xe45d87 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,13]:PC: @ 0xe45d87 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,3]:PC: @ 0xe45db7 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,17]:*** SIGFPE (@0xe45d87) received by PID 31933 (TID 0x7ffd77c25780) from PID 14966151; stack trace: ***
Thu Jan 19 10:22:39 2017[1,17]: @ 0x7ffd777ff160 (unknown)
Thu Jan 19 10:22:39 2017[1,18]:*** SIGFPE (@0xe45dc7) received by PID 10594 (TID 0x7f744f2f1780) from PID 14966215; stack trace: ***
Thu Jan 19 10:22:39 2017[1,17]: @ 0xe45d87 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,18]: @ 0x7f744eecb160 (unknown)
Thu Jan 19 10:22:39 2017[1,18]: @ 0xe45dc7 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,16]:*** SIGFPE (@0xe45dcf) received by PID 10411 (TID 0x7f83434ed780) from PID 14966223; stack trace: ***
Thu Jan 19 10:22:39 2017[1,13]:*** SIGFPE (@0xe45d87) received by PID 5793 (TID 0x7fc125161780) from PID 14966151; stack trace: ***
Thu Jan 19 10:22:39 2017[1,16]: @ 0x7f83430c7160 (unknown)
Thu Jan 19 10:22:39 2017[1,13]: @ 0x7fc124d3b160 (unknown)
Thu Jan 19 10:22:39 2017[1,16]: @ 0xe45dcf mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,13]: @ 0xe45d87 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,10]:Thread [140005277992832] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Thu Jan 19 10:22:39 2017[1,10]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,3]:*** SIGFPE (@0xe45db7) received by PID 12770 (TID 0x7f77ef8fc780) from PID 14966199; stack trace: ***
Thu Jan 19 10:22:39 2017[1,3]: @ 0x7f77ef4d6160 (unknown)
Thu Jan 19 10:22:39 2017[1,3]: @ 0xe45db7 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,10]:PC: @ 0xe45db7 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,9]:Thread [139777130837888] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Thu Jan 19 10:22:39 2017[1,9]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,10]:*** SIGFPE (@0xe45db7) received by PID 18844 (TID 0x7f5584dc8780) from PID 14966199; stack trace: ***
Thu Jan 19 10:22:39 2017[1,10]: @ 0x7f55849a2160 (unknown)
Thu Jan 19 10:22:39 2017[1,10]: @ 0xe45db7 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,9]:PC: @ 0xe45dcf mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,9]:*** SIGFPE (@0xe45dcf) received by PID 9915 (TID 0x7f20663b3780) from PID 14966223; stack trace: ***
Thu Jan 19 10:22:39 2017[1,9]: @ 0x7f2065f8d160 (unknown)
Thu Jan 19 10:22:39 2017[1,9]: @ 0xe45dcf mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,12]:Thread [140589504018304] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Thu Jan 19 10:22:39 2017[1,12]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,12]:PC: @ 0xe45dc7 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,12]:*** SIGFPE (@0xe45dc7) received by PID 8599 (TID 0x7fdd8b723780) from PID 14966215; stack trace: ***
Thu Jan 19 10:22:39 2017[1,12]: @ 0x7fdd8b2fd160 (unknown)
Thu Jan 19 10:22:39 2017[1,12]: @ 0xe45dc7 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:39 2017[1,5]:Thread [139923642955648] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Thu Jan 19 10:22:39 2017[1,5]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,15]:Thread [139967555884928] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Thu Jan 19 10:22:39 2017[1,15]:*** Aborted at 1484792559 (unix time) try "date -d @1484792559" if you are using GNU date ***
Thu Jan 19 10:22:39 2017[1,5]:PC: @ 0xe45d87 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:40 2017[1,15]:PC: @ 0xe45d87 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:40 2017[1,5]:*** SIGFPE (@0xe45d87) received by PID 31887 (TID 0x7f428308b780) from PID 14966151; stack trace: ***
Thu Jan 19 10:22:40 2017[1,5]: @ 0x7f4282c65160 (unknown)
Thu Jan 19 10:22:40 2017[1,5]: @ 0xe45d87 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:40 2017[1,7]:Thread [140408779712384] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Thu Jan 19 10:22:40 2017[1,7]:*** Aborted at 1484792560 (unix time) try "date -d @1484792560" if you are using GNU date ***
Thu Jan 19 10:22:40 2017[1,15]:*** SIGFPE (@0xe45d87) received by PID 29708 (TID 0x7f4cbc72d780) from PID 14966151; stack trace: ***
Thu Jan 19 10:22:40 2017[1,15]: @ 0x7f4cbc307160 (unknown)
Thu Jan 19 10:22:40 2017[1,15]: @ 0xe45d87 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:40 2017[1,7]:PC: @ 0xe45d95 mkl_blas_avx_sgemm_kernel_0
Thu Jan 19 10:22:40 2017[1,4]:Thread [140132209895296] Forwarding __simple_gru2_0___transform, embedding_0, label, bidword_seq,
Created by: reyoung
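The reason this FPE depends on the batch size can be found here: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer_config_helpers/optimizers.py#L375
Paddle's implementation of SGD differs from the usual formula in one respect.
In the usual formulation of SGD, the gradients produced by all samples in a mini-batch are summed and then divided by the number of samples, which gives the average per-sample gradient. That average gradient is then used to update the network.
Paddle does not compute the average per-sample gradient. It updates the network with the summed gradient of all samples in the mini-batch.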
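To make the difference concrete, here is a minimal sketch in plain Python. It is not Paddle code: the one-parameter model, the squared loss, and the helper names (`per_sample_grads`, `sgd_step_mean`, `sgd_step_sum`) are assumptions chosen only for illustration.

```python
def per_sample_grads(w, xs, ys):
    """Gradient of 0.5 * (w*x - y)**2 with respect to w, one value per sample."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]


def sgd_step_mean(w, xs, ys, lr):
    """Textbook mini-batch SGD: step along the average per-sample gradient."""
    grads = per_sample_grads(w, xs, ys)
    return w - lr * sum(grads) / len(grads)


def sgd_step_sum(w, xs, ys, lr):
    """Step along the summed gradient, as described above for Paddle."""
    return w - lr * sum(per_sample_grads(w, xs, ys))


# Hypothetical mini-batch of four samples drawn from y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w0, lr = 0.0, 0.01

w_mean = sgd_step_mean(w0, xs, ys, lr)
w_sum = sgd_step_sum(w0, xs, ys, lr)

# With the same learning rate, the summed-gradient step is batch_size times larger.
ratio = (w_sum - w0) / (w_mean - w0)
print(w_mean, w_sum, ratio)
```

With the same learning rate, the two update rules differ only by a factor equal to the number of samples in the mini-batch.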
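The reasons for doing it this way are:
- If you want the effect of the average gradient, adjusting the learning rate is enough.
- Most of the optimization algorithms popular today are adaptive, for example adam and rmsprop. For them the magnitude of the gradient matters little, while its direction matters a lot.
- Dividing by the number of samples would require synchronizing the sample count once more in the multi-machine case, which is cumbersome to implement. In addition, the number of samples changes in the mini-batch that ends each pass (epoch).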
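The first two reasons can be checked numerically. The sketch below is plain Python, not Paddle code. The gradient values are made up, and `adam_first_step` is a hypothetical helper that computes only the first bias-corrected Adam update for a scalar gradient, using the commonly quoted default hyperparameters.

```python
import math


def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update for a scalar gradient (with bias correction)."""
    m = (1 - beta1) * grad          # first-moment estimate after one step
    v = (1 - beta2) * grad * grad   # second-moment estimate after one step
    m_hat = m / (1 - beta1)         # bias-corrected first moment
    v_hat = v / (1 - beta2)         # bias-corrected second moment
    return lr * m_hat / (math.sqrt(v_hat) + eps)


# Hypothetical per-sample gradients for a mini-batch of four samples.
grads = [-2.0, -8.0, -18.0, -32.0]
batch_size = len(grads)
grad_sum = sum(grads)
grad_mean = grad_sum / batch_size

# Reason 1: a summed gradient with lr / batch_size equals a mean gradient with lr.
lr = 0.01
update_from_sum = (lr / batch_size) * grad_sum
update_from_mean = lr * grad_mean

# Reason 2: the first Adam step barely depends on the gradient scale.
adam_from_sum = adam_first_step(grad_sum)
adam_from_mean = adam_first_step(grad_mean)

print(update_from_sum, update_from_mean)
print(adam_from_sum, adam_from_mean)
```

For plain SGD, dividing the learning rate by the batch size recovers the averaged update exactly. For the Adam-style step, the summed and averaged gradients give almost the same update because the gradient is divided by an estimate of its own magnitude.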
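Because of this, increasing batch_size in Paddle makes the scale of the gradient larger, so each update subtracts a larger amount from the parameter value, and an FPE becomes more likely.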
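The effect can be reproduced on a toy problem. The following is a sketch in plain Python under stated assumptions: a one-parameter model, every sample equal to x = 1 with target y = 2, and a hypothetical helper `train_with_summed_gradient`. It illustrates only the numerical blow-up. Python floats do not raise SIGFPE, and whether overflow ends in an FPE as in the logs above depends on floating-point exception trapping being enabled in the trainer, which is assumed here rather than shown.

```python
def train_with_summed_gradient(batch_size, lr=0.1, steps=20):
    """Run gradient descent on 0.5 * (w*x - y)**2 using the summed batch gradient."""
    xs = [1.0] * batch_size   # hypothetical data: every sample has x = 1
    ys = [2.0] * batch_size   # and target y = 2, so the optimum is w = 2
    w = 0.0
    for _ in range(steps):
        grad_sum = sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= lr * grad_sum    # no division by batch_size
    return w


# Same learning rate, two batch sizes.
w_small = train_with_summed_gradient(batch_size=8)
w_large = train_with_summed_gradient(batch_size=64)

print(abs(w_small - 2.0))  # error shrinks towards zero
print(abs(w_large - 2.0))  # error grows by a constant factor at every step
```

In this toy setting the effective step size is lr times batch_size, so the update is stable for batch_size=8 (effective step 0.8) and diverges for batch_size=64 (effective step 6.4). Lowering the learning rate in proportion to the batch size restores stability.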