PaddleSlim search process gets interrupted
Created by: lijiancheng0614
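I ran some PaddleSlim experiments and mainly ran into the following two problems:
- When the search reaches a large network, "Cannot malloc XXX MB GPU memory." appears after "Running evaluation", whereas small networks do not trigger the error. Since there is no way to know in advance how large the networks sampled during the search will be, I would like the search to detect this error, skip the offending network, and continue searching.
- An unexplained "Segmentation fault". It occurred when searching on a cluster, and the related log does not show why the run was interrupted: the same pipeline ran without problems earlier in the experiment, and for the network currently being searched, some of the training and evaluation batches complete successfully: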
INFO:paddle.fluid.contrib.slim.core.compressor:Finish evaluation
2019-07-13 15:19:25,898-INFO: epoch:4603; batch_id:0; ['loss'] = [1.663]
INFO:paddle.fluid.contrib.slim.core.compressor:epoch:4603; batch_id:0; ['loss'] = [1.663]
2019-07-13 15:19:39,971-INFO: epoch:4603; batch_id:20; ['loss'] = [1.694]
INFO:paddle.fluid.contrib.slim.core.compressor:epoch:4603; batch_id:20; ['loss'] = [1.694]
2019-07-13 15:19:54,306-INFO: epoch:4603; batch_id:40; ['loss'] = [1.583]
INFO:paddle.fluid.contrib.slim.core.compressor:epoch:4603; batch_id:40; ['loss'] = [1.583]
2019-07-13 15:20:01,012-INFO: Running evaluation
INFO:paddle.fluid.contrib.slim.core.compressor:Running evaluation
2019-07-13 15:20:02,553-INFO: batch-0; ['acc_top1', 'acc_top5']=[0.46875, 0.92578125]
INFO:paddle.fluid.contrib.slim.core.compressor:batch-0; ['acc_top1', 'acc_top5']=[0.46875, 0.92578125]
2019-07-13 15:20:03,057-INFO: Final eval result: ['acc_top1', 'acc_top5']=[0.43929303 0.9280546 ]
INFO:paddle.fluid.contrib.slim.core.compressor:Final eval result: ['acc_top1', 'acc_top5']=[0.43929303 0.9280546 ]
2019-07-13 15:20:03,057-INFO: Finish evaluation
INFO:paddle.fluid.contrib.slim.core.compressor:Finish evaluation
2019-07-13 15:20:04,224-INFO: epoch:4604; batch_id:0; ['loss'] = [1.544]
INFO:paddle.fluid.contrib.slim.core.compressor:epoch:4604; batch_id:0; ['loss'] = [1.544]
2019-07-13 15:20:18,097-INFO: epoch:4604; batch_id:20; ['loss'] = [1.598]
INFO:paddle.fluid.contrib.slim.core.compressor:epoch:4604; batch_id:20; ['loss'] = [1.598]
As shown above, the log stops at batch_id:20, whereas the earlier log shows that batch_id:40 should still follow before evaluation.
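For reference, the "skip the failing network and keep searching" behaviour requested in the first problem could look roughly like the sketch below. It is not PaddleSlim API: `evaluate_isolated`, `search`, and `toy_evaluate` are hypothetical names, and `evaluate` stands for whatever callable trains and evaluates one candidate network. Each candidate runs in a child process, so a Python exception, an out-of-memory failure, or a segmentation fault ends only the child. The sketch assumes Linux (the `fork` start method) and that the parent process has not initialized CUDA before forking.

```python
import multiprocessing as mp
import os
import signal


def _run_candidate(evaluate, candidate, send_end):
    # Runs in the child process; a crash here does not take down the parent.
    send_end.send(evaluate(candidate))
    send_end.close()


def evaluate_isolated(evaluate, candidate):
    """Evaluate one candidate in a child process.

    Returns the score, or None if the child raised an exception
    or was killed by a signal such as SIGSEGV.
    """
    # Assumption: Linux, and no CUDA context exists in the parent yet.
    ctx = mp.get_context("fork")
    recv_end, send_end = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_run_candidate,
                       args=(evaluate, candidate, send_end))
    proc.start()
    # The parent keeps only the read end, so recv() sees EOF if the child dies.
    send_end.close()
    try:
        score = recv_end.recv()
    except EOFError:
        score = None
    proc.join()
    recv_end.close()
    return score if proc.exitcode == 0 else None


def search(candidates, evaluate):
    """Score every candidate, skipping the ones whose evaluation failed."""
    results = {}
    for candidate in candidates:
        score = evaluate_isolated(evaluate, candidate)
        if score is None:
            continue  # skip this network and continue the search
        results[candidate] = score
    return results


def toy_evaluate(candidate):
    # Hypothetical stand-in for "train and evaluate one network".
    if candidate == "too_big":
        raise MemoryError("Cannot malloc GPU memory")
    if candidate == "crash":
        os.kill(os.getpid(), signal.SIGSEGV)  # simulate a segmentation fault
    return len(candidate)
```

Calling `search(["small", "too_big", "crash", "ok"], toy_evaluate)` keeps only the two candidates that finish. Because the child exits after every candidate, memory it allocated is returned to the system, at the cost of process start-up and model re-initialization for each network.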