I1028 21:52:26.817137 9630 parallel_executor.cc:421] The number of CUDAPlace, which is used in ParallelExecutor, is 8. And the Program will be copied 8 copies
W1028 21:52:41.982228 9630 fuse_all_reduce_op_pass.cc:72] Find all_reduce operators: 401. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 255.
4 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::SaveOpKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::SaveOpKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::SaveOpKernel<paddle::platform:
I1029 10:38:26.419725 30194 parallel_executor.cc:421] The number of CUDAPlace, which is used in ParallelExecutor, is 8. And the Program will be copied 8 copies
W1029 10:38:48.046470 30194 fuse_all_reduce_op_pass.cc:72] Find all_reduce operators: 401. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 255.
I1029 10:43:41.409127 32687 parallel_executor.cc:421] The number of CUDAPlace, which is used in ParallelExecutor, is 8. And the Program will be copied 8 copies
W1029 10:44:03.299010 32687 fuse_all_reduce_op_pass.cc:72] Find all_reduce operators: 401. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 255.
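For context, these messages are printed when a program is compiled for multi-GPU data parallelism. A minimal sketch of the kind of setup that produces them, assuming the Paddle 1.x fluid API (the toy network below is illustrative only, not the program from these logs):

    import paddle.fluid as fluid

    # Toy program; any trainable network would do.
    x = fluid.layers.data(name='x', shape=[8], dtype='float32')
    loss = fluid.layers.mean(fluid.layers.fc(input=x, size=1))
    fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

    build_strategy = fluid.BuildStrategy()
    # Enables fuse_all_reduce_op_pass, which prints the fusion counts above.
    build_strategy.fuse_all_reduce_ops = True

    # with_data_parallel runs on ParallelExecutor, which copies the Program
    # once per visible CUDAPlace (8 copies in the logs above).
    compiled = fluid.CompiledProgram(fluid.default_main_program()) \
        .with_data_parallel(loss_name=loss.name, build_strategy=build_strategy)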
# if 'token_ids' in outputs:
#     val1 = len(outputs['token_ids'])     # batch dimension
#     val = _check_and_adapt_shape_dtype([val1], [[1], 'int64'])
#     results[outname_to_pos['batch_size']] = val
#     val2 = len(outputs['token_ids'][0])  # sequence length
#     val = _check_and_adapt_shape_dtype([val2], [[1], 'int64'])
#     results[outname_to_pos['seqlen']] = val
#     val = _check_and_adapt_shape_dtype([val1 * val2], [[1], 'int64'])
#     results[outname_to_pos['batchsize_x_seqlen']] = val
# else:
#     if not has_show_warn:
#         print('WARNING: token_ids not found in current batch, failed to yield batch_size, seqlen and batchsize_x_seqlen. (This message would be shown only once.)')
#         has_show_warn = True
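The helper _check_and_adapt_shape_dtype is not defined in this snippet. A hypothetical minimal version, assuming it only coerces the value to a NumPy array and validates it against the ([shape], dtype) form (this body is my assumption, not the library's actual code):

    import numpy as np

    def _check_and_adapt_shape_dtype(value, form):
        """Sketch: coerce `value` to an ndarray matching `form`,
        where `form` is a ([shape], dtype) pair such as [[1], 'int64']."""
        expected_shape, expected_dtype = form
        arr = np.asarray(value, dtype=expected_dtype)
        # Treat -1 in the expected shape as a wildcard dimension.
        if len(arr.shape) != len(expected_shape) or any(
                e != -1 and e != a for e, a in zip(expected_shape, arr.shape)):
            raise ValueError('shape %s does not match expected %s' %
                             (list(arr.shape), expected_shape))
        return arr

    # e.g. _check_and_adapt_shape_dtype([4], [[1], 'int64']) -> array([4])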