Unit tests test_CompareTwoNets and test_CompareSparse failed on NVIDIA DRIVE PX2.
Created by: Xreki
I built Paddle on NVIDIA DRIVE PX2 with WITH_GPU=ON, and most of the unit tests passed.
test_CompareTwoNets and test_CompareSparse both failed because of the same problem.
test_CompareTwoNets:
57: I0527 00:48:58.920536 18051 GradientMachine.cpp:92] Init parameters done.
57: I0527 00:48:59.326797 18051 test_CompareTwoNets.cpp:175]
57:
57: forwardBackward of the Network B is finished
57:
57: I0527 00:48:59.327240 18051 test_CompareTwoNets.cpp:120]
57: -------------------------------- Check Network Output_0: -------------------------------------
57: I0527 00:48:59.327301 18051 test_CompareTwoNets.cpp:104] maxValue=1.14861 maxDiff=0
57:
57: I0527 00:48:59.327356 18051 test_CompareTwoNets.cpp:120]
57: -------------------------------- Check Network Output_1: -------------------------------------
57: I0527 00:48:59.327606 18051 test_CompareTwoNets.cpp:104] maxValue=0.83593 maxDiff=0
57:
57: I0527 00:48:59.327639 18051 test_CompareTwoNets.cpp:134]
57:
57: -------------------------------- Check Gradient Machine Parameters: -------------------------------------
57: F0527 00:48:59.327821 18051 Allocator.h:51] Check failed: posix_memalign(&ptr, 32ul, size) == 0 (12 vs. 0)
57: *** Check failure stack trace: ***
57: @ 0x9bcf28 google::LogMessage::Fail()
57: @ 0x9be88c google::LogMessage::SendToLog()
57: @ 0x9bca24 google::LogMessage::Flush()
57: @ 0x9c01ec google::LogMessageFatal::~LogMessageFatal()
57: @ 0x808518 paddle::CpuAllocator::alloc()
57: @ 0x8073a8 paddle::PoolAllocator::alloc()
57: @ 0x800fc0 paddle::CpuMemoryHandle::CpuMemoryHandle()
57: @ 0x7f9c94 paddle::CpuVectorT<>::CpuVectorT()
57: @ 0x5cca7c compareGradient()
57: @ 0x5cdfe8 Trainer_create_Test::TestBody()
57: @ 0xad4cf4 testing::internal::HandleExceptionsInMethodIfSupported<>()
57: @ 0xacb3ec testing::Test::Run()
57: @ 0xacb528 testing::TestInfo::Run()
57: @ 0xacb634 testing::TestCase::Run()
57: @ 0xacd888 testing::internal::UnitTestImpl::RunAllTests()
57: @ 0xacdbb8 testing::UnitTest::Run()
57: @ 0x5b81b4 main
57: @ 0x7f8e8448a0 __libc_start_main
57: /home/ubuntu/liuyiqun01/Paddle/paddle/.set_python_path.sh: line 42: 18051 Aborted (core dumped) $@
1/1 Test #57: test_CompareTwoNets ..............***Failed 52.92 sec
test_CompareSparse:
59: [==========] Running 5 tests from 1 test case.
59: [----------] Global test environment set-up.
59: [----------] 5 tests from compareSparse
59: [ RUN ] compareSparse.cpu
59: I0527 00:51:39.578558 18880 test_CompareSparse.cpp:56] useGpu=0 trainerCount=1 configFile=trainer/tests/sample_trainer_config_qb_rnn.conf sparseUpdate=1
59: I0527 00:51:40.245077 18880 Trainer.cpp:114] ignore sparse_remote_update=true due to --local=true
59: I0527 00:51:40.245674 18880 Trainer.cpp:162] trainer mode: SgdSparseCpuTraining
59: I0527 00:51:42.028664 18880 ProtoDataProvider.cpp:55] load data file trainer/tests/data_bin_part
59: I0527 00:51:42.037497 18880 ProtoDataProvider.cpp:70] read done, num of instance=1000
59: I0527 00:51:42.037689 18880 ProtoDataProvider.cpp:367] slot0:avgNNZ=6.678; slot1:avgNNZ=5.47; slot2:avgNNZ=15.924; slot3:avgNNZ=12.808; slot4:avgNNZ=6.713; slot5:avgNNZ=5.489; slot6:avgNNZ=16.915; slot7:avgNNZ=13.482;
59: I0527 00:51:42.038173 18880 GradientMachine.cpp:85] Initing parameters..
59: I0527 00:52:03.085042 18880 GradientMachine.cpp:92] Init parameters done.
59: ..........I0527 00:52:08.090608 18880 CostLayer.cpp:337] calc pos/neg: 1.12314 pos= 529 neg= 471
59: I0527 00:52:08.090728 18880 TrainerInternal.cpp:181] Pass=0 Batch=10 samples=1000 AvgCost=0.859857 Eval:
59: I0527 00:52:08.094395 18880 GradientMachine.cpp:63] Saving parameters to ./output/model/pass-00000
59: I0527 00:52:13.917371 18880 test_CompareSparse.cpp:56] useGpu=0 trainerCount=1 configFile=trainer/tests/sample_trainer_config_qb_rnn.conf sparseUpdate=0
59: I0527 00:52:13.950443 18880 Trainer.cpp:114] ignore sparse_remote_update=true due to --local=true
59: I0527 00:52:13.950489 18880 Trainer.cpp:165] trainer mode: Normal
59: I0527 00:52:16.702788 18880 ProtoDataProvider.cpp:55] load data file trainer/tests/data_bin_part
59: I0527 00:52:16.708703 18880 ProtoDataProvider.cpp:70] read done, num of instance=1000
59: I0527 00:52:16.708886 18880 ProtoDataProvider.cpp:367] slot0:avgNNZ=6.678; slot1:avgNNZ=5.47; slot2:avgNNZ=15.924; slot3:avgNNZ=12.808; slot4:avgNNZ=6.713; slot5:avgNNZ=5.489; slot6:avgNNZ=16.915; slot7:avgNNZ=13.482;
59: I0527 00:52:16.709239 18880 GradientMachine.cpp:85] Initing parameters..
59: I0527 00:52:37.756114 18880 GradientMachine.cpp:92] Init parameters done.
59: ..........I0527 00:52:44.449095 18880 CostLayer.cpp:337] calc pos/neg: 1.12314 pos= 529 neg= 471
59: I0527 00:52:44.449199 18880 TrainerInternal.cpp:181] Pass=0 Batch=10 samples=1000 AvgCost=0.859857 Eval:
59: I0527 00:52:44.449470 18880 GradientMachine.cpp:63] Saving parameters to ./output/model/pass-00000
59: I0527 00:52:51.500138 18880 test_CompareSparse.cpp:115]
59:
59: -------------------------------- Check Gradient Machine Parameters: -------------------------------------
59: F0527 00:52:51.509215 18880 Allocator.h:51] Check failed: posix_memalign(&ptr, 32ul, size) == 0 (12 vs. 0)
59: *** Check failure stack trace: ***
59: @ 0x9fa038 google::LogMessage::Fail()
59: @ 0x9fb99c google::LogMessage::SendToLog()
59: @ 0x9f9b34 google::LogMessage::Flush()
59: @ 0x9fd2fc google::LogMessageFatal::~LogMessageFatal()
59: @ 0x8433c8 paddle::CpuAllocator::alloc()
59: @ 0x842258 paddle::PoolAllocator::alloc()
59: @ 0x83be70 paddle::CpuMemoryHandle::CpuMemoryHandle()
59: @ 0x834b44 paddle::CpuVectorT<>::CpuVectorT()
59: @ 0x5edea4 compareValue()
59: @ 0x5ef1ec compareSparse_cpu_Test::TestBody()
59: @ 0xb11ad4 testing::internal::HandleExceptionsInMethodIfSupported<>()
59: @ 0xb081cc testing::Test::Run()
59: @ 0xb08308 testing::TestInfo::Run()
59: @ 0xb08414 testing::TestCase::Run()
59: @ 0xb0a668 testing::internal::UnitTestImpl::RunAllTests()
59: @ 0xb0a998 testing::UnitTest::Run()
59: @ 0x5db0b4 main
59: @ 0x7f7d8e88a0 __libc_start_main
59: ./.common_test_util.sh: line 72: 18880 Aborted (core dumped) $cmd --$port_type=$port
59: /home/ubuntu/liuyiqun01/Paddle/build/paddle/trainer/tests/test_CompareSparse run wrong
1/1 Test #59: test_CompareSparse ...............***Failed 76.19 sec
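While running test_CompareTwoNets, I tracked the memory usage with the top command, and I suspect the failure was caused by memory exhaustion. This is consistent with the failed check in both logs: posix_memalign returned 12, which is ENOMEM (out of memory) on Linux.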
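To make the "(12 vs. 0)" in the failed check easier to read, here is a minimal C++ sketch of how a posix_memalign return code can be inspected instead of aborting. It is not Paddle's allocator code: tryAlignedAlloc and describeAllocError are hypothetical helper names introduced only for illustration, and the only real API used is posix_memalign together with the standard errno constants.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdlib.h>  // posix_memalign, free (POSIX)
#include <string>

// Hypothetical helper (not part of Paddle): attempt an aligned allocation
// and hand back the raw posix_memalign return code rather than aborting.
// On success the return value is 0 and *out points to the new block.
int tryAlignedAlloc(std::size_t alignment, std::size_t size, void** out) {
  *out = nullptr;
  return posix_memalign(out, alignment, size);
}

// Hypothetical helper: turn a posix_memalign return code into readable text.
// posix_memalign reports errors through its return value, not through errno.
// On Linux, ENOMEM has the value 12, matching "(12 vs. 0)" in the logs.
std::string describeAllocError(int rc) {
  if (rc == 0) return "ok";
  const char* name = "other";
  if (rc == ENOMEM) name = "ENOMEM";  // not enough memory available
  if (rc == EINVAL) name = "EINVAL";  // alignment is not a valid value
  return std::string(name) + ": " + std::strerror(rc);
}
```

Calling tryAlignedAlloc(32, size, &ptr) with the same 32-byte alignment that appears in the log and passing the result to describeAllocError would print the symbolic error name. That would help confirm whether the allocation in compareGradient() or compareValue() really fails for lack of memory on this board.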