test_concat_mkldnn_op and test_transpose_mkldnn_op fail randomly
Created by: luotao1
[18:34:44][Step 1/1] ======================================================================
[18:34:44][Step 1/1] ERROR: test_check_output (test_transpose_mkldnn_op.TestCase1a)
[18:34:44][Step 1/1] ----------------------------------------------------------------------
[18:34:44][Step 1/1] Traceback (most recent call last):
[18:34:44][Step 1/1] File "/paddle/build/python/paddle/fluid/tests/unittests/mkldnn/test_transpose_mkldnn_op.py", line 44, in test_check_output
[18:34:44][Step 1/1] self.check_output(no_check_set=['XShape'], check_dygraph=False)
[18:34:44][Step 1/1] File "/paddle/build/python/paddle/fluid/tests/unittests/op_test.py", line 1182, in check_output
[18:34:44][Step 1/1] equal_nan, check_dygraph)
[18:34:44][Step 1/1] File "/paddle/build/python/paddle/fluid/tests/unittests/op_test.py", line 968, in check_output_with_place
[18:34:44][Step 1/1] outs, fetch_list = self._calc_output(place, no_check_set=no_check_set)
[18:34:44][Step 1/1] File "/paddle/build/python/paddle/fluid/tests/unittests/op_test.py", line 552, in _calc_output
[18:34:44][Step 1/1] feed_map = self.feed_var(inputs, place)
[18:34:44][Step 1/1] File "/paddle/build/python/paddle/fluid/tests/unittests/op_test.py", line 309, in feed_var
[18:34:44][Step 1/1] tensor.set(self.inputs[var_name], place)
[18:34:44][Step 1/1] RuntimeError:
[18:34:44][Step 1/1]
[18:34:44][Step 1/1] --------------------------------------------
[18:34:44][Step 1/1] C++ Call Stacks (More useful to developers):
[18:34:44][Step 1/1] --------------------------------------------
[18:34:44][Step 1/1] 0 std::string paddle::platform::GetTraceBackString<std::string>(std::string&&, char const*, int)
[18:34:44][Step 1/1] 1 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
[18:34:44][Step 1/1] 2 paddle::memory::allocation::AllocatorFacade::Alloc(paddle::platform::Place const&, unsigned long)
[18:34:44][Step 1/1] 3 paddle::memory::allocation::AllocatorFacade::AllocShared(paddle::platform::Place const&, unsigned long)
[18:34:44][Step 1/1] 4 paddle::memory::AllocShared(paddle::platform::Place const&, unsigned long)
[18:34:44][Step 1/1] 5 paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsigned long)
[18:34:44][Step 1/1]
[18:34:44][Step 1/1] ----------------------
[18:34:44][Step 1/1] Error Message Summary:
[18:34:44][Step 1/1] ----------------------
[18:34:44][Step 1/1] ResourceExhaustedError:
[18:34:44][Step 1/1]
[18:34:44][Step 1/1] Out of memory error on GPU 0. Cannot allocate 480.000000B memory on GPU 0, available memory is only 0.000000B.
The stress-test machine has a 2650v4 CPU and 4 P4 GPU cards. It is unclear why MKLDNN-related unit tests need to allocate GPU memory.
@jczaja @lidanqing-intel Could you help take a look?
Paddle-Manylinux_PR_CI_Manylinux_Coverage_49776.log Paddle-Manylinux_PR_CI_Manylinux_Coverage_49775.log
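
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the traceback goes through check_output -> check_output_with_place -> feed_var and ends in CUDAAllocator, so the test harness appears to have passed a GPU place to this test. If the harness picks places based only on whether the build has CUDA and whether the op has a GPU kernel, an MKLDNN test would also run once on GPU 0 and could hit OOM when other jobs have filled the card. The sketch below illustrates that selection logic in plain Python. The function name, its parameters, and the string place names are hypothetical stand-ins, not Paddle's actual API.

```python
# Hypothetical sketch of place selection in an operator test harness.
# None of these names come from Paddle; they only illustrate the idea.

def select_places(compiled_with_cuda, op_supports_gpu, cpu_only=False):
    """Return the list of devices a test would be executed on."""
    places = ["CPUPlace"]  # every test runs on CPU
    # A GPU place is added whenever the build and the op allow it,
    # regardless of whether the test itself targets MKLDNN.
    if compiled_with_cuda and op_supports_gpu and not cpu_only:
        places.append("CUDAPlace(0)")
    return places


# On a CUDA build, an op with a GPU kernel (e.g. transpose) is also fed to GPU 0.
print(select_places(compiled_with_cuda=True, op_supports_gpu=True))
# Marking the test as CPU-only would keep it off the GPU.
print(select_places(compiled_with_cuda=True, op_supports_gpu=True, cpu_only=True))
```

If this is what happens, restricting the MKLDNN tests to the CPU place (for example by calling check_output_with_place with a CPU place only) would avoid the GPU allocation entirely. Whether that matches the real harness needs to be checked against op_test.py.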