Created by: dzhwinter

This change hacks the memory statistics into the profiler for memory debugging and benchmarking. While doing this work, I found that the profiler's output structure needed reworking to be readable.

Wrap the code to be profiled with `with profiler.profiler('CPU', 'total') as prof:`. Call `profiler.reset_profiler()` to clear all previously collected records.

A simple usage is as follows:
```python
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.profiler as profiler

image = fluid.layers.data(name='x', shape=[784], dtype='float32')
# ...
avg_cost = fluid.layers.mean(x=cost)
optimizer = fluid.optimizer.Momentum(learning_rate=0.001, momentum=0.9)
opts = optimizer.minimize(avg_cost)
accuracy = fluid.evaluator.Accuracy(input=predict, label=label)

place = fluid.CPUPlace()  # or fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
accuracy.reset(exe)

with profiler.profiler('CPU', 'total') as prof:
    for iter in range(10):
        if iter == 2:
            profiler.reset_profiler()
        x = np.random.random((32, 784)).astype("float32")
        y = np.random.randint(0, 10, (32, 1)).astype("int64")
        outs = exe.run(fluid.default_main_program(),
                       feed={'x': x, 'y': y},
                       fetch_list=[avg_cost] + accuracy.metrics)
        acc = np.array(outs[1])
        pass_acc = accuracy.eval(exe)
```
Please see https://github.com/dzhwinter/benchmark/pull/80 for a demo of this usage.
```
----------- Configuration Arguments -----------
batch_size: 16
data_format: NCHW
data_set: flowers
device: GPU
learning_rate: 0.001
pass_num: 1
------------------------------------------------

-------------------------> Profiling Report <-------------------------

Place: CUDA  Total Time: 0ms  Total Memory: 9065.64MB  Sorted by total time in descending order in the same thread

Event                          Calls  Total     Min.      Max.      Ave.       Total Memory.  Min Memory.  Max Memory.  Ave Memory.
thread0::elementwise_add_grad  16     206.254   0.037888  63.5791   12.8909    4648.4         0.00683594   196.001      290.525
thread0::conv2d_grad           13     112.87    2.88768   19.3096   8.68234    4727.98        0.00683594   196.141      363.691
thread0::dropout               10     77.7945   0.043072  35.1161   7.77945    1121.58        0.0629883    392          112.158
thread0::conv2d                13     55.5693   0.971776  12.2081   4.27456    337.579        6.12524      196          25.9676
thread0::batch_norm_grad       14     14.7364   0.074752  3.76525   1.0526     4649.8         0.0358887    196.001      332.129
thread0::batch_norm            14     13.8792   0.083968  4.15642   0.991369   729.579        0.0358887    196.001      52.1128
thread0::relu_grad             14     8.91693   0.027648  2.32448   0.636923   4649.76        0.0314941    196          332.126
thread0::elementwise_add       16     7.9863    0.027648  2.15555   0.499144   533.579        0.00634766   196          33.3487
thread0::relu                  14     7.06045   0.022528  1.92      0.504318   925.581        0.0314941    196          66.1129
thread0::pool2d_grad           5      5.06368   0.152576  2.57126   1.01274    4703.47        6.12524      196          940.693
thread0::dropout_grad          10     4.85581   0.034816  2.26202   0.485581   4649.73        0.0314941    196          464.973
thread0::adam                  60     3.85926   0.026624  0.975872  0.0643211  9065.61        0            0            151.094
thread0::fill_zeros_like       66     2.56307   0.017408  0.636928  0.0388344  4649.7         0.000488281  196          70.45
thread0::pool2d                5      1.72131   0.079872  0.797664  0.344262   2297.58        1.53149      49.0002      459.517
thread0::elementwise_mul       60     1.66064   0.0256    0.048128  0.0276773  9065.61        0.000244141  0.000244141  151.094
thread0::fill_constant         61     0.998304  0.014336  0.035744  0.0163656  4648.38        0.000244141  0.000244141  76.203
thread0::mul_grad              3      0.518048  0.067584  0.36352   0.172683   4648.4         0.230957     50.5317      1549.47
thread0::mul                   3      0.433088  0.041984  0.344     0.144363   4648.1         0.00634766   0.0314941    1549.37
thread0::fetch                 2      0.275616  0.027136  0.24848   0.137808   9065.64        0            0            4532.82
```
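Each data row pairs timing columns (Calls, Total, Min., Max., Ave., in ms) with memory columns (in MB). A small sketch of pulling the fields out of one such row (the field names are my own shorthand for the header columns above, not part of the profiler's output):

```python
# Parse one whitespace-separated report row into named fields.
FIELDS = ["event", "calls", "total_ms", "min_ms", "max_ms", "ave_ms",
          "total_mb", "min_mb", "max_mb", "ave_mb"]

def parse_row(line):
    parts = line.split()
    row = {"event": parts[0]}
    for name, value in zip(FIELDS[1:], parts[1:]):
        row[name] = float(value)
    return row

row = parse_row("thread0::elementwise_add_grad 16 206.254 0.037888 "
                "63.5791 12.8909 4648.4 0.00683594 196.001 290.525")
print(row["calls"], row["ave_ms"])  # 16.0 12.8909
```

As a sanity check on the columns, Total divided by Calls reproduces Ave. (206.254 / 16 ≈ 12.8909).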