ResNet50 latency test based on CAPI is very slow
Created by: chuanqi129
Hi, I used the below cmake command to build paddle:
cmake3 .. -DWITH_DOC=OFF -DWITH_GPU=OFF -DWITH_DISTRIBUTE=OFF -DWITH_MKLDNN=ON -DWITH_GOLANG=OFF -DWITH_SWIG_PY=ON -DWITH_TESTING=ON -DWITH_INFERENCE_API_TEST=ON -DCMAKE_BUILD_TYPE=Release -DWITH_TIMER=OFF -DWITH_PROFILER=OFF -DWITH_FLUID_ONLY=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
and run below command for ResNet50 latency test based on CAPI:
FLAGS_use_mkldnn=true OMP_NUM_THREADS=40 KMP_AFFINITY=compact,granularity=fine taskset -c 0-40 numactl -l ./paddle/fluid/inference/tests/api/test_analyzer_resnet50 --infer_model=third_party/inference_demo/resnet50/model --repeat=100 --gtest_filter=Analyzer_resnet50.profile_mkldnn --batch_size=1 --num_threads=40
But the finally results are very slow like below log information:
- Running Analysis pass [DFG to fluid] W1112 14:13:24.766306 393684 data_flow_graph_to_fluid_pass.cc:70] parameter changes in the scope takes effect
- Running Analysis pass [dfg-to-fluid-debuger-pass] sh: dot: command not found I1112 14:13:24.786814 393684 analysis_predictor.cc:257] == optimize end == I1112 15:25:30.706657 393769 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 37, latency: 43259.1ms, fps: 0.0231165 ====== I1112 15:25:31.645990 393766 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 34, latency: 43268.4ms, fps: 0.0231115 ====== I1112 15:25:33.218940 393756 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 24, latency: 43284.2ms, fps: 0.0231031 ====== I1112 15:25:42.488104 393737 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 5, latency: 43376.9ms, fps: 0.0230537 ====== I1112 15:25:46.173593 393753 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 21, latency: 43413.7ms, fps: 0.0230342 ====== I1112 15:25:49.869393 393744 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 12, latency: 43450.7ms, fps: 0.0230146 ====== I1112 15:25:52.919145 393751 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 19, latency: 43481.2ms, fps: 0.0229984 ====== I1112 15:25:53.417090 393745 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 13, latency: 43486.2ms, fps: 0.0229958 ====== I1112 15:25:54.113740 393761 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 29, latency: 43493.1ms, fps: 0.0229921 ====== I1112 15:25:54.355783 393768 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 36, latency: 43495.5ms, fps: 0.0229909 ====== I1112 15:25:56.391198 393767 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 35, latency: 43515.9ms, fps: 0.0229801 ====== I1112 15:25:57.218600 393743 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 11, latency: 43524.2ms, fps: 0.0229757 ====== I1112 15:25:57.219207 393738 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 6, latency: 43524.2ms, fps: 0.0229757 ====== I1112 15:25:57.253047 393742 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 10, latency: 43524.5ms, fps: 0.0229755 ====== I1112 15:25:57.371704 393760 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 28, latency: 43525.7ms, fps: 0.0229749 ====== I1112 15:25:57.457932 393763 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 31, latency: 43526.6ms, fps: 0.0229745 ====== I1112 15:25:57.460876 393733 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 1, latency: 43526.6ms, fps: 0.0229744 ====== I1112 15:25:57.491451 393764 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 32, latency: 43526.9ms, fps: 0.0229743 ====== I1112 15:25:57.588882 393740 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 8, latency: 43527.8ms, fps: 0.0229738 ====== I1112 15:25:57.596644 393770 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 38, latency: 43527.8ms, fps: 0.0229738 ====== I1112 15:25:57.621752 393752 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 20, latency: 43528.1ms, fps: 0.0229737 ====== I1112 15:25:57.621911 393754 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 22, latency: 43528.1ms, fps: 0.0229737 ====== I1112 15:25:57.628367 393757 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 25, latency: 43528.3ms, fps: 0.0229736 ====== I1112 15:25:57.668081 393755 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 23, latency: 43528.7ms, fps: 0.0229734 ====== I1112 15:25:57.689358 393762 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 30, latency: 43528.9ms, fps: 0.0229733 ====== I1112 15:25:57.690102 393746 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 14, latency: 43528.9ms, fps: 0.0229732 ====== I1112 15:25:57.723378 393759 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 27, latency: 43529.2ms, fps: 0.0229731 ====== I1112 15:25:57.747210 393765 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 33, latency: 43529.5ms, fps: 0.0229729 ====== I1112 15:25:57.790575 393735 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 3, latency: 43529.8ms, fps: 0.0229728 ====== I1112 15:25:57.790573 393736 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 4, latency: 43529.8ms, fps: 0.0229728 ====== I1112 15:25:57.792763 393749 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 17, latency: 43529.8ms, fps: 0.0229728 ====== I1112 15:25:57.811703 393732 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 0, latency: 43530.1ms, fps: 0.0229726 ====== I1112 15:25:57.853924 393771 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 39, latency: 43530.5ms, fps: 0.0229724 ====== I1112 15:25:57.857890 393758 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 26, latency: 43530.6ms, fps: 0.0229724 ====== I1112 15:25:57.873034 393748 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 16, latency: 43530.7ms, fps: 0.0229723 ====== I1112 15:25:57.888835 393739 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 7, latency: 43530.9ms, fps: 0.0229722 ====== I1112 15:25:57.900784 393750 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 18, latency: 43531ms, fps: 0.0229721 ====== I1112 15:25:57.914562 393741 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 9, latency: 43531.2ms, fps: 0.0229721 ====== I1112 15:25:58.040720 393747 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 15, latency: 43532.4ms, fps: 0.0229714 ====== I1112 15:25:58.094480 393734 helper.h:162] ====== batch_size: 1, repeat: 100, threads: 40, thread id: 2, latency: 43533ms, fps: 0.0229711 ====== [ OK ] Analyzer_resnet50.profile_mkldnn (4519964 ms) -------- 1 test from Analyzer_resnet50 (4519964 ms total)
-------- Global test environment tear-down [==========] 1 test from 1 test case ran. (4519964 ms total) [ PASSED ] 1 test.
Does it make sense? Remember the result should be better than this.