CPU utilization is low in the PaddleRec/ctr model
Created by: lzha106
When running the CTR training model in PaddleRec/ctr on a single 48-core server, CPU utilization stays very low (about 200% on average) during training. The behavior is the same in local mode and in distributed mode.
- Cmd (Run in docker)
# export NUM_THREADS=20
# python train.py --train_data_path ./data/train.txt
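The log below prints `num threads= 0` even though `NUM_THREADS=20` was exported, so it may be worth confirming what the Python process actually sees. The following is a minimal sketch, not part of `train.py`: the helper name `read_thread_settings` is hypothetical, and the `CPU_NUM` variable and the fallback values are assumptions made for illustration.

```python
import multiprocessing
import os


def read_thread_settings(environ=os.environ):
    """Return (num_threads, cpu_num) as visible to this process.

    NUM_THREADS is the variable exported in the command above.
    CPU_NUM and both fallback values are assumptions of this sketch.
    """
    # Treat a missing or empty NUM_THREADS as 0.
    num_threads = int(environ.get("NUM_THREADS", "0") or 0)
    # Fall back to the number of logical cores when CPU_NUM is not set.
    cpu_num = int(environ.get("CPU_NUM", multiprocessing.cpu_count()))
    return num_threads, cpu_num


if __name__ == "__main__":
    print("NUM_THREADS=%d CPU_NUM=%d" % read_thread_settings())
```

Running this in the same shell and container as the training command shows whether the exported value reaches the process at all.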
- Env
- hub.baidubce.com/paddlepaddle/paddle:latest
- Log
2019-05-14 11:51:05,883-INFO: run local training
2019-05-14 11:51:05,884-INFO: num threads= 0
2019-05-14 11:51:05,884-INFO: cpu num = 48
ParallelExecutor is deprecated. Please use CompiledProgram and Executor. CompiledProgram is a central place for optimization and Executor is the unified executor. Example can be found in compiler.py.
W0514 11:51:05.886332 4342 graph.h:204] WARN: After a series of passes, the current graph can be quite different from OriginProgram. So, please avoid using the OriginProgram() method!
2019-05-14 11:51:14,050-INFO: TRAIN --> pass: 0 batch: 0 loss: 0.735086364746 auc: 0.503536689127, batch_auc: 0.504577935984
2019-05-14 11:51:16,557-INFO: TRAIN --> pass: 0 batch: 1 loss: 0.687324645996 auc: 0.503425833509, batch_auc: 0.502190308561
2019-05-14 11:51:25,562-INFO: TRAIN --> pass: 0 batch: 2 loss: 0.652316955566 auc: 0.49810737994, batch_auc: 0.511218654271
2019-05-14 11:51:28,201-INFO: TRAIN --> pass: 0 batch: 3 loss: 0.624634887695 auc: 0.492766345222, batch_auc: 0.516552284336
2019-05-14 11:51:36,096-INFO: TRAIN --> pass: 0 batch: 4 loss: 0.604645751953 auc: 0.491407935754, batch_auc: 0.521633943716
2019-05-14 11:51:38,219-INFO: TRAIN --> pass: 0 batch: 5 loss: 0.591483703613 auc: 0.49000279105, batch_auc: 0.516322304276
2019-05-14 11:51:46,069-INFO: TRAIN --> pass: 0 batch: 6 loss: 0.584241821289 auc: 0.489829021597, batch_auc: 0.519842867821
2019-05-14 11:51:48,183-INFO: TRAIN --> pass: 0 batch: 7 loss: 0.578797851562 auc: 0.490242734155, batch_auc: 0.524336059192
2019-05-14 11:51:56,606-INFO: TRAIN --> pass: 0 batch: 8 loss: 0.579120727539 auc: 0.490747555956, batch_auc: 0.526528383667
2019-05-14 11:51:58,903-INFO: TRAIN --> pass: 0 batch: 9 loss: 0.577897460938 auc: 0.491967175691, batch_auc: 0.531965010737
2019-05-14 11:52:06,682-INFO: TRAIN --> pass: 0 batch: 10 loss: 0.578246459961 auc: 0.493372356486, batch_auc: 0.52775468058
2019-05-14 11:52:08,846-INFO: TRAIN --> pass: 0 batch: 11 loss: 0.578046508789 auc: 0.495237219199, batch_auc: 0.531861205473
2019-05-14 11:52:17,071-INFO: TRAIN --> pass: 0 batch: 12 loss: 0.577464294434 auc: 0.497194691728, batch_auc: 0.536939430909
2019-05-14 11:52:19,243-INFO: TRAIN --> pass: 0 batch: 13 loss: 0.580472900391 auc: 0.499336215148, batch_auc: 0.539940881903
2019-05-14 11:52:27,277-INFO: TRAIN --> pass: 0 batch: 14 loss: 0.578672119141 auc: 0.501259237555, batch_auc: 0.545386792237
2019-05-14 11:52:29,362-INFO: TRAIN --> pass: 0 batch: 15 loss: 0.572959228516 auc: 0.503349640492, batch_auc: 0.546570044491
2019-05-14 11:52:37,339-INFO: TRAIN --> pass: 0 batch: 16 loss: 0.568415466309 auc: 0.505703493934, batch_auc: 0.553619374192
2019-05-14 11:52:39,439-INFO: TRAIN --> pass: 0 batch: 17 loss: 0.572214294434 auc: 0.507821561535, batch_auc: 0.553008822041
2019-05-14 11:52:47,451-INFO: TRAIN --> pass: 0 batch: 18 loss: 0.564864013672 auc: 0.509964511397, batch_auc: 0.556775745606
2019-05-14 11:52:49,572-INFO: TRAIN --> pass: 0 batch: 19 loss: 0.559789306641 auc: 0.512298856513, batch_auc: 0.570818080853
pass_id: 0, pass_time_cost: 105.293049
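The timestamps in this log show the gap between consecutive TRAIN lines alternating between roughly 2 s and 8 to 9 s. The sketch below computes those gaps from the first six TRAIN lines (copied from the log, metrics trimmed). The names `LOG`, `PATTERN`, and `batch_intervals` are hypothetical helpers for this illustration, and the script only measures the gaps; it does not explain their cause.

```python
import re
from datetime import datetime

# First six TRAIN lines from the log above, with auc/batch_auc trimmed.
LOG = """\
2019-05-14 11:51:14,050-INFO: TRAIN --> pass: 0 batch: 0 loss: 0.735086364746
2019-05-14 11:51:16,557-INFO: TRAIN --> pass: 0 batch: 1 loss: 0.687324645996
2019-05-14 11:51:25,562-INFO: TRAIN --> pass: 0 batch: 2 loss: 0.652316955566
2019-05-14 11:51:28,201-INFO: TRAIN --> pass: 0 batch: 3 loss: 0.624634887695
2019-05-14 11:51:36,096-INFO: TRAIN --> pass: 0 batch: 4 loss: 0.604645751953
2019-05-14 11:51:38,219-INFO: TRAIN --> pass: 0 batch: 5 loss: 0.591483703613
"""

# Captures the timestamp and the batch id of each TRAIN line.
PATTERN = re.compile(
    r"(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d{3})-INFO: TRAIN --> pass: \d+ batch: (\d+)"
)


def batch_intervals(text):
    """Return [(batch_id, seconds since the previous TRAIN line), ...]."""
    stamps = [
        (int(m.group(2)), datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f"))
        for m in PATTERN.finditer(text)
    ]
    return [
        (batch, (t - prev_t).total_seconds())
        for (_, prev_t), (batch, t) in zip(stamps, stamps[1:])
    ]


if __name__ == "__main__":
    for batch, seconds in batch_intervals(LOG):
        print("batch %d: %.3f s after previous" % (batch, seconds))
```

For these six lines the gaps are 2.507, 9.005, 2.639, 7.895, and 2.123 seconds. Feeding the full log through the same function would show whether the pattern holds for the whole pass.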
- Top info
top - 19:51:50 up 188 days, 3:56, 2 users, load average: 42.83, 42.64, 42.54
Tasks: 787 total, 2 running, 785 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.3 us, 0.2 sy, 0.0 ni, 95.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 19779323+total, 13814598+free, 11023284 used, 48623964 buff/cache
KiB Swap: 4194300 total, 3760116 free, 434184 used. 18493142+avail Mem

PID   USER PR NI VIRT    RES    SHR   S %CPU  %MEM TIME+   COMMAND
60425 root 20 0  14.730g 2.927g 45432 S 170.9 1.6  1:30.33 python
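To put the `%CPU` figure in context, the arithmetic below converts it into busy cores and into a share of the whole machine. This is a small sketch that assumes `top` is in its default mode, where a process's `%CPU` is measured relative to one core (so 100% is one fully busy core). The function names are hypothetical.

```python
def busy_cores(process_cpu_percent):
    """Cores' worth of CPU time used, assuming 100% equals one full core."""
    return process_cpu_percent / 100.0


def share_of_machine(process_cpu_percent, n_cores):
    """Fraction of the machine's total CPU capacity used by the process."""
    return process_cpu_percent / (100.0 * n_cores)


if __name__ == "__main__":
    cpu_percent = 170.9  # %CPU of the python process in the top output above
    n_cores = 48         # core count of the server
    print("busy cores: %.2f" % busy_cores(cpu_percent))
    print("share of machine: %.1f%%" % (100 * share_of_machine(cpu_percent, n_cores)))
```

With the values from this report, the training process keeps about 1.7 of the 48 cores busy, which is roughly 3.6% of the machine.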