Support large-scale CTR model with distributed lookup table [Phase 1]
Created by: jacquesqiao
Tasks:
- distributed AUC calculation @seiriosPlus
- profile data reading time during training.
- support AUC calculation in C++ @jacquesqiao
- experiment with async training.
- larger-scale cluster (100+ machines) and large-scale data (10T+)
- test parameter server save/load for a large distributed lookup table.
- optimize SelectedRows for a large number of ids. @jacquesqiao https://github.com/PaddlePaddle/Paddle/pull/12431
- PaddleCloud: support multi-level directories on HDFS @paddlecloud (done, needs verification)
- save the lookup table when saving the model @seiriosPlus
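- use parallel_executor to train the model on CPU @jacquesqiao
  - Clear speedup: with 5 cores, throughput rose from 658.100209736 to 2407.47938408 samples/second (about 3.66x).
  - After the speedup, the Python AUC calculation takes about as long as training itself, which makes it even more of a bottleneck.
  - measure the speedup ratio of parallel_executor
- parallel_executor reduce mode: support dist lookup table https://github.com/PaddlePaddle/Paddle/pull/12535
- dist lookup table: support optimizers other than SGD https://github.com/PaddlePaddle/Paddle/pull/12544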
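
To illustrate the distributed AUC task above: one common approach is for each worker to bucket its predictions into a fixed histogram of positive and negative counts, so the per-worker statistics can be summed and AUC computed once over the merged counts. The sketch below assumes scores in [0, 1]. The names `bucket_counts`, `merge_counts`, `auc_from_counts` and `NUM_BUCKETS` are hypothetical and are not Paddle's API, and this is not necessarily how the AUC op in the linked PRs is implemented.

```python
NUM_BUCKETS = 200  # hypothetical histogram resolution; larger means a finer AUC estimate


def bucket_counts(scores, labels, num_buckets=NUM_BUCKETS):
    """Per-worker statistics: positive and negative counts per score bucket."""
    pos = [0] * (num_buckets + 1)
    neg = [0] * (num_buckets + 1)
    for score, label in zip(scores, labels):
        idx = int(score * num_buckets)  # assumes 0.0 <= score <= 1.0
        if label:
            pos[idx] += 1
        else:
            neg[idx] += 1
    return pos, neg


def merge_counts(all_counts):
    """Element-wise sum of the histograms reported by every worker."""
    pos = [sum(col) for col in zip(*(p for p, _ in all_counts))]
    neg = [sum(col) for col in zip(*(n for _, n in all_counts))]
    return pos, neg


def auc_from_counts(pos, neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Pairs that fall into the same bucket count as half a correct pair.
    """
    total_pos, total_neg = sum(pos), sum(neg)
    if total_pos == 0 or total_neg == 0:
        return 0.5  # AUC is undefined with a single class; 0.5 is a placeholder
    correct_pairs = 0.0
    seen_pos = 0  # positives in strictly higher buckets
    for p, n in zip(reversed(pos), reversed(neg)):
        correct_pairs += n * (seen_pos + p / 2.0)
        seen_pos += p
    return correct_pairs / (total_pos * total_neg)


# Two simulated workers, each holding part of the evaluation data.
worker_a = bucket_counts([0.9, 0.4], [1, 0])
worker_b = bucket_counts([0.3, 0.6], [1, 0])
global_auc = auc_from_counts(*merge_counts([worker_a, worker_b]))
```

Because only the fixed-size histograms need to be exchanged, the communication cost does not depend on the number of samples each worker evaluates.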
Enhancement
- parallel_executor should return one value per fetch target https://github.com/PaddlePaddle/Paddle/issues/12060 @chengduoZH
Feature
- distributed training: load parameters https://github.com/PaddlePaddle/Paddle/issues/12167
Performance
- use the prefetch op instead of receiving the dense parameter for the embedding layer.
- TensorSlice structure
- RPC fusion
- sparse gradient fusion
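
To illustrate the prefetch idea in the list above: rather than receiving the whole dense embedding parameter, a trainer can request only the rows for the ids in the current batch from whichever shard holds them. The sketch below simulates this in plain Python. The `id % num_shards` placement and the names `shard_of`, `build_shards` and `prefetch` are assumptions made for illustration, not Paddle's actual op or RPC interface.

```python
def shard_of(row_id, num_shards):
    """Hypothetical placement rule: rows are spread over shards by id modulo."""
    return row_id % num_shards


def build_shards(table, num_shards):
    """Split a {row_id: row} table into one dict per (simulated) parameter server."""
    shards = [dict() for _ in range(num_shards)]
    for row_id, row in table.items():
        shards[shard_of(row_id, num_shards)][row_id] = row
    return shards


def prefetch(shards, ids):
    """Fetch only the rows needed by this batch.

    Ids are deduplicated and grouped per shard, so each shard receives at
    most one request; the result follows the original order of `ids`.
    """
    requests = {}
    for row_id in set(ids):
        requests.setdefault(shard_of(row_id, len(shards)), []).append(row_id)
    fetched = {}
    for shard_idx, shard_ids in requests.items():
        # This inner loop stands in for a single RPC to shard `shard_idx`.
        for row_id in shard_ids:
            fetched[row_id] = shards[shard_idx][row_id]
    return [fetched[row_id] for row_id in ids]


# A toy table with 10 rows of width 2, spread over 3 simulated shards.
table = {i: [float(i), float(i) * 2] for i in range(10)}
shards = build_shards(table, 3)
batch_rows = prefetch(shards, [7, 2, 7])
```

With this scheme the data transferred per batch scales with the number of distinct ids in the batch, not with the size of the table.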
Optimize profiling tools
- Add profiler to pserver https://github.com/PaddlePaddle/Paddle/pull/12456 @jacquesqiao
- optimize profiler https://github.com/PaddlePaddle/Paddle/pull/12541
Bugs:
- distributed training gRPC error "Socket closed". https://github.com/PaddlePaddle/Paddle/issues/12054
  - Cause: the pserver crashed, but its logs were lost. The error has not recurred for some time, so the issue is closed for now.
- Adam op bugs https://github.com/PaddlePaddle/Paddle/issues/12055
  - fixed in the develop branch. @jacquesqiao
- Distribute transpiler handle adam accumulator https://github.com/PaddlePaddle/Paddle/pull/12123
- The implementation of Adam is problematic. https://github.com/PaddlePaddle/Paddle/issues/12068 @jacquesqiao https://github.com/PaddlePaddle/Paddle/pull/12103
- Fix AUC op @typhoonzero https://github.com/PaddlePaddle/Paddle/pull/12087
- FLAGS_rpc_deadline has no effect https://github.com/PaddlePaddle/Paddle/issues/12110 @Yancey1989
- gRPC "os_error":"Protocol not available" https://github.com/PaddlePaddle/Paddle/issues/12111 This issue produces a log message, but training proceeds normally.
- check the accuracy of ParallelExecutor in a distributed environment. @seiriosPlus https://github.com/PaddlePaddle/Paddle/issues/12096
- Wrong behavior of bcast on ParallelExecutor + CPU @Yancey1989 #12120
- parallel_executor cannot run the model saved by save_inference_model https://github.com/PaddlePaddle/Paddle/issues/12187 @chengduoZH
Future features:
- distributed lookup table: support flexible rows.
- fault tolerance for the lookup table.
- fast fault recovery for the lookup table.
- distributed monitoring and analysis system.