Support large-scale CTR model with distributed lookup table [Phase 3]
Created by: jacquesqiao
Phase 1: https://github.com/PaddlePaddle/Paddle/issues/12008 Phase 2: https://github.com/PaddlePaddle/Paddle/issues/13205
table save/load/infer:
- Fix mkdir conflict in save_inference_model https://github.com/PaddlePaddle/Paddle/pull/14285 @seiriosPlus
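- selected_rows save/load currently stores the complete tensor, including rows that have no corresponding id, and load reads all of it back. Change this so that save writes only the rows that have a corresponding id, and load reads only that data.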
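
A minimal sketch of the proposed save/load behavior, in plain Python rather than Paddle code. It assumes that row `i` of the value buffer belongs to `rows[i]` and that any unused, pre-allocated rows sit at the tail of the buffer. The function names `dump_selected_rows` and `load_selected_rows` are hypothetical, and JSON stands in for the real serialization format.

```python
import json


def dump_selected_rows(rows, value):
    """Serialize only the rows that have an id.

    rows  -- list of ids currently stored in the table
    value -- list of row vectors; may be pre-allocated longer than `rows`
    """
    # Assumption: value[i] belongs to rows[i]; trailing rows have no id.
    used = value[:len(rows)]
    return json.dumps({"rows": rows, "value": used})


def load_selected_rows(payload):
    """Restore (rows, value) containing exactly the rows that were saved."""
    data = json.loads(payload)
    return data["rows"], data["value"]


if __name__ == "__main__":
    rows = [7, 42]
    value = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0], [0.0, 0.0]]
    payload = dump_selected_rows(rows, value)
    print(load_selected_rows(payload))
```

With this layout the saved size scales with the number of ids seen so far instead of with the allocated table size.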
Speed optimization
- Use the sparse method for parameter updates as well @jacquesqiao
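
As an illustration of what a sparse parameter update means here, the sketch below applies plain SGD only to the rows that received a gradient, instead of touching the whole table. It is a toy in pure Python, not the Paddle kernel. The function name `sparse_sgd_update` and the dict-based table are assumptions made for the example.

```python
def sparse_sgd_update(param, grad_rows, grad_values, lr):
    """Apply SGD only to the rows named in grad_rows.

    param       -- dict mapping id -> list of floats (the table)
    grad_rows   -- ids that received a gradient (may contain duplicates)
    grad_values -- one gradient vector per entry of grad_rows
    lr          -- learning rate
    """
    for row_id, grad in zip(grad_rows, grad_values):
        row = param[row_id]
        for j, g in enumerate(grad):
            # Duplicate ids accumulate, since each entry is applied in turn.
            row[j] -= lr * g
    return param


if __name__ == "__main__":
    table = {1: [1.0, 1.0], 2: [2.0, 2.0], 3: [3.0, 3.0]}
    sparse_sgd_update(table, [1, 1], [[1.0, 0.0], [1.0, 2.0]], lr=0.5)
    print(table)
```

The cost of one update is proportional to the number of gradient rows, not to the table height.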
dist table
- Dist table support multi table https://github.com/PaddlePaddle/Paddle/pull/14190 @jacquesqiao
- dist table support other optimize and regular config https://github.com/PaddlePaddle/Paddle/pull/14300 @jacquesqiao
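- Sparse update + regularization has a bug on the pserver side: when a gradient is empty its dim is [0], while the regularization term has dim [0, table_size], so the later shape check fails.
- Delete uncommon ids based on how frequently they appear, so the table does not grow too large and exceed dict_size.
- Add a test mode to distribute_lookup_table. At test time, ids that were never seen during training may appear; in that case the value returned for the id should be 0. https://github.com/PaddlePaddle/Paddle/pull/14387 @jacquesqiao
- Pass synchronization issue.
- sgd_op checks whether an id exceeds the selected_rows height; in auto-growth mode this check should not exist. https://github.com/PaddlePaddle/Paddle/pull/14389 @jacquesqiao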
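
Two of the items above, test-mode lookup and frequency-based deletion, can be sketched on a toy auto-growing table. This is plain Python and not the Paddle implementation. The class name `LookupTableSketch`, the zero initialization of new rows, and the `prune(max_size)` policy are all assumptions chosen for the example.

```python
from collections import Counter


class LookupTableSketch:
    """Toy auto-growing embedding table (hypothetical, for illustration)."""

    def __init__(self, width):
        self.width = width
        self.rows = {}         # id -> embedding vector
        self.freq = Counter()  # id -> number of training lookups

    def lookup(self, ids, is_test=False):
        out = []
        for i in ids:
            if i in self.rows:
                out.append(list(self.rows[i]))
            elif is_test:
                # Unseen id at test time: return zeros, do not grow the table.
                out.append([0.0] * self.width)
            else:
                # Training: auto-grow. Zero init is an assumption of this sketch.
                self.rows[i] = [0.0] * self.width
                out.append(list(self.rows[i]))
            if not is_test:
                self.freq[i] += 1
        return out

    def prune(self, max_size):
        """Keep only the max_size most frequently seen ids."""
        if len(self.rows) <= max_size:
            return
        keep = {i for i, _ in self.freq.most_common(max_size)}
        self.rows = {i: v for i, v in self.rows.items() if i in keep}
        self.freq = Counter({i: c for i, c in self.freq.items() if i in keep})


if __name__ == "__main__":
    table = LookupTableSketch(width=2)
    table.lookup([1, 1, 1, 2, 2, 3])             # training lookups grow the table
    print(table.lookup([99], is_test=True))      # unseen id at test time
    table.prune(max_size=2)                      # drops the least frequent id
    print(sorted(table.rows))
```

Because ids are created on demand during training, there is no fixed height to compare an id against, which is also why a height check like the one in sgd_op does not fit auto-growth mode.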
Performance optimization
- [wip]add multi thread c++ reader. https://github.com/PaddlePaddle/Paddle/pull/13983 @jacquesqiao
Feature improvements
- Optimize thread pool https://github.com/PaddlePaddle/Paddle/pull/14259 @jacquesqiao
- clean rpc server profiler https://github.com/PaddlePaddle/Paddle/pull/14257 @jacquesqiao
- add bilinear_tensor_product(tensor_layer in V2) layer https://github.com/PaddlePaddle/Paddle/pull/14336/ @jacquesqiao
- Graceful shutdown: fix the case where a killed job reports a double free, so the real cause of the failure is unclear
- support selu
- fix prelu layer
- gaussian_random support init selected_rows