Created by: chenwhql
This PR speeds up the dygraph DataLoader by adding two mechanisms:
- LoDTensor shared memory
  - The data of a LoDTensor created by a DataLoader child process is moved into shared memory.
  - The LoDTensor can then be reconstructed in another process from the shared memory identifier.
- LoDTensor serialization
  - A LoDTensor can be put into a multiprocessing.Queue or Pipe in one process and taken out in another process.
  - No copy of the tensor data occurs between processes (a minimal sketch of the idea follows below).
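Conceptually, the worker process writes the tensor payload into a memory-mapped file under `/dev/shm` and only ships a small identifier through the Queue; the receiving process maps the same file to rebuild the tensor. The actual mechanism is implemented inside the framework for LoDTensor; the following is only a minimal NumPy/mmap sketch of the idea, with illustrative helper names (`move_to_shm`, `rebuild_from_shm`) that are not part of this PR's API.

```python
import mmap
import os

import numpy as np

SHM_DIR = "/dev/shm"  # files created here are backed by shared memory

def move_to_shm(array, name):
    """Producer side (DataLoader worker): copy the data into a
    memory-mapped file and return a small, picklable identifier."""
    path = os.path.join(SHM_DIR, name)
    with open(path, "wb+") as f:
        f.truncate(array.nbytes)
        mm = mmap.mmap(f.fileno(), array.nbytes)
        mm[:] = array.tobytes()
        mm.close()
    # Only this lightweight identifier crosses the process boundary
    # (e.g. through multiprocessing.Queue), not the payload itself.
    return {"path": path, "shape": array.shape, "dtype": str(array.dtype)}

def rebuild_from_shm(ident):
    """Consumer side (main process): rebuild the array from the identifier
    without the payload ever being copied through the Queue."""
    with open(ident["path"], "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        arr = np.frombuffer(mm, dtype=ident["dtype"]).reshape(ident["shape"]).copy()
        mm.close()
    return arr
```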
Test Data (P40):
Compared to the original multi-process DataLoader
- Single card, 1 epoch (s)
model | DataLoader 1.7 | DataLoader New | Speedup |
---|---|---|---|
resnet | 62.04 | 53.88 | +15.1% |
mobilenet v1 | 6614.68 | 3303.86 | +100.2% |
- For mnist, se_resnet, ptb, and transformer, the data reading time with DataLoader 1.7 is already close to 0, so there is no obvious improvement.
- For mobilenet, DataLoader 1.7 in multi-process mode brings no acceleration because of the heavy burden on its child threads, so the improvement there is obvious.
Current performance
- mobilenet v1, 1 epoch train time (s)
mode | DataLoader Single-Process | DataLoader Multi-Process | Speedup |
---|---|---|---|
1 card | 5038.19 | 3303.86 | 52.5% |
4 cards | 1250.87 | 933.70 | 34.0% |
- mobilenet v1, approximate time to read the next batch during 1 epoch (s)
mode | DataLoader Single-Process | DataLoader Multi-Process |
---|---|---|
1 card | 0.39 | 0.02 ~ 0.13 |
4 cards | 0.05 ~ 0.07 | 0.0001 |
- Train time (s) of other simple models, 1 card, 1 epoch
model | Reader | DataLoader Single-Process | DataLoader Multi-Process | Fake Data | Batch Read Time (s) |
---|---|---|---|---|---|
mnist | 10.60 | 10.28 | 6.60 (+60.6%) | 6.16 | 0.0001 |
resnet | 94.69 | 83.98 | 53.88 (+75.7%) | 44.08 | 0.020 |
se_resnext | 178.92 | 132.29 | 125.51 (+42.6%) | 123.83 | 0.0001 |
ptb_lm | 120.33 | 108.56 | 108.27 (+11.1%) | 107.43 | 0.0001 |
Related PR:
Note:
- Docker `/dev/shm` size: This PR uses mmap (memory mapping) to implement shared memory, so you need to confirm that your shared memory is large enough, otherwise a bus error will occur. The docker default `/dev/shm` size is 64MB; you can change it with the docker `--shm-size` parameter, for example `docker run --name [Name of container] -it --shm-size=2g -v $PWD:/paddle <imagename> /bin/bash`. You can check docker's shared memory size with `df -h`.
- You need to make sure your `/dev/shm` size is larger than the result of the following calculation, otherwise a `Bus error` will be reported because of insufficient shared memory (a rough worked example follows):
  `DataLoader Num * queue capacity * 1 batch data size`
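For illustration, a back-of-the-envelope check of that formula; every concrete number below is an assumption, not a value taken from this PR.

```python
# Rough /dev/shm budget estimate for the formula above.
num_loaders = 1                   # DataLoader Num
queue_capacity = 10               # capacity of the DataLoader queue
batch_size = 32
sample_bytes = 3 * 224 * 224 * 4  # one float32 image of shape [3, 224, 224]

batch_bytes = batch_size * sample_bytes
required_bytes = num_loaders * queue_capacity * batch_bytes
print("~%.0f MB of /dev/shm needed" % (required_bytes / 2**20))  # ~184 MB here
```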
Solving a tricky problem: cleaning up shared-memory mapped files when the program exits unexpectedly
1. There are 3 cases:
    - The DataLoader uses all the data, finishes reading, and exits normally: no cleanup needed
    - The DataLoader uses only part of the data, breaks at some step, and exits normally: cleanup needed
    - The DataLoader is stopped halfway by the user with Ctrl+C and exits abnormally: cleanup needed
2. What needs to be cleaned up:
    - Memory-mapped files in the main process that did not get destructed in time (most of them in the BlockingQueue)
    - Memory-mapped files still sitting in the inter-process Queue
    - Memory-mapped files already allocated in a child process but not yet put into the inter-process Queue
3. How they are cleaned up (a simplified sketch of this pattern follows this list):
    - Files left in the main process, including those remaining in the BlockingQueue
      - Create a global Set singleton that records the main process's file descriptors, and clear it when the program exits
      - Why not keep a global Set singleton of BlockingQueues and clear it when the program exits?
        - That implementation was actually finished, but it always leaves a few files that cannot be cleaned up
        - Those leftover files are neither in the BlockingQueue nor in the inter-process Queue
        - They are presumably instances already read out by a thread but not yet put into the BlockingQueue, and never destructed because of the sudden exit
        - So the approach of recording file descriptors directly was adopted
    - Files left in the inter-process Queue
      - Create a global multiprocessing.Queue set that records the inter-process Queues created by the DataLoader
      - Implement register_exit_func to register a _cleanup function that runs when Python exits and cleans up every non-empty Queue in the multiprocessing.Queue set
      - This requires changes to the original DataLoader implementation
        - When _cleanup runs, the multiprocessing.Queue may already have been closed, and closing a Queue does not clean up its memory-mapped files
        - Queue.cancel_join_thread (allow_exit_without_flush) also exists only to let the process exit without waiting for the queued data to be flushed, avoiding hangs; since every Queue is now cleaned up at the end, this call is in theory no longer needed
        - Therefore, the places that actively called close and cancel_join_thread on the multiprocessing.Queue were removed, leaving those calls to happen automatically
    - Files left in child processes (very few, no more than one batch)
      - Maintain a global file-descriptor Set in the child process as well
      - Record a memory-mapped file in the Set when it is created, and remove it from the Set after the tensor list has been put into the inter-process Queue
      - Finally, clean up the child process's file-descriptor Set with the registered exit function
      - In this way, files that were created but never put into the inter-process Queue are also cleaned up
4. Preliminary tests show that, with the above mechanisms, the memory-mapped files are cleaned up both when training breaks partway through and on Ctrl+C
    - After running the unit test test_imperative_data_loader_fds_clear.py 20 times, there are no leftover files under /dev/shm
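A simplified Python-level sketch of the pattern described above: per-process registries of leftover memory-mapped files and inter-process Queues, drained and unlinked by an exit hook. Names such as `_shm_files`, `register_queue`, and `_cleanup` are illustrative; the actual PR registers its hooks inside the DataLoader code and also handles the BlockingQueue on the C++ side.

```python
import atexit
import os
import queue  # only for the Empty exception raised by get_nowait()

# Per-process registries of resources that may be left over on exit.
_shm_files = set()  # paths of memory-mapped files currently held by this process
_mp_queues = set()  # multiprocessing.Queue objects created by the DataLoader

def record_shm_file(path):
    """Called right after a memory-mapped file is created."""
    _shm_files.add(path)

def release_shm_file(path):
    """Called once the tensor built on this file has been handed over,
    e.g. after the tensor list was put into the inter-process Queue."""
    _shm_files.discard(path)

def register_queue(q):
    """Called whenever the DataLoader creates an inter-process Queue."""
    _mp_queues.add(q)

def _cleanup():
    # 1) Drain the non-empty inter-process Queues so the payloads they still
    #    hold (and the files backing them) can be released.
    for q in _mp_queues:
        try:
            while True:
                q.get_nowait()
        except queue.Empty:
            pass
        except (OSError, ValueError):
            pass  # the queue may already have been closed
    # 2) Unlink the files this process created but never got to destruct.
    for path in list(_shm_files):
        try:
            os.unlink(path)
        except OSError:
            pass
        _shm_files.discard(path)

# Runs at interpreter shutdown; a Ctrl+C that propagates to the top level
# still goes through interpreter shutdown, so that case is covered as well.
atexit.register(_cleanup)
```

Both the main process and each worker process would install such a hook, so a file created in a worker but never put into the Queue is also removed.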