Created by: chenwhql
This PR speeds up the dygraph DataLoader by adding two mechanisms:
- LoDTensor shared memory
  - The data of a LoDTensor created by a DataLoader child process is moved into shared memory.
  - The LoDTensor can then be reconstructed in another process from the shared memory identifier.
- LoDTensor serialization
  - A LoDTensor can be put into a multiprocessing.Queue or Pipe in one process and taken out in another process.
  - No copy of the tensor data occurs between processes (a minimal sketch of the idea follows below).
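Conceptually, the worker process writes the tensor payload into a memory-mapped file under `/dev/shm` and only ships a small identifier through the Queue; the receiving process maps the same file to rebuild the tensor. The actual mechanism is implemented inside the framework for LoDTensor; the following is only a minimal NumPy/mmap sketch of the idea, with illustrative helper names (`move_to_shm`, `rebuild_from_shm`) that are not part of this PR's API.

```python
import mmap
import os

import numpy as np

SHM_DIR = "/dev/shm"  # files created here are backed by shared memory

def move_to_shm(array, name):
    """Producer side (DataLoader worker): copy the data into a
    memory-mapped file and return a small, picklable identifier."""
    path = os.path.join(SHM_DIR, name)
    with open(path, "wb+") as f:
        f.truncate(array.nbytes)
        mm = mmap.mmap(f.fileno(), array.nbytes)
        mm[:] = array.tobytes()
        mm.close()
    # Only this lightweight identifier crosses the process boundary
    # (e.g. through multiprocessing.Queue), not the payload itself.
    return {"path": path, "shape": array.shape, "dtype": str(array.dtype)}

def rebuild_from_shm(ident):
    """Consumer side (main process): rebuild the array from the identifier
    without the payload ever being copied through the Queue."""
    with open(ident["path"], "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        arr = np.frombuffer(mm, dtype=ident["dtype"]).reshape(ident["shape"]).copy()
        mm.close()
    return arr
```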
Test Data (P40):
Compared to the original multi-process DataLoader
- Single card, 1 epoch (s)
model | DataLoader 1.7 | DataLoader New | Speedup |
---|---|---|---|
resnet | 62.04 | 53.88 | +15.1% |
mobilenet v1 | 6614.68 | 3303.86 | +100.2% |
- For mnist, se_resnet, ptb, and transformer, the data reading time with DataLoader 1.7 is already close to 0, so there is no obvious improvement.
- For mobilenet, DataLoader 1.7 in multi-process mode brings no acceleration because of the heavy burden on its child threads, so the improvement there is obvious.
Current performance
- mobilenet v1, 1 epoch train time (s)
mode | DataLoader Single-Process | DataLoader Multi-Process | Speedup |
---|---|---|---|
1 card | 5038.19 | 3303.86 | 52.5% |
4 cards | 1250.87 | 933.70 | 34.0% |
- mobilenet v1, approximate time to read the next batch during 1 epoch (s)
mode | DataLoader Single-Process | DataLoader Multi-Process |
---|---|---|
1 card | 0.39 | 0.02 ~ 0.13 |
4 cards | 0.05 ~ 0.07 | 0.0001 |
- Train time (s) of other simple models, 1 card, 1 epoch
model | Reader | DataLoader Single-Process | DataLoader Multi-Process | Fake Data | Batch Read Time (s) |
---|---|---|---|---|---|
mnist | 10.60 | 10.28 | 6.60 (+60.6%) | 6.16 | 0.0001 |
resnet | 94.69 | 83.98 | 53.88 (+75.7%) | 44.08 | 0.020 |
se_resnext | 178.92 | 132.29 | 125.51 (+42.6%) | 123.83 | 0.0001 |
ptb_lm | 120.33 | 108.56 | 108.27 (+11.1%) | 107.43 | 0.0001 |
Related PR:
Note:
- Docker `/dev/shm` size: This PR uses mmap (memory mapping) to implement shared memory, so you need to confirm that your shared memory is large enough, otherwise a bus error will occur. The docker default `/dev/shm` size is 64MB; you can change it with the docker `--shm-size` parameter, for example `docker run --name [Name of container] -it --shm-size=2g -v $PWD:/paddle <imagename> /bin/bash`. You can check docker's shared memory size with `df -h`.
- You need to make sure your `/dev/shm` size is larger than the result of the following calculation, otherwise a `Bus error` will be reported because of insufficient shared memory (a rough worked example follows):
  `DataLoader Num * queue capacity * 1 batch data size`
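For illustration, a back-of-the-envelope check of that formula; every concrete number below is an assumption, not a value taken from this PR.

```python
# Rough /dev/shm budget estimate for the formula above.
num_loaders = 1                   # DataLoader Num
queue_capacity = 10               # capacity of the DataLoader queue
batch_size = 32
sample_bytes = 3 * 224 * 224 * 4  # one float32 image of shape [3, 224, 224]

batch_bytes = batch_size * sample_bytes
required_bytes = num_loaders * queue_capacity * batch_bytes
print("~%.0f MB of /dev/shm needed" % (required_bytes / 2**20))  # ~184 MB here
```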
Solving a tricky problem: cleaning up shared-memory mapped files when the program exits unexpectedly
1. There are 3 cases:
    - The DataLoader uses all the data, finishes reading, and exits normally: no cleanup needed
    - The DataLoader uses only part of the data, breaks at some step, and exits normally: cleanup needed
    - The DataLoader is stopped halfway by the user with Ctrl+C and exits abnormally: cleanup needed
2. What needs to be cleaned up:
    - Memory-mapped files in the main process that did not get destructed in time (most of them in the BlockingQueue)
    - Memory-mapped files still sitting in the inter-process Queue
    - Memory-mapped files already allocated in a child process but not yet put into the inter-process Queue
3. How they are cleaned up (a simplified sketch of this pattern follows this list):
    - Files left in the main process, including those remaining in the BlockingQueue
      - Create a global Set singleton that records the main process's file descriptors, and clear it when the program exits
      - Why not keep a global Set singleton of BlockingQueues and clear it when the program exits?
        - That implementation was actually finished, but it always leaves a few files that cannot be cleaned up
        - Those leftover files are neither in the BlockingQueue nor in the inter-process Queue
        - They are presumably instances already read out by a thread but not yet put into the BlockingQueue, and never destructed because of the sudden exit
        - So the approach of recording file descriptors directly was adopted
    - Files left in the inter-process Queue
      - Create a global multiprocessing.Queue set that records the inter-process Queues created by the DataLoader
      - Implement register_exit_func to register a _cleanup function that runs when Python exits and cleans up every non-empty Queue in the multiprocessing.Queue set
      - This requires changes to the original DataLoader implementation
        - When _cleanup runs, the multiprocessing.Queue may already have been closed, and closing a Queue does not clean up its memory-mapped files
        - Queue.cancel_join_thread (allow_exit_without_flush) also exists only to let the process exit without waiting for the queued data to be flushed, avoiding hangs; since every Queue is now cleaned up at the end, this call is in theory no longer needed
        - Therefore, the places that actively called close and cancel_join_thread on the multiprocessing.Queue were removed, leaving those calls to happen automatically
    - Files left in child processes (very few, no more than one batch)
      - Maintain a global file-descriptor Set in the child process as well
      - Record a memory-mapped file in the Set when it is created, and remove it from the Set after the tensor list has been put into the inter-process Queue
      - Finally, clean up the child process's file-descriptor Set with the registered exit function
      - In this way, files that were created but never put into the inter-process Queue are also cleaned up
4. Preliminary tests show that, with the above mechanisms, the memory-mapped files are cleaned up both when training breaks partway through and on Ctrl+C
    - After running the unit test test_imperative_data_loader_fds_clear.py 20 times, there are no leftover files under /dev/shm
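A simplified Python-level sketch of the pattern described above: per-process registries of leftover memory-mapped files and inter-process Queues, drained and unlinked by an exit hook. Names such as `_shm_files`, `register_queue`, and `_cleanup` are illustrative; the actual PR registers its hooks inside the DataLoader code and also handles the BlockingQueue on the C++ side.

```python
import atexit
import os
import queue  # only for the Empty exception raised by get_nowait()

# Per-process registries of resources that may be left over on exit.
_shm_files = set()  # paths of memory-mapped files currently held by this process
_mp_queues = set()  # multiprocessing.Queue objects created by the DataLoader

def record_shm_file(path):
    """Called right after a memory-mapped file is created."""
    _shm_files.add(path)

def release_shm_file(path):
    """Called once the tensor built on this file has been handed over,
    e.g. after the tensor list was put into the inter-process Queue."""
    _shm_files.discard(path)

def register_queue(q):
    """Called whenever the DataLoader creates an inter-process Queue."""
    _mp_queues.add(q)

def _cleanup():
    # 1) Drain the non-empty inter-process Queues so the payloads they still
    #    hold (and the files backing them) can be released.
    for q in _mp_queues:
        try:
            while True:
                q.get_nowait()
        except queue.Empty:
            pass
        except (OSError, ValueError):
            pass  # the queue may already have been closed
    # 2) Unlink the files this process created but never got to destruct.
    for path in list(_shm_files):
        try:
            os.unlink(path)
        except OSError:
            pass
        _shm_files.discard(path)

# Runs at interpreter shutdown; a Ctrl+C that propagates to the top level
# still goes through interpreter shutdown, so that case is covered as well.
atexit.register(_cleanup)
```

Both the main process and each worker process would install such a hook, so a file created in a worker but never put into the Queue is also removed.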