New feature: thread local allocator, test=develop (!23989) · Merge request · PaddlePaddle / Paddle


Created by: Shixiaowei02

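This commit is part of multi-stream inference support and builds on https://github.com/PaddlePaddle/Paddle/pull/23737 .

1. The current GPU memory pool implicitly requires sequential computation on a single GPU stream. To make it easier to bind a compute stream to a thread, this adds ThreadLocalAllocator as the strategy AllocatorStrategy::kThreadLocal, under which each thread exclusively owns a CUDAThreadLocalAllocatorPool.
2. To support destroying memory blocks across threads and across scopes, each Allocation stores a smart pointer to its Allocator.
3. This change takes effect only when AnalysisConfig enables binding GPU streams to threads. In that case fraction_of_gpu_memory_to_use sizes the per-thread memory pool instead of the per-process memory pool. All other cases keep the existing strategy, so GPU memory allocation for existing training and inference is unaffected.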
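The ownership idea in points 1 and 2 can be sketched in a few lines of standalone C++. This is only an illustration under stated assumptions, not the code in this PR: `SketchPool`, `SketchAllocation`, and `ThreadLocalPool` are hypothetical names, and host `malloc`/`free` stand in for GPU memory. It shows one pool per thread, and a block that keeps its pool alive through a `std::shared_ptr` so it can be destroyed after the owning thread has exited.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <thread>
#include <utility>

class SketchAllocation;

// Hypothetical per-thread pool. Host malloc/free stand in for GPU memory.
class SketchPool : public std::enable_shared_from_this<SketchPool> {
 public:
  std::unique_ptr<SketchAllocation> Allocate(std::size_t size);
  // May be called from a thread other than the one that owns the pool.
  void Free(void* ptr) {
    std::free(ptr);
    live_blocks_.fetch_sub(1);
  }
  int live_blocks() const { return live_blocks_.load(); }

 private:
  std::atomic<int> live_blocks_{0};
};

// A memory block that holds a smart pointer to the pool it came from.
class SketchAllocation {
 public:
  SketchAllocation(void* ptr, std::size_t size,
                   std::shared_ptr<SketchPool> owner)
      : ptr_(ptr), size_(size), owner_(std::move(owner)) {}
  // The pool is guaranteed to be alive here because owner_ shares ownership.
  ~SketchAllocation() { owner_->Free(ptr_); }
  SketchAllocation(const SketchAllocation&) = delete;
  SketchAllocation& operator=(const SketchAllocation&) = delete;
  void* ptr() const { return ptr_; }
  std::size_t size() const { return size_; }

 private:
  void* ptr_;
  std::size_t size_;
  std::shared_ptr<SketchPool> owner_;
};

inline std::unique_ptr<SketchAllocation> SketchPool::Allocate(
    std::size_t size) {
  void* ptr = std::malloc(size);
  live_blocks_.fetch_add(1);
  return std::unique_ptr<SketchAllocation>(
      new SketchAllocation(ptr, size, shared_from_this()));
}

// Each thread lazily creates and exclusively uses its own pool.
inline std::shared_ptr<SketchPool> ThreadLocalPool() {
  thread_local std::shared_ptr<SketchPool> pool =
      std::make_shared<SketchPool>();
  return pool;
}

// Returns true if a worker thread gets a different pool than the caller.
inline bool PoolsAreDistinctPerThread() {
  SketchPool* caller_pool = ThreadLocalPool().get();
  SketchPool* worker_pool = nullptr;
  std::thread worker([&] { worker_pool = ThreadLocalPool().get(); });
  worker.join();
  return worker_pool != nullptr && worker_pool != caller_pool &&
         ThreadLocalPool().get() == caller_pool;
}

// Returns true if a block outlives its thread and still frees cleanly.
inline bool CrossThreadFreeIsSafe() {
  std::unique_ptr<SketchAllocation> block;
  std::weak_ptr<SketchPool> watcher;
  std::thread worker([&] {
    std::shared_ptr<SketchPool> pool = ThreadLocalPool();
    watcher = pool;
    block = pool->Allocate(256);
  });
  worker.join();  // The worker's thread_local handle is gone now.
  bool alive_after_thread_exit = !watcher.expired();
  block.reset();  // Frees the memory, then drops the last pool reference.
  return alive_after_thread_exit && watcher.expired();
}
```

In this sketch the pool is destroyed only when both the owning thread's handle and every outstanding block have released it, which is the property point 2 relies on. A real pool that accepts frees from other threads would also need its internal free lists to be synchronized; the sketch sidesteps that by delegating to `malloc`/`free`.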

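Two alternative approaches were also considered:

1. Refactoring CUDADeviceContextAllocator. However, a memory pool bound to the context returns memory late, which with some probability increases GPU memory usage.
2. Making only NaiveBestFitAllocator a thread-local variable. However, because the earlier BuddyAllocator does not use smart pointers, memory may still fail to be returned, or returning it may cause a segmentation fault.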

