vfio: relieve mmap_sem reader cacheline bouncing by holding it longer
hulk inclusion
category: feature
bugzilla: 13228
CVE: NA

---------------------------

Profiling shows significant time being spent on atomic ops in mmap_sem
reader acquisition.  mmap_sem is taken and dropped for every single base
page during pinning, so this is not surprising.

Reduce the number of times mmap_sem is taken by holding it for longer,
which relieves atomic cacheline bouncing.

Results for all VFIO page pinning patches
-----------------------------------------

The test measures the time from qemu invocation to the start of guest
boot.  The guest uses kvm with 320G memory backed with THP.

320G fits in a node on the test machine used here, so there was no
thrashing in reclaim because of __GFP_THISNODE in THP allocations[1].

     CPU:               2 nodes * 24 cores/node * 2 threads/core = 96 CPUs
                        Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
     memory:            754G split evenly between nodes
     scaling_governor:  performance

         patch 6                  patch 8                patch 9 (this one)
     -----------------------  ---------------------    ---------------------
 thr  speedup   average sec   speedup   average sec    speedup   average sec
   1           65.0 (± 0.6%)           65.2 (± 0.5%)             65.5 (± 0.4%)
   2    1.5x   42.8 (± 5.8%)    1.8x   36.0 (± 0.9%)     1.9x    34.4 (± 0.3%)
   3    1.9x   35.0 (±11.3%)    2.5x   26.4 (± 4.2%)     2.8x    23.7 (± 0.2%)
   4    2.3x   28.5 (± 1.3%)    3.1x   21.2 (± 2.8%)     3.6x    18.3 (± 0.3%)
   5    2.5x   26.2 (± 1.5%)    3.6x   17.9 (± 0.9%)     4.3x    15.1 (± 0.3%)
   6    2.7x   24.5 (± 1.8%)    4.0x   16.5 (± 3.0%)     5.1x    12.9 (± 0.1%)
   7    2.8x   23.5 (± 4.9%)    4.2x   15.4 (± 2.7%)     5.7x    11.5 (± 0.6%)
   8    2.8x   22.8 (± 1.8%)    4.2x   15.5 (± 4.7%)     6.4x    10.3 (± 0.8%)
  12    3.2x   20.2 (± 1.4%)    4.4x   14.7 (± 2.9%)     8.6x     7.6 (± 0.6%)
  16    3.3x   20.0 (± 0.7%)    4.3x   15.4 (± 1.3%)    10.2x     6.4 (± 0.6%)

At patch 6, lock_stat showed long reader wait time on mmap_sem writers,
leading to patch 8.  At patch 8, profiling revealed the issue with
mmap_sem described above.

Across all three patches, performance consistently improves as the
thread count increases.  The one exception is the antiscaling with
nthr=16 in patch 8: those mmap_sem atomics are really bouncing around
the machine.
The performance with patch 9 looks pretty good overall.  I'm working on
finding the next bottleneck, and this is where it stopped:

When nthr=16, the obvious issue profiling showed was contention on the
split PMD page table lock when pages are faulted in during the pinning
(>2% of the time).  A split PMD lock protects a PUD_SIZE-ed amount of
page table mappings (1G on x86), so if threads were operating on smaller
chunks and contending in the same PUD_SIZE range, this could be the
source of contention.  However, when nthr=16, threads operate on 5G
chunks (320G / 16 threads / (1 << KTASK_LOAD_BAL_SHIFT)), so this wasn't
the cause, and aligning the chunks on PUD_SIZE boundaries didn't help
either.

The time is short (6.4 seconds), so the next theory was threads
finishing at different times, but probes showed the threads all returned
within less than a millisecond of each other.

Kernel probes turned up a few smaller VFIO page pin calls besides the
heavy 320G call.  The chunk size given (PMD_SIZE) could affect thread
count and chunk size for these, so chunk size was increased from 2M to
1G.  This caused the split PMD contention to disappear, but with little
change in the runtime.  More digging required.

[1] lkml.kernel.org/r/20180925120326.24392-1-mhocko@kernel.org

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Hongbo Yao <yaohongbo@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>