vfio: relieve mmap_sem reader cacheline bouncing by holding it longer
hulk inclusion
category: feature
bugzilla: 13228
CVE: NA

---------------------------

Profiling shows significant time being spent on atomic ops in mmap_sem
reader acquisition.  mmap_sem is taken and dropped for every single base
page during pinning, so this is not surprising.

Reduce the number of times mmap_sem is taken by holding it for longer,
which relieves atomic cacheline bouncing.

Results for all VFIO page pinning patches
-----------------------------------------

The test measures the time from qemu invocation to the start of guest
boot.  The guest uses kvm with 320G memory backed with THP.

320G fits in a node on the test machine used here, so there was no
thrashing in reclaim because of __GFP_THISNODE in THP allocations[1].

     CPU:               2 nodes * 24 cores/node * 2 threads/core = 96 CPUs
                        Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
     memory:            754G split evenly between nodes
     scaling_governor:  performance

         patch 6                  patch 8                patch 9 (this one)
     -----------------------  ---------------------    ---------------------
 thr  speedup   average sec   speedup   average sec    speedup   average sec
   1           65.0 (± 0.6%)           65.2 (± 0.5%)             65.5 (± 0.4%)
   2    1.5x   42.8 (± 5.8%)    1.8x   36.0 (± 0.9%)     1.9x    34.4 (± 0.3%)
   3    1.9x   35.0 (±11.3%)    2.5x   26.4 (± 4.2%)     2.8x    23.7 (± 0.2%)
   4    2.3x   28.5 (± 1.3%)    3.1x   21.2 (± 2.8%)     3.6x    18.3 (± 0.3%)
   5    2.5x   26.2 (± 1.5%)    3.6x   17.9 (± 0.9%)     4.3x    15.1 (± 0.3%)
   6    2.7x   24.5 (± 1.8%)    4.0x   16.5 (± 3.0%)     5.1x    12.9 (± 0.1%)
   7    2.8x   23.5 (± 4.9%)    4.2x   15.4 (± 2.7%)     5.7x    11.5 (± 0.6%)
   8    2.8x   22.8 (± 1.8%)    4.2x   15.5 (± 4.7%)     6.4x    10.3 (± 0.8%)
  12    3.2x   20.2 (± 1.4%)    4.4x   14.7 (± 2.9%)     8.6x     7.6 (± 0.6%)
  16    3.3x   20.0 (± 0.7%)    4.3x   15.4 (± 1.3%)    10.2x     6.4 (± 0.6%)

At patch 6, lock_stat showed long reader wait time on mmap_sem writers,
leading to patch 8.  At patch 8, profiling revealed the issue with
mmap_sem described above.

Across all three patches, performance consistently improves as the
thread count increases.  The one exception is the antiscaling with
nthr=16 in patch 8: those mmap_sem atomics are really bouncing around
the machine.
The performance with patch 9 looks pretty good overall.  I'm working on
finding the next bottleneck, and this is where it stopped:

When nthr=16, the obvious issue profiling showed was contention on the
split PMD page table lock when pages are faulted in during the pinning
(>2% of the time).  A split PMD lock protects a PUD_SIZE-ed amount of
page table mappings (1G on x86), so if threads were operating on smaller
chunks and contending in the same PUD_SIZE range, this could be the
source of contention.  However, when nthr=16, threads operate on 5G
chunks (320G / 16 threads / (1 << KTASK_LOAD_BAL_SHIFT)), so this wasn't
the cause, and aligning the chunks on PUD_SIZE boundaries didn't help
either.

The time is short (6.4 seconds), so the next theory was threads
finishing at different times, but probes showed the threads all returned
within less than a millisecond of each other.

Kernel probes turned up a few smaller VFIO page pin calls besides the
heavy 320G call.  The chunk size given (PMD_SIZE) could affect thread
count and chunk size for these, so chunk size was increased from 2M to
1G.  This caused the split PMD contention to disappear, but with little
change in the runtime.  More digging required.

[1] lkml.kernel.org/r/20180925120326.24392-1-mhocko@kernel.org

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Hongbo Yao <yaohongbo@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>