- 30 Oct, 2019 40 commits
-
-
Authored by Suren Baghdasaryan
commit 04e048cf09d7b5fc995817cdc5ae1acd4482429c upstream.

When a process creates a new trigger by writing into /proc/pressure/* files, permissions to write such a file should be used to determine whether the process is allowed to do so or not. The current implementation would also require such a process to have the setsched capability. Setting the psi trigger thread's scheduling policy is an implementation detail and should not be exposed to the user level. Remove the permission check by using the _nocheck version of the function.

Suggested-by: Nick Kralevich <nnk@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: lizefan@huawei.com
Cc: mingo@redhat.com
Cc: akpm@linux-foundation.org
Cc: kernel-team@android.com
Cc: dennisszhou@gmail.com
Cc: dennis@kernel.org
Cc: hannes@cmpxchg.org
Cc: axboe@kernel.dk
Link: https://lkml.kernel.org/r/20190730013310.162367-1-surenb@google.com
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Authored by Peter Zijlstra
commit 14f5c7b46a41a595fc61db37f55721714729e59e upstream.

PSI defaults to a FIFO-99 thread, reduce this to FIFO-1.

FIFO-99 is the very highest priority available to SCHED_FIFO and is not a suitable default; it would indicate the psi work is the most important work on the machine.

Since Real-Time tasks will have pre-allocated memory and locked it in place, Real-Time tasks do not care about PSI. All it needs is to be above OTHER.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Authored by Josef Bacik
commit fd112c74652371a023f85d87b70bee7169e8f4d0 upstream.

With the psi stuff in place we can use the memstall flag to indicate pressure that happens from throttling.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Authored by Hui Zhu
commit d2fcd82bb83aab47c6d63aa8c960cd5edb578065 upstream

This is the third version, updated according to the comments from Sergey Senozhatsky https://lkml.org/lkml/2019/5/29/73 and Shakeel Butt https://lkml.org/lkml/2019/6/4/973

zswap compresses swap pages into a dynamically allocated RAM-based memory pool. The memory pool should be zbud, z3fold or zsmalloc. All of them allocate unmovable pages, which increases the number of unmovable page blocks and is bad for anti-fragmentation.

zsmalloc supports page migration if a movable page is requested:

handle = zs_malloc(zram->mem_pool, comp_len, GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE);

Commit "zpool: Add malloc_support_movable to zpool_driver" adds zpool_malloc_support_movable(), which checks malloc_support_movable to determine whether a zpool supports allocating movable memory.

This commit lets zswap allocate blocks with gfp __GFP_HIGHMEM | __GFP_MOVABLE if the zpool supports allocating movable memory.

The following is a test log from a PC that has 8G memory and 2G swap.

Without this commit:

~# echo lz4 > /sys/module/zswap/parameters/compressor
~# echo zsmalloc > /sys/module/zswap/parameters/zpool
~# echo 1 > /sys/module/zswap/parameters/enabled
~# swapon /swapfile
~# cd /home/teawater/kernel/vm-scalability/
/home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
/home/teawater/kernel/vm-scalability# ./case-anon-w-seq
2717908992 bytes / 4826062 usecs = 549973 KB/s
2717908992 bytes / 4864201 usecs = 545661 KB/s
2717908992 bytes / 4867015 usecs = 545346 KB/s
2717908992 bytes / 4915485 usecs = 539968 KB/s
397853 usecs to free memory
357820 usecs to free memory
421333 usecs to free memory
420454 usecs to free memory
/home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
Page block order: 9
Pages per block: 512

Free pages count per migrate type at order    0    1    2    3    4    5    6    7    8    9   10
Node 0, zone DMA,    type Unmovable           1    1    1    0    2    1    1    0    1    0    0
Node 0, zone DMA,    type Movable             0    0    0    0    0    0    0    0    0    1    3
Node 0, zone DMA,    type Reclaimable         0    0    0    0    0    0    0    0    0    0    0
Node 0, zone DMA,    type HighAtomic          0    0    0    0    0    0    0    0    0    0    0
Node 0, zone DMA,    type CMA                 0    0    0    0    0    0    0    0    0    0    0
Node 0, zone DMA,    type Isolate             0    0    0    0    0    0    0    0    0    0    0
Node 0, zone DMA32,  type Unmovable           6    5    8    6    6    5    4    1    1    1    0
Node 0, zone DMA32,  type Movable            25   20   20   19   22   15   14   11   11    5  767
Node 0, zone DMA32,  type Reclaimable         0    0    0    0    0    0    0    0    0    0    0
Node 0, zone DMA32,  type HighAtomic          0    0    0    0    0    0    0    0    0    0    0
Node 0, zone DMA32,  type CMA                 0    0    0    0    0    0    0    0    0    0    0
Node 0, zone DMA32,  type Isolate             0    0    0    0    0    0    0    0    0    0    0
Node 0, zone Normal, type Unmovable        4753 5588 5159 4613 3712 2520 1448  594  188   11    0
Node 0, zone Normal, type Movable            16    3  457 2648 2143 1435  860  459  223  224  296
Node 0, zone Normal, type Reclaimable         0    0   44   38   11    2    0    0    0    0    0
Node 0, zone Normal, type HighAtomic          0    0    0    0    0    0    0    0    0    0    0
Node 0, zone Normal, type CMA                 0    0    0    0    0    0    0    0    0    0    0
Node 0, zone Normal, type Isolate             0    0    0    0    0    0    0    0    0    0    0

Number of blocks type   Unmovable     Movable Reclaimable  HighAtomic         CMA     Isolate
Node 0, zone DMA                1           7           0           0           0           0
Node 0, zone DMA32              4        1652           0           0           0           0
Node 0, zone Normal           931        1485          15           0           0           0

With this commit:

~# echo lz4 > /sys/module/zswap/parameters/compressor
~# echo zsmalloc > /sys/module/zswap/parameters/zpool
~# echo 1 > /sys/module/zswap/parameters/enabled
~# swapon /swapfile
~# cd /home/teawater/kernel/vm-scalability/
/home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
/home/teawater/kernel/vm-scalability# ./case-anon-w-seq
2717908992 bytes / 4689240 usecs = 566020 KB/s
2717908992 bytes / 4760605 usecs = 557535 KB/s
2717908992 bytes / 4803621 usecs = 552543 KB/s
2717908992 bytes / 5069828 usecs = 523530 KB/s
431546 usecs to free memory
383397 usecs to free memory
456454 usecs to free memory
224487 usecs to free memory
/home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
Page block order: 9
Pages per block: 512

Free pages count per migrate type at order    0    1    2    3    4    5    6    7    8    9   10
Node 0, zone DMA,    type Unmovable           1    1    1    0    2    1    1    0    1    0    0
Node 0, zone DMA,    type Movable             0    0    0    0    0    0    0    0    0    1    3
Node 0, zone DMA,    type Reclaimable         0    0    0    0    0    0    0    0    0    0    0
Node 0, zone DMA,    type HighAtomic          0    0    0    0    0    0    0    0    0    0    0
Node 0, zone DMA,    type CMA                 0    0    0    0    0    0    0    0    0    0    0
Node 0, zone DMA,    type Isolate             0    0    0    0    0    0    0    0    0    0    0
Node 0, zone DMA32,  type Unmovable          10    8   10    9   10    4    3    2    3    0    0
Node 0, zone DMA32,  type Movable            18   12   14   16   16   11    9    5    5    6  775
Node 0, zone DMA32,  type Reclaimable         0    0    0    0    0    0    0    0    0    0    1
Node 0, zone DMA32,  type HighAtomic          0    0    0    0    0    0    0    0    0    0    0
Node 0, zone DMA32,  type CMA                 0    0    0    0    0    0    0    0    0    0    0
Node 0, zone DMA32,  type Isolate             0    0    0    0    0    0    0    0    0    0    0
Node 0, zone Normal, type Unmovable        2669 1236  452  118   37   14    4    1    2    3    0
Node 0, zone Normal, type Movable          3850 6086 5274 4327 3510 2494 1520  934  438  220  470
Node 0, zone Normal, type Reclaimable        56   93  155  124   47   31   17    7    3    0    0
Node 0, zone Normal, type HighAtomic          0    0    0    0    0    0    0    0    0    0    0
Node 0, zone Normal, type CMA                 0    0    0    0    0    0    0    0    0    0    0
Node 0, zone Normal, type Isolate             0    0    0    0    0    0    0    0    0    0    0

Number of blocks type   Unmovable     Movable Reclaimable  HighAtomic         CMA     Isolate
Node 0, zone DMA                1           7           0           0           0           0
Node 0, zone DMA32              4        1650           2           0           0           0
Node 0, zone Normal            79        2326          26           0           0           0

You can see that the number of unmovable page blocks is decreased when the kernel has this commit.

Link: http://lkml.kernel.org/r/20190605100630.13293-2-teawaterz@linux.alibaba.com
Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
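The effect is easiest to see as a share of pageblocks in zone Normal. A minimal Python sketch using only the "Number of blocks" figures quoted in the log above; the helper name `unmovable_share` is made up for illustration:

```python
# Pageblock counts for "Node 0, zone Normal", copied from the test log above.
# Column order: Unmovable, Movable, Reclaimable, HighAtomic, CMA, Isolate.
WITHOUT_COMMIT = [931, 1485, 15, 0, 0, 0]
WITH_COMMIT = [79, 2326, 26, 0, 0, 0]


def unmovable_share(blocks):
    """Return the fraction of pageblocks whose migrate type is Unmovable."""
    return blocks[0] / sum(blocks)


if __name__ == "__main__":
    for label, blocks in (("without", WITHOUT_COMMIT), ("with", WITH_COMMIT)):
        print(f"{label:8s} {blocks[0]:4d} of {sum(blocks)} blocks "
              f"({unmovable_share(blocks):.1%} unmovable)")
```

Both runs account for the same 2431 pageblocks, so the drop from 931 to 79 unmovable blocks is a drop from roughly 38% to roughly 3% of the zone.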
-
Authored by Hui Zhu
commit c165f25d23ecb2f9f121ced20435415b931219e2 upstream

As a zpool_driver, zsmalloc can allocate movable memory because it supports migrating pages. But zbud and z3fold cannot allocate movable memory.

Add malloc_support_movable to zpool_driver. If a zpool_driver supports allocating movable memory, set it to true. Also add zpool_malloc_support_movable(), which checks malloc_support_movable to determine whether a zpool supports allocating movable memory.

Link: http://lkml.kernel.org/r/20190605100630.13293-1-teawaterz@linux.alibaba.com
Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
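How a caller such as zswap might use the new flag can be modelled in a few lines. This is a Python sketch of the decision only, not the kernel code: the `Gfp` values are arbitrary stand-ins, the base flags in `zswap_alloc_gfp` are an assumption, and only the names `malloc_support_movable` and `zpool_malloc_support_movable` come from the commit message.

```python
from dataclasses import dataclass
from enum import IntFlag, auto


class Gfp(IntFlag):
    """Stand-ins for kernel GFP bits; the numeric values are arbitrary."""
    NORETRY = auto()
    NOWARN = auto()
    HIGHMEM = auto()
    MOVABLE = auto()


@dataclass(frozen=True)
class ZpoolDriver:
    name: str
    malloc_support_movable: bool = False


# Only zsmalloc can migrate its pages, per the commit message.
DRIVERS = {
    "zsmalloc": ZpoolDriver("zsmalloc", malloc_support_movable=True),
    "zbud": ZpoolDriver("zbud"),
    "z3fold": ZpoolDriver("z3fold"),
}


def zpool_malloc_support_movable(driver):
    """Report whether allocations from this driver may be movable."""
    return driver.malloc_support_movable


def zswap_alloc_gfp(driver):
    """Pick allocation flags; the base flags are an assumption of this sketch."""
    gfp = Gfp.NORETRY | Gfp.NOWARN
    if zpool_malloc_support_movable(driver):
        gfp |= Gfp.HIGHMEM | Gfp.MOVABLE
    return gfp
```

Keeping the capability as a per-driver boolean means the caller never needs to know which backend it is talking to.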
-
Authored by Gen Zhang
commit 425aa0e1d01513437668fa3d4a971168bbaa8515 upstream.

In function ip_ra_control(), the pointer new_ra is allocated memory via kmalloc() and is used in the code that follows. However, when there is a memory allocation error, kmalloc() fails and returns NULL, so a null pointer dereference may happen and crash the kernel. Therefore, we should check the return value and handle the error.

Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Authored by Gen Zhang
commit 4e78921ba4dd0aca1cc89168f45039add4183f8e upstream.

The old_memmap flow in efi_call_phys_prolog() performs numerous memory allocations, and either does not check for failure at all, or it does but fails to propagate it back to the caller, which may end up calling into the firmware with an incomplete 1:1 mapping.

So let's fix this by returning NULL from efi_call_phys_prolog() on memory allocation failures only, and by handling this condition in the caller. Also, clean up any half baked sets of page tables that we may have created before returning with a NULL return value.

Note that any failure at this level will trigger a panic() two levels up, so none of this makes a huge difference, but it is a nice cleanup nonetheless.

[ardb: update commit log, add efi_call_phys_epilog() call on error path]

Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Bradford <robert.bradford@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20190525112559.7917-2-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Authored by Gen Zhang
commit 95baa60a0da80a0143e3ddd4d3725758b4513825 upstream.

In function ip6_ra_control(), the pointer new_ra is allocated memory via kmalloc() and is used in the code that follows. However, when there is a memory allocation error, kmalloc() fails and returns NULL, so a null pointer dereference may happen and crash the kernel. Therefore, we should check the return value and handle the error.

Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Authored by Gen Zhang
commit f9e3ebeea4521652318af903cddeaf033527e93e upstream.

In _ctl_ioctl_main(), 'ioctl_header' is fetched the first time from userspace. 'ioctl_header.ioc_number' is then checked. The legal result is saved to 'ioc'. Then, in condition MPT3COMMAND, the whole struct is fetched again from userspace. Then _ctl_do_mpt_command() is called with 'ioc' and 'karg' as inputs.

However, a malicious user can change the 'ioc_number' between the two fetches, which causes a potential security issue. That is, a malicious user can provide a valid 'ioc_number' to pass the check in the first fetch, and then modify it before the second fetch.

To fix this, we need to recheck the 'ioc_number' in the second fetch.

Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Acked-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
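The double-fetch problem described above can be reproduced in a small user-space model. This is a Python sketch, not the driver code: `RacyUserMemory`, `VALID_IOCS` and both handler names are hypothetical, and the exceptions stand in for whatever error codes the driver returns.

```python
class RacyUserMemory:
    """User memory whose ioc_number may change between two kernel fetches."""

    def __init__(self, *ioc_numbers):
        self._values = list(ioc_numbers)

    def copy_from_user(self):
        # Each fetch returns the next value; the last value repeats.
        if len(self._values) > 1:
            return self._values.pop(0)
        return self._values[0]


VALID_IOCS = {0, 1}  # hypothetical set of adapters that exist


def ioctl_unchecked(user):
    """Validate the first fetch only, then trust the second one."""
    header_ioc = user.copy_from_user()      # first fetch
    if header_ioc not in VALID_IOCS:
        raise LookupError("no such adapter")
    karg_ioc = user.copy_from_user()        # second fetch, trusted blindly
    return karg_ioc


def ioctl_rechecked(user):
    """Same flow, but compare the second fetch against the validated value."""
    header_ioc = user.copy_from_user()
    if header_ioc not in VALID_IOCS:
        raise LookupError("no such adapter")
    karg_ioc = user.copy_from_user()
    if karg_ioc != header_ioc:              # recheck after the second fetch
        raise ValueError("ioc_number changed between fetches")
    return karg_ioc
```

With `RacyUserMemory(1, 7)` the unchecked handler ends up acting on the unvalidated value 7, while the rechecked handler rejects the request.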
-
Authored by Gen Zhang
commit fcdf445ff42f036d22178b49cf64e92d527c1330 upstream.

In sunxi_divs_clk_setup(), 'derived_name' is allocated by kstrndup(), which returns NULL on failure. 'derived_name' should be checked.

Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Authored by Gen Zhang
commit efa9ace68e487ddd29c2b4d6dd23242158f1f607 upstream.

In dlpar_parse_cc_property(), 'prop->name' is allocated by kstrdup(). kstrdup() may return NULL, so the result should be checked and the error handled. And prop should be freed if 'prop->name' is NULL.

Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Authored by Miguel Bernal Marin
commit f74dc880098b4a29f76d756b888fb31d81ad9a0c upstream.

Suggested-by: Tim Pepper <timothy.c.pepper@linux.intel.com>
Signed-off-by: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Acked-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Authored by Arjan van de Ven
commit ab6973aed6200510662856afce5e3d1e386b7b64 upstream.

The e1000e driver is a great user of the usleep_range() API, and has many nice ranges that in principle help power management.

However, the ranges that are used only during system startup are very long (and can easily add 100 msec to the boot time), while the power savings of such long ranges are irrelevant due to the one-off, boot-only nature of these functions.

This patch shrinks some of the longest ranges to be shorter (while still using a power friendly 1 msec range); this saves 100msec+ of boot time on my BDW NUCs.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Authored by Jason Xing
The user receives POLLPRI correctly only on the first run of a poll()-based monitor. After that, the user always fails to receive the event signal.

Reproduce case:
1. Get the monitor code in Documentation/accounting/psi.txt
2. Run it, and wait for the event to be triggered.
3. Kill and restart the process.

The question is why we can end up with poll_scheduled = 1 but the work not running (which would reset it to 0). And the answer is because the scheduling side sees group->poll_kworker under RCU protection and then schedules it, but here we cancel the work and destroy the worker. The cancel needs to pair with resetting the poll_scheduled flag.

Signed-off-by: Jason Xing <kerneljasonxing@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
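The monitor referred to in step 1 is a C program in Documentation/accounting/psi.txt. A rough Python equivalent, as a sketch only: the trigger values `some 150000 1000000` follow that document's example, while the function names `format_trigger` and `wait_for_events` are made up here. It needs a kernel with PSI enabled and write permission on the pressure file.

```python
import os
import select


def format_trigger(kind, stall_us, window_us):
    """Build the bytes written to a /proc/pressure/* file to arm a trigger."""
    if kind not in ("some", "full"):
        raise ValueError("kind must be 'some' or 'full'")
    # The C example in psi.txt writes the terminating NUL as well, so keep it.
    return f"{kind} {stall_us} {window_us}".encode() + b"\0"


def wait_for_events(path="/proc/pressure/memory", count=3):
    """Arm a trigger on `path` and block until `count` POLLPRI events arrive."""
    fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    try:
        os.write(fd, format_trigger("some", 150000, 1000000))
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        seen = 0
        while seen < count:
            for _fd, events in poller.poll():
                if events & select.POLLERR:
                    raise OSError("pressure event source is gone")
                if events & select.POLLPRI:
                    seen += 1
        return seen
    finally:
        os.close(fd)
```

Calling `wait_for_events()`, killing the process once an event has fired, and starting it again corresponds to steps 2 and 3 of the reproducer.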
-
Authored by Eric Whitney
commit 7bd75230b43727b258a4f7a59d62114cffe1b6c8 upstream.

Ext4 may not free clusters correctly when punching holes in bigalloc file systems under high load conditions. If it's not possible to extend and restart the journal in ext4_ext_rm_leaf() when preparing to remove blocks from a punched region, a retry of the entire punch operation is triggered in ext4_ext_remove_space(). This causes a partial cluster to be set to the first cluster in the extent found to the right of the punched region. However, if the punch operation prior to the retry had made enough progress to delete one or more extents and a partial cluster candidate for freeing had already been recorded, the retry would overwrite the partial cluster. The loss of this information makes it impossible to correctly free the original partial cluster in all cases.

This bug can cause generic/476 to fail when run as part of xfstests-bld's bigalloc and bigalloc_1k test cases. The failure is reported when e2fsck detects bad iblocks counts greater than expected in units of whole clusters and also detects a number of negative block bitmap differences equal to the iblocks discrepancy in cluster units.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Authored by Gabriel Krisman Bertazi
commit 799578ab16e86b074c184ec5abbda0bc698c7b0b upstream.

Enabling DX_DEBUG triggers the build error below. info is an attribute of the dxroot structure.

linux/fs/ext4/namei.c:2264:12: error: ‘info’ undeclared (first use in this function); did you mean ‘insl’?
    info->indirect_levels));

Fixes: e08ac99f ("ext4: add largedir feature")
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Authored by Dave Chinner
commit 64081362e8ff4587b4554087f3cfc73d3e0a4cd7 upstream.

We've recently seen a workload on XFS filesystems with a repeatable deadlock between background writeback and a multi-process application doing concurrent writes and fsyncs to a small range of a file.

range_cyclic
writeback               Process 1               Process 2

xfs_vm_writepages
 write_cache_pages
  writeback_index = 2
  cycled = 0
  ....
  find page 2 dirty
  lock Page 2
  ->writepage
   page 2 writeback
   page 2 clean
   page 2 added to bio
  no more pages
                        write()
                        locks page 1
                        dirties page 1
                        locks page 2
                        dirties page 2
                        fsync()
                        ....
                        xfs_vm_writepages
                        write_cache_pages
                         start index 0
                         find page 1 towrite
                         lock Page 1
                         ->writepage
                          page 1 writeback
                          page 1 clean
                          page 1 added to bio
                         find page 2 towrite
                         lock Page 2
                         page 2 is writeback
                         <blocks>
                                                write()
                                                locks page 1
                                                dirties page 1
                                                fsync()
                                                ....
                                                xfs_vm_writepages
                                                write_cache_pages
                                                 start index 0
  !done && !cycled
   sets index to 0, restarts lookup
  find page 1 dirty
                                                 find page 1 towrite
                                                 lock Page 1
                                                 page 1 is writeback
                                                 <blocks>
  lock Page 1
  <blocks>

DEADLOCK because:

- process 1 needs page 2 writeback to complete to make enough progress to issue IO pending for page 1
- writeback needs page 1 writeback to complete so process 2 can progress and unlock the page it is blocked on, then it can issue the IO pending for page 2
- process 2 can't make progress until process 1 issues IO for page 1

The underlying cause of the problem here is that range_cyclic writeback is processing pages in descending index order as we hold higher index pages in a structure controlled from above write_cache_pages(). The write_cache_pages() caller needs to be able to submit these pages for IO before write_cache_pages restarts writeback at mapping index 0 to avoid wcp inverting the page lock/writeback wait order.

generic_writepages() is not susceptible to this bug as it has no private context held across write_cache_pages() - filesystems using this infrastructure always submit pages in ->writepage immediately and so there is no problem with range_cyclic going back to mapping index 0.

However:

mpage_writepages() has a private bio context,
exofs_writepages() has page_collect
fuse_writepages() has fuse_fill_wb_data
nfs_writepages() has nfs_pageio_descriptor
xfs_vm_writepages() has xfs_writepage_ctx

All of these ->writepages implementations can hold pages under writeback in their private structures until write_cache_pages() returns, and hence they are all susceptible to this deadlock.

Also worth noting is that ext4 has its own bastardised version of write_cache_pages() and so it /may/ have an equivalent deadlock. I looked at the code long enough to understand that it has a similar retry loop for range_cyclic writeback reaching the end of the file and then promptly ran away before my eyes bled too much. I'll leave it for the ext4 developers to determine whether their code actually has this deadlock and how to fix it if it has.

There are a few ways I can see to avoid this deadlock. There are probably more, but these are the first I've thought of:

1. get rid of range_cyclic altogether

2. range_cyclic always stops at EOF, and we start again from writeback index 0 on the next call into write_cache_pages()

2a. wcp also returns EAGAIN to ->writepages implementations to indicate range cyclic has hit EOF. writepages implementations can then flush the current context and call wcp again to continue. i.e. lift the retry into the ->writepages implementation

3. range_cyclic uses trylock_page() rather than lock_page(), and it skips pages it can't lock without blocking. It will already do this for pages under writeback, so this seems like a no-brainer

3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid blocking as per pages under writeback.

I don't think #1 is an option - range_cyclic prevents frequently dirtied lower file offset from starving background writeback of rarely touched higher file offsets.

[...] performance as going back to the start of the file implies an immediate seek. We'll have exactly the same number of seeks if we switch writeback to another inode, and then come back to this one later and restart from index 0.

[...] retry loop up into the wcp caller means we can issue IO on the pending pages before calling wcp again, and so avoid locking or waiting on pages in the wrong order. I'm not convinced we need to do this given that we get the same thing from #2 on the next writeback call from the writeback infrastructure.

[...] inversion problem, just prevents it from becoming a deadlock situation. I'd prefer we fix the inversion, not sweep it under the carpet like this.

[...] band-aid fix of #3.

So it seems that the simplest way to fix this issue is to implement solution #2

Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
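Solution #2 can be illustrated with a toy model of a range_cyclic pass. This is a Python sketch under stated assumptions, not the kernel implementation: pages are plain integers in a set, and `state`, `write_cache_pages` and `writepages` here are simplified stand-ins that only share names with the kernel functions.

```python
def write_cache_pages(dirty, state, nr_to_write):
    """One range_cyclic pass over `dirty` (a set of page indices).

    Scans upward from state["writeback_index"] and stops at EOF instead of
    wrapping back to index 0 inside the same call.
    """
    start = state["writeback_index"]
    written = []
    for index in sorted(i for i in dirty if i >= start):
        if len(written) >= nr_to_write:
            break
        written.append(index)
    dirty.difference_update(written)
    if written and len(written) >= nr_to_write:
        # Budget used up: resume after the last page next time.
        state["writeback_index"] = written[-1] + 1
    else:
        # Hit EOF: the next call starts again from index 0.
        state["writeback_index"] = 0
    return written


def writepages(dirty, state, nr_to_write=16):
    """Hypothetical ->writepages: submit the private context after each pass."""
    batch = write_cache_pages(dirty, state, nr_to_write)
    submitted = list(batch)  # stands in for submitting the pending bio
    return submitted
```

Because every pass only moves upward through the file and the caller submits its pending IO before the next pass begins, no pass locks a lower-index page while still holding higher-index pages under writeback.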
-
Authored by Ming Lei
commit 2a5cf35cd6c56b2924bce103413ad3381bdc31fa upstream.

There are actually two kinds of discard merge:

- one is the normal discard merge, just like normal read/write request, and call it single-range discard
- another is the multi-range discard, queue_max_discard_segments(rq->q) > 1

For the former case, queue_max_discard_segments(rq->q) is 1, and we should handle this kind of discard merge like the normal read/write request.

This patch fixes the following kernel panic issue[1], which is caused by not removing the single-range discard request from elevator queue.

Guangwu has one raid discard test case, in which this issue is a bit easier to trigger, and I verified that this patch can fix the kernel panic issue in Guangwu's test case.

[1] kernel panic log from Jens's report

BUG: unable to handle kernel NULL pointer dereference at 0000000000000148
PGD 0 P4D 0.
Oops: 0000 [#1] SMP PTI
CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted 4.20.0-rc3-00649-ge64d9a554a91-dirty #14
Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM08 03/03/2017
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:blk_mq_get_driver_tag+0x81/0x120
Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02
RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246
RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8
RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000
R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300
R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000
FS: 0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 blk_mq_dispatch_rq_list+0xec/0x480
 ? elv_rb_del+0x11/0x30
 blk_mq_do_dispatch_sched+0x6e/0xf0
 blk_mq_sched_dispatch_requests+0xfa/0x170
 __blk_mq_run_hw_queue+0x5f/0xe0
 process_one_work+0x154/0x350
 worker_thread+0x46/0x3c0
 kthread+0xf5/0x130
 ? process_one_work+0x350/0x350
 ? kthread_destroy_worker+0x50/0x50
 ret_from_fork+0x1f/0x30
Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme nvme_core fuse sg loop efivarfs autofs4
CR2: 0000000000000148
---[ end trace 340a1fb996df1b9b ]---
RIP: 0010:blk_mq_get_driver_tag+0x81/0x120
Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02

Fixes: 445251d0 ("blk-mq: fix discard merge with scheduler attached")
Reported-by: Jens Axboe <axboe@kernel.dk>
Cc: Guangwu Zhang <guazhang@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Authored by Olga Kornievskaia
commit 44f411c353bf6d98d5a34f8f1b8605d43b2e50b8 upstream.

Running "./nfstest_delegation --runtest recall26" uncovers that the client doesn't recover the lock when we have an appending open, where the initial open got a write delegation.

Instead of checking the passed in open context against the file lock's open context, check that the state is the same.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Authored by Jianchao Wang
commit 69840466086d2248898020a08dda52732686c4e6 upstream.

There are two cases when handling a DISCARD merge. If max_discard_segments == 1, the bios/requests need to be contiguous to merge. If max_discard_segments > 1, every bio is taken as a range and different ranges needn't be contiguous.

But now, attempt_merge screws this up. It always considers contiguity for DISCARD in the case max_discard_segments > 1 and cannot merge contiguous DISCARD in the case max_discard_segments == 1, because rq_attempt_discard_merge always returns false in this case. This patch fixes both of the cases above.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
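The two cases can be written down as one small decision function. This is a Python sketch of the rule as stated in the commit message, not the block-layer code: the `Discard` type, the function name `discard_merge_kind` and the string return values are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Discard:
    sector: int
    nr_sectors: int


def discard_merge_kind(max_discard_segments: int,
                       req: Discard, nxt: Discard) -> Optional[str]:
    """Decide how `nxt` may be merged into `req`, or None if it may not."""
    if max_discard_segments > 1:
        # Multi-range discard: every bio is its own range, so the two
        # requests do not need to be contiguous.
        return "multi-range"
    if req.sector + req.nr_sectors == nxt.sector:
        # Single-range discard: merge like a normal read/write request,
        # which requires contiguity.
        return "back"
    return None
```

For example, with `max_discard_segments == 1` a discard of sectors 0-7 merges with one starting at sector 8 but not with one starting at sector 16, while with `max_discard_segments > 1` both are accepted as separate ranges.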
-
Authored by Heinz Mauelshagen
commit 74694bcbdf7e28a5ad548cdda9ac56d30be00d13 upstream.

Sending a check/repair message infrequently leads to -EBUSY instead of properly identifying an active resync. This occurs because raid_message() is testing recovery bits in a racy way.

Fix by calling decipher_sync_action() from raid_message() to properly identify the idle state of the RAID device.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Authored by Dave Chinner
commit 37fd1678245f7a5898c1b05128bc481fb403c290 upstream.

When looking at a 4.18 based KASAN use after free report, I noticed that racing xfs_buf_rele() may race on dropping the last reference to the buffer and taking the buffer lock. This was the symptom displayed by the KASAN report, but the actual issue that was reported had already been fixed in 4.19-rc1 by commit e339dd8d ("xfs: use sync buffer I/O for sync delwri queue submission").

Despite this, I think there is still an issue with xfs_buf_rele() in this code:

        release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
        spin_lock(&bp->b_lock);
        if (!release) {
        .....

If two threads race on the b_lock after both dropping a reference and one dropping the last reference so release = true, we end up with:

CPU 0                           CPU 1
atomic_dec_and_lock()
                                atomic_dec_and_lock()
                                spin_lock(&bp->b_lock)
spin_lock(&bp->b_lock)
<spins>
                                <release = true bp->b_lru_ref = 0>
                                <remove from lists>
                                freebuf = true
                                spin_unlock(&bp->b_lock)
                                xfs_buf_free(bp)
<gets lock, reading and writing freed memory>
<accesses freed memory>
spin_unlock(&bp->b_lock) <reads/writes freed memory>

IOWs, we can't safely take bp->b_lock after dropping the hold reference because the buffer may go away at any time after we drop that reference. However, this can be fixed simply by taking the bp->b_lock before we drop the reference.

It is safe to nest the pag_buf_lock inside bp->b_lock as the pag_buf_lock is only used to serialise against lookup in xfs_buf_find() and no other locks are held over or under the pag_buf_lock there. Make this clear by documenting the buffer lock orders at the top of the file.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
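The corrected lock order can be modelled with ordinary threads. This is a Python sketch, not the XFS code: `Buf`, `buf_rele` and `drop_all` are hypothetical, and unlike atomic_dec_and_lock() the model always takes `pag_buf_lock`, which is a simplification.

```python
import threading


class Buf:
    """A reference-counted buffer with its own lock."""

    def __init__(self, hold):
        self.b_lock = threading.Lock()
        self.b_hold = hold
        self.freed = False


def buf_rele(bp, pag_buf_lock):
    """Drop one reference, taking b_lock before the hold count is touched."""
    with bp.b_lock:
        # pag_buf_lock nests inside b_lock, matching the order in the fix.
        with pag_buf_lock:
            bp.b_hold -= 1
            release = bp.b_hold == 0
    if release:
        bp.freed = True  # stands in for xfs_buf_free(bp)
    return release


def drop_all(n_threads=8):
    """Release n_threads references concurrently and return the buffer."""
    bp = Buf(hold=n_threads)
    pag_buf_lock = threading.Lock()
    threads = [threading.Thread(target=buf_rele, args=(bp, pag_buf_lock))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bp
```

Since the count only changes while `b_lock` is held, the thread that sees it reach zero knows every other holder has already left the critical section, so nobody can take the lock of a buffer that is about to be freed.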
-
Authored by Will Deacon
commit 6e693b3ffecb0b478c7050b44a4842854154f715 upstream.

Commit 594cc251fdd0 ("make 'user_access_begin()' do 'access_ok()'") makes the access_ok() check part of the user_access_begin() preceding a series of 'unsafe' accesses. This has the desirable effect of ensuring that all 'unsafe' accesses have been range-checked, without having to pick through all of the callsites to verify whether the appropriate checking has been made.

However, the consolidated range check does not inhibit speculation, so it is still up to the caller to ensure that they are not susceptible to any speculative side-channel attacks for user addresses that ultimately fail the access_ok() check.

This is an oversight, so use __uaccess_begin_nospec() to ensure that speculation is inhibited until the access_ok() check has passed.

Reported-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Linus Torvalds
commit 594cc251fdd0d231d342d88b2fdff4bc42fb0690 upstream.

Originally, the rule used to be that you'd have to do access_ok() separately, and then user_access_begin() before actually doing the direct (optimized) user access.

But experience has shown that people then decide not to do access_ok() at all, and instead rely on it being implied by other operations or similar. Which makes it very hard to verify that the access has actually been range-checked.

If you use the unsafe direct user accesses, hardware features (either SMAP - Supervisor Mode Access Protection - on x86, or PAN - Privileged Access Never - on ARM) do force you to use user_access_begin(). But nothing really forces the range check.

By putting the range check into user_access_begin(), we actually force people to do the right thing (tm), and the range check will be visible near the actual accesses. We have way too long a history of people trying to avoid them.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ Shile: fix the following conflicts by adding dummy arguments ]
Conflicts:
    kernel/compat.c
    kernel/exit.c
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
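The rule change in this commit can be illustrated with a small userspace mock. This is a sketch only: `demo_access_ok()`, `demo_user_access_begin()`, `demo_user_access_end()`, `DEMO_USER_LIMIT` and the `demo_access_enabled` flag are hypothetical stand-ins, not kernel API, and the limit value is an arbitrary assumption. It shows the range check being folded into the "begin" step, so a caller cannot open the access window without also passing the check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical upper bound of the user address range (assumption). */
#define DEMO_USER_LIMIT 0x0000800000000000ULL

/* Stands in for the SMAP/PAN state: true while user access is open. */
static bool demo_access_enabled;

/* Overflow-safe range check: [addr, addr + size) must end at or below the limit. */
static bool demo_access_ok(uint64_t addr, uint64_t size)
{
    return size <= DEMO_USER_LIMIT && addr <= DEMO_USER_LIMIT - size;
}

/*
 * Open the access window only if the range check passes. The caller
 * must test the return value and skip the accesses on failure.
 */
static bool demo_user_access_begin(uint64_t addr, uint64_t size)
{
    if (!demo_access_ok(addr, size))
        return false;
    demo_access_enabled = true;
    return true;
}

/* Close the access window; needed on every path that opened it. */
static void demo_user_access_end(void)
{
    demo_access_enabled = false;
}
```

A caller in this sketch would write `if (!demo_user_access_begin(addr, len)) return -1;`, perform its accesses, and then call `demo_user_access_end()` on both the success and the error path.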
-
Committed by Linus Torvalds
commit 0b2c8f8b6b0c7530e2866c95862546d0da2057b0 upstream.

When commit fddcd00a49e9 ("drm/i915: Force the slow path after a user-write error") unified the error handling for various user access problems, it didn't do the user_access_end() that is needed for the unsafe_put_user() case.

It's not a huge deal: a missed user_access_end() will only mean that SMAP protection isn't active afterwards, and for the error case we'll be returning to user mode soon enough anyway. But it's wrong, and adding the proper user_access_end() is trivial enough (and doing it for the other error cases where it isn't needed doesn't hurt).

I noticed it while doing the same prep-work for changing user_access_begin() that precipitated the access_ok() changes in commit 96d4f267e40f ("Remove 'type' argument from access_ok() function").

Fixes: fddcd00a49e9 ("drm/i915: Force the slow path after a user-write error")
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: stable@kernel.org # v4.20
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Chris Wilson
commit fddcd00a49e9122a3579247151e9cb3ce5a1a36e upstream.

If we fail to write the user relocation back when it is changed, force ourselves to take the slow relocation path where we can handle faults in the write path. There is still an element of dubiousness: having patched up the batch to use the correct offset, it no longer matches the presumed_offset in the relocation, so a second pass may miss any changes in layout.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180903083337.13134-3-chris@chris-wilson.co.uk
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Andrea Arcangeli
commit 3b9aadf7278d16d7bed4d5d808501065f70898d8 upstream.

get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called a get_user_pages() variant that would not wait for userfaults before failing, so it would hit a SIGBUS instead. Using get_user_pages_locked/unlocked instead lets get_mempolicy() wait for userfaults to resolve the fault and fill the hole before grabbing the node id of the page.

If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an address inside an area managed by uffd and there is no page at that address, the page allocation from within get_mempolicy() will fail because get_user_pages() does not allow for the page fault retry required for uffd; the user will get SIGBUS.

With this patch, the page fault will be resolved by the uffd and get_mempolicy() will continue normally.

Background:

Via code review, previously the syscall would have returned -EFAULT (vm_fault_to_errno); now it will block and wait for a userfault (if it's woken before the fault is resolved it'll still return -EFAULT).

This way get_mempolicy will give a chance to an "unaware" app to be compliant with userfaults.

The reason for this visible change is that becoming "userfault compliant" cannot regress anything: all other syscalls including read(2)/write(2) had to become "userfault compliant" a long time ago (that's one of the things userfaultfd can do that PROT_NONE and trapping segfaults can't).

So this is just one more syscall that becomes "userfault compliant" like all other major ones already were.

This has been happening on a virtio-bridge dpdk process which just called get_mempolicy on the guest space post live migration, but before the memory had a chance to be migrated to destination.

I didn't run an strace to be able to show the -EFAULT going away, but I have confirmation that the debug aid information below (only visible with CONFIG_DEBUG_VM=y) goes away with the patch:

    [20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0
    [20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1
    [20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017
    [20116.371466] Call Trace:
    [20116.371473] dump_stack+0x5c/0x80
    [20116.371476] handle_userfault.cold.37+0x1b/0x22
    [20116.371479] ? remove_wait_queue+0x20/0x60
    [20116.371481] ? poll_freewait+0x45/0xa0
    [20116.371483] ? do_sys_poll+0x31c/0x520
    [20116.371485] ? radix_tree_lookup_slot+0x1e/0x50
    [20116.371488] shmem_getpage_gfp+0xce7/0xe50
    [20116.371491] ? page_add_file_rmap+0x1a/0x2c0
    [20116.371493] shmem_fault+0x78/0x1e0
    [20116.371495] ? filemap_map_pages+0x3a1/0x450
    [20116.371498] __do_fault+0x1f/0xc0
    [20116.371500] __handle_mm_fault+0xe2e/0x12f0
    [20116.371502] handle_mm_fault+0xda/0x200
    [20116.371504] __get_user_pages+0x238/0x790
    [20116.371506] get_user_pages+0x3e/0x50
    [20116.371510] kernel_get_mempolicy+0x40b/0x700
    [20116.371512] ? vfs_write+0x170/0x1a0
    [20116.371515] __x64_sys_get_mempolicy+0x21/0x30
    [20116.371517] do_syscall_64+0x5b/0x160
    [20116.371520] entry_SYSCALL_64_after_hwframe+0x44/0xa9

The above harmless debug message (not a kernel crash, just a dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify and improve kernel spots that may have to become "userfaultfd compliant" like this one (without having to run an strace and search for syscall misbehavior). Spots like the above are closer to a kernel bug for the non-cooperative usages that Mike focuses on than for the dpdk qemu-cooperative usages that reproduced it, but it's still nicer to get this fixed for dpdk too.

The only part of the patch that gave me pause is the implementation issue of mpol_get, but it looks like it should work safely no matter what kind of mempolicy structure it is (the default static policy also starts at 1 so it'll go to 2 and back to 1 without crashing everything at 0).

[rppt@linux.vnet.ibm.com: changelog addition]
  http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx
Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Xingjun Liu
During the module initialization phase, add entropy to the entropy pool on every interrupt. This should speed up initialization of the random module.

Before optimization:
    [ 22.180236] random: crng init done
After optimization:
    [ 1.474832] random: crng init done

Signed-off-by: Xingjun Liu <xingjun.lxj@alibaba-inc.com>
Reviewed-by: Liu Jiang <gerry@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Jia Zhang <zhang.jia@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
-
Committed by Xingjun Liu
Add random entropy, supplied through a module parameter, as the initialization seed at kernel startup.

A guest OS running in a VM has less entropy available, so the random module initializes very slowly. If an application running in the guest requests a certain amount of random numbers during its own initialization phase, it will block. This patch allows the VMM to provide a certain amount of random seed when starting the guest OS, speeding up initialization of the guest's random module.

Before optimization:
    [ 22.180236] random: crng init done
After optimization:
    [ 1.553362] random: crng init done

Signed-off-by: Xingjun Liu <xingjun.lxj@alibaba-inc.com>
Reviewed-by: Liu Jiang <gerry@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Jia Zhang <zhang.jia@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
-
Committed by Borislav Petkov
commit 4ab526468344c11d2d1807ae95feb1f5305dc014 upstream.

This driver is Intel-only so loading on anything which is not Intel is pointless. Prevent it from doing so.

While at it, correct the "not supported" print statement to say CPU "model" which is what that test does.

Fixes: 076b862c7e44 (cpufreq: intel_pstate: Add reasons for failure and debug messages)
Suggested-by: Erwan Velu <e.velu@criteo.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
Committed by Erwan Velu
commit 076b862c7e4409d2dcacfda19f7eaf8d07ab9200 upstream.

The init code path has several exceptions where the driver can decide not to load. As CONFIG_X86_INTEL_PSTATE is generally set to Y, the return code is not reachable.

The initialization code also does not report why it chose to exit prematurely, so it is difficult for a user to determine, on a given platform, why the driver didn't load properly.

This patch reports to the user the reason and context of the driver's failure to load. That is a valuable hint when debugging a platform.

Signed-off-by: Erwan Velu <e.velu@criteo.com>
[ rjw: Subject & changelog, minor fixups ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
Committed by Srinivas Pandruvada
commit af3b7379e2d709f2d7c6966b8a6f5ec6bd134241 upstream.

Force HWP Request MAX = HWP Request MIN = HWP Capability MIN and EPP to 0xFF. In this way the performance limits on the offlined CPU will not influence performance limits on its sibling CPU, which is still online.

If the sibling CPU is calling for higher performance, it will impact the max core performance. Here core performance will follow the higher of the performance requests from each sibling.

Reported-and-tested-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
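The register value this commit describes can be worked through as plain bit arithmetic. The sketch below is not the driver code; `demo_hwp_offline_request()` and the `DEMO_*` macros are hypothetical names. It assumes the field layout as I understand it from the Intel SDM: lowest performance in bits 31:24 of the HWP capabilities value, and minimum, maximum, desired and EPP in bits 7:0, 15:8, 23:16 and 31:24 of the HWP request value. Clearing the desired field to 0 is also an assumption of this sketch.

```c
#include <stdint.h>

/* Assumed layout: lowest performance lives in bits 31:24 of the capabilities value. */
#define DEMO_HWP_LOWEST_PERF(cap)  (((cap) >> 24) & 0xffULL)
/* EPP value named in the commit message: strongest power-saving preference. */
#define DEMO_HWP_EPP_POWERSAVE     0xffULL

/*
 * Build the request value for an offlined CPU:
 * MAX = MIN = capability MIN, EPP = 0xFF, desired = 0.
 * The upper 32 bits of the previous request are preserved.
 */
static uint64_t demo_hwp_offline_request(uint64_t hwp_cap, uint64_t hwp_req)
{
    uint64_t min_perf = DEMO_HWP_LOWEST_PERF(hwp_cap);
    uint64_t value = hwp_req & ~0xffffffffULL;   /* drop min/max/desired/EPP */

    value |= min_perf;                           /* minimum, bits 7:0   */
    value |= min_perf << 8;                      /* maximum, bits 15:8  */
    value |= DEMO_HWP_EPP_POWERSAVE << 24;       /* EPP,     bits 31:24 */
    return value;
}
```

For example, with a capabilities value whose lowest-performance field is 0x0A, the low 32 bits of the result are 0xFF000A0A under the layout assumed here.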
-
Committed by Mike Snitzer
commit 075c18c3e124a1511ebc10a89f1858c8a77dcb01 upstream.

Provides useful context about bio splits in blktrace.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Mike Snitzer
commit 6548c7c538e5658cbce686c2dd1a9b4f5398bf34 upstream.

Otherwise targets that don't support/expect IO splitting could resubmit bios using code paths with unnecessary IO splitting complexity.

Depends-on: 24113d487843 ("dm: avoid indirect call in __dm_make_request")
Fixes: 978e51ba ("dm: optimize bio-based NVMe IO submission")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Mikulas Patocka
commit 24113d4878439baf1f23c1a33dfcc340fba66e97 upstream.

Indirect calls are inefficient because of the retpolines that are used for the Spectre workaround. This patch replaces an indirect call with a condition (that can be predicted by the branch predictor).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
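The transformation this commit describes can be sketched generically. The names below (`demo_md_type`, `demo_process_bio()`, `demo_split_and_process_bio()`, and the two `demo_make_request_*()` variants) are hypothetical and the handlers return placeholder codes; this is not the DM core code. It contrasts dispatch through a function pointer with dispatch through a predictable branch and direct calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device types that select the processing path. */
enum demo_md_type { DEMO_TYPE_BIO_BASED, DEMO_TYPE_NVME_BIO_BASED };

/* Placeholder handlers; the return codes only identify which one ran. */
static int demo_process_bio(void)           { return 1; }
static int demo_split_and_process_bio(void) { return 2; }

typedef int (*demo_process_fn)(void);

/* Before: the handler is passed in and reached through an indirect call. */
static int demo_make_request_indirect(demo_process_fn fn)
{
    assert(fn != NULL);
    return fn();   /* compiled as a retpoline when that mitigation is enabled */
}

/* After: a condition picks between two direct calls. */
static int demo_make_request_direct(enum demo_md_type type)
{
    if (type == DEMO_TYPE_NVME_BIO_BASED)
        return demo_process_bio();
    return demo_split_and_process_bio();
}
```

Both variants reach the same handlers; the second trades the flexibility of a function pointer for a branch that the predictor can learn.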
-
Committed by Mike Snitzer
commit a1e1cb72d96491277ede8d257ce6b48a381dd336 upstream.

[Joseph: cherry-pick part_stat_get() from commit 1226b8dd0e91 ("block: switch to per-cpu in-flight counters") since we don't want to pull in the whole patch series.]

The risk of redundant IO accounting was not taken into consideration when commit 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk") introduced IO splitting in terms of recursion via generic_make_request().

Fix this by subtracting the split bio's payload from the IO stats that were already accounted for by start_io_acct() upon dm_make_request() entry. This repeated oscillation of the IO accounting, up then down, isn't ideal, but refactoring DM core's IO splitting to pre-split bios _before_ they are accounted turned out to be an excessive amount of change that will need a full development cycle to refine and verify.

Before this fix:

/dev/mapper/stripe_dev is a 4-way stripe using a 32k chunksize, so bios are split on 32k boundaries.

    # fio --name=16M --filename=/dev/mapper/stripe_dev --rw=write --bs=64k --size=16M \
          --iodepth=1 --ioengine=libaio --direct=1 --refill_buffers

with debugging added:

    [103898.310264] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=0 len=128
    [103898.318704] device-mapper: core: __split_and_process_bio: recursing for following split bio:
    [103898.329136] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=64 len=64
    ...

16M written yet 136M (278528 * 512b) accounted:

    # cat /sys/block/dm-2/stat | awk '{ print $7 }'
    278528

After this fix:

16M written and 16M (32768 * 512b) accounted:

    # cat /sys/block/dm-2/stat | awk '{ print $7 }'
    32768

Fixes: 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk")
Cc: stable@vger.kernel.org # 4.16+
Reported-by: Bryan Gurney <bgurney@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
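The sector counts quoted in this commit can be checked with a few lines of arithmetic. The helper below is a hypothetical convenience (`demo_sectors_to_mib()` and `DEMO_SECTOR_SIZE` are not kernel names); it only converts a count of 512-byte sectors, as reported in the stat file, into MiB.

```c
#include <stdint.h>

/* The stat file counts sectors in 512-byte units. */
#define DEMO_SECTOR_SIZE 512ULL

/* Convert a sector count to whole MiB (integer division). */
static uint64_t demo_sectors_to_mib(uint64_t sectors)
{
    return sectors * DEMO_SECTOR_SIZE / (1024ULL * 1024ULL);
}
```

Applied to the two values above, 278528 sectors come to 136 MiB and 32768 sectors come to 16 MiB, so the pre-fix accounting overstated the 16M write by a factor of 8.5.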
-
Committed by Mike Snitzer
commit 57c36519e4b949f89381053f7283f5d605595b42 upstream.

DM's clone_bio() now benefits from using bio_trim() by fixing the fact that clone_bio() wasn't clearing BIO_SEG_VALID like bio_trim() does; which triggers blk_recount_segments() via bio_phys_segments().

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Xiaoguang Wang
commit a297b2fcee461e40df763e179cbbfba5a9e572d2 upstream.

In mpage_add_bh_to_extent(), when the accumulated extent length is greater than MAX_WRITEPAGES_EXTENT_LEN or the buffer head's b_state does not match, we stop searching for unmapped areas in this page. Note, however, that the page is still locked and will only be unlocked in mpage_release_unused_pages() after ext4_io_submit(). If the IO is also throttled by blk-throttle or a similar IO QoS mechanism, we hold the page locked for a while, which is unnecessary.

I think the best fix is to refactor mpage_add_bh_to_extent() so that it returns a hint on whether to unlock the page, but given that we will improve dioread_nolock later, that can be done then. For now the simple fix is to call mpage_release_unused_pages() before ext4_io_submit().

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
-
Committed by Shanpei Chen
The autogroup feature is used to improve interactivity for desktop applications. Since our kernel runs on servers, disable it by default, as RHEL8 does, to avoid unnecessary computation.

For more details, see https://lwn.net/Articles/416641/

Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Dan Schatzberg
commit df5ba5be7425e1df296d40c5f37a39d98ec666a2 upstream.

Pressure metrics are already recorded and exposed in procfs for the entire system, but any tool which monitors cgroup pressure has to special case the root cgroup to read from procfs. This patch exposes the already recorded pressure metrics on the root cgroup.

Link: http://lkml.kernel.org/r/20190510174938.3361741-1-dschatzberg@fb.com
Signed-off-by: Dan Schatzberg <dschatzberg@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
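A monitoring tool that no longer needs the procfs special case could read the same line format from either location. The parser below is a minimal sketch, assuming the usual PSI line shape (`some avg10=... avg60=... avg300=... total=...`, with `total` in microseconds); `demo_psi_line` and `demo_parse_psi_line()` are hypothetical names, and the file paths a caller would open (for example `/proc/pressure/memory`, or `memory.pressure` in a cgroup2 directory) depend on the system's configuration.

```c
#include <stdio.h>

/* One parsed pressure line; "kind" is "some" or "full". */
struct demo_psi_line {
    char kind[16];
    double avg10;                 /* percent of time stalled, 10s window  */
    double avg60;                 /* percent of time stalled, 60s window  */
    double avg300;                /* percent of time stalled, 300s window */
    unsigned long long total_us;  /* cumulative stall time, microseconds  */
};

/* Parse a single line of pressure output. Returns 0 on success, -1 on a malformed line. */
static int demo_parse_psi_line(const char *line, struct demo_psi_line *out)
{
    int n = sscanf(line, "%15s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   out->kind, &out->avg10, &out->avg60, &out->avg300,
                   &out->total_us);

    return n == 5 ? 0 : -1;
}
```

A caller would read the file line by line with `fgets()` and pass each line to `demo_parse_psi_line()`, using the same code path for the root cgroup and for child cgroups.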
-