1. 07 Nov, 2021 40 commits
    • C
      fs: explicitly unregister per-superblock BDIs · 0b3ea092
      Christoph Hellwig committed
      Add a new SB_I_ flag to mark superblocks that have an ephemeral bdi
      associated with them, and unregister it when the superblock is shut
      down.
      
      Link: https://lkml.kernel.org/r/20211021124441.668816-4-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Vignesh Raghavendra <vigneshr@ti.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      mtd: call bdi_unregister explicitly · 9718c59c
      Christoph Hellwig committed
      Call bdi_unregister explicitly instead of relying on the automatic
      unregistration.
      
      Link: https://lkml.kernel.org/r/20211021124441.668816-3-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Vignesh Raghavendra <vigneshr@ti.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      mm: export bdi_unregister · c6fd3ac0
      Christoph Hellwig committed
      Patch series "simplify bdi unregistation".
      
      This series simplifies the BDI code to get rid of the magic
      auto-unregister feature that hid a recent block layer refcounting bug.
      
      This patch (of 5):
      
      To wind down the magic auto-unregister semantics we'll need to push this
      into modular code.
      
      Link: https://lkml.kernel.org/r/20211021124441.668816-1-hch@lst.de
      Link: https://lkml.kernel.org/r/20211021124441.668816-2-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Vignesh Raghavendra <vigneshr@ti.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm: stop filemap_read() from grabbing a superfluous page · 8c8387ee
      David Howells committed
      Under some circumstances, filemap_read() will allocate sufficient pages
      to read to the end of the file, call readahead/readpages on them and
      copy the data over - and then it will allocate another page at the EOF
      and call readpage on that and then ignore it.  This is unnecessary and a
      waste of time and resources.
      
      filemap_read() *does* check for this, but only after it has already done
      the allocation and I/O.  Fix this by checking before calling
      filemap_get_pages() also.
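
      As a rough illustration of the ordering described above, here is a toy
      user-space model; it is not the kernel's filemap_read().  The names
      demo_read(), demo_get_page() and DEMO_PAGE_SIZE are hypothetical, and
      the "fetch" is only a counter standing in for allocation plus I/O.  It
      shows the end-of-file test placed ahead of the page fetch:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL

/* Hypothetical counter of "page fetches" (stands in for allocation + I/O). */
static unsigned long demo_fetches;

static void demo_get_page(void)
{
	demo_fetches++;
}

/*
 * Toy read loop: reads up to len bytes starting at pos from a file of
 * file_size bytes, one page at a time.  The EOF test comes before the
 * page fetch, so no page is fetched at or beyond the end of the file.
 * Returns the number of bytes "read".
 */
size_t demo_read(size_t pos, size_t len, size_t file_size)
{
	size_t done = 0;

	while (done < len) {
		size_t chunk;

		if (pos >= file_size)	/* check first, fetch second */
			break;
		demo_get_page();
		chunk = DEMO_PAGE_SIZE - (pos % DEMO_PAGE_SIZE);
		if (chunk > file_size - pos)
			chunk = file_size - pos;
		if (chunk > len - done)
			chunk = len - done;
		pos += chunk;
		done += chunk;
	}
	return done;
}
```

      In this model, asking for three pages from a two-page file performs two
      fetches; moving the test after demo_get_page() would issue a third
      fetch at EOF whose result is then discarded.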
      
      Link: https://lkml.kernel.org/r/163472463105.3126792.7056099385135786492.stgit@warthog.procyon.org.uk
      Link: https://lore.kernel.org/r/160588481358.3465195.16552616179674485179.stgit@warthog.procyon.org.uk/
      Link: https://lore.kernel.org/r/163456863216.2614702.6384850026368833133.stgit@warthog.procyon.org.uk/
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Y
      mm/page_ext.c: fix a comment · d1fea155
      Yinan Zhang committed
      I have noticed that the previous macro is #ifndef CONFIG_SPARSEMEM.  I
      think the comment of #else should be CONFIG_SPARSEMEM.
      
      Link: https://lkml.kernel.org/r/20211008140312.6492-1-zhangyinan2019@email.szu.edu.cn
      Signed-off-by: Yinan Zhang <zhangyinan2019@email.szu.edu.cn>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      percpu: add __alloc_size attributes for better bounds checking · 17197dd4
      Kees Cook committed
      As already done in GrapheneOS, add the __alloc_size attribute for
      appropriate percpu allocator interfaces, to provide additional hinting
      for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
      compiler optimizations.
      
      Note that due to the implementation of the percpu API, this is unlikely
      to ever actually provide compile-time checking beyond very simple
      non-SMP builds.  But, since they are technically allocators, mark them
      as such.
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-9-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Co-developed-by: Daniel Micay <danielmicay@gmail.com>
      Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      mm/page_alloc: add __alloc_size attributes for better bounds checking · abd58f38
      Kees Cook committed
      As already done in GrapheneOS, add the __alloc_size attribute for
      appropriate page allocator interfaces, to provide additional hinting for
      better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
      compiler optimizations.
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-8-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Co-developed-by: Daniel Micay <danielmicay@gmail.com>
      Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      mm/vmalloc: add __alloc_size attributes for better bounds checking · 894f24bb
      Kees Cook committed
      As already done in GrapheneOS, add the __alloc_size attribute for
      appropriate vmalloc allocator interfaces, to provide additional hinting
      for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
      compiler optimizations.
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-7-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Co-developed-by: Daniel Micay <danielmicay@gmail.com>
      Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      mm/kvmalloc: add __alloc_size attributes for better bounds checking · 56bcf40f
      Kees Cook committed
      As already done in GrapheneOS, add the __alloc_size attribute for
      regular kvmalloc interfaces, to provide additional hinting for better
      bounds checking, assisting CONFIG_FORTIFY_SOURCE and other compiler
      optimizations.
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-6-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Co-developed-by: Daniel Micay <danielmicay@gmail.com>
      Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      slab: add __alloc_size attributes for better bounds checking · c37495d6
      Kees Cook committed
      As already done in GrapheneOS, add the __alloc_size attribute for
      regular kmalloc interfaces, to provide additional hinting for better
      bounds checking, assisting CONFIG_FORTIFY_SOURCE and other compiler
      optimizations.
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-5-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Co-developed-by: Daniel Micay <danielmicay@gmail.com>
      Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      slab: clean up function prototypes · 72d67229
      Kees Cook committed
      Based on feedback from Joe Perches and Linus Torvalds, regularize the
      slab function prototypes before making attribute changes.
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-4-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      Compiler Attributes: add __alloc_size() for better bounds checking · 86cffecd
      Kees Cook committed
      GCC and Clang can use the "alloc_size" attribute to better inform the
      results of __builtin_object_size() (for compile-time constant values).
      Clang can additionally use alloc_size to inform the results of
      __builtin_dynamic_object_size() (for run-time values).
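
      To make the effect concrete, the following is a minimal user-space
      sketch, assuming GCC or Clang.  demo_alloc() and demo_object_size()
      are hypothetical names, not kernel interfaces; the attribute tells the
      compiler that argument 1 is the size of the returned object, which
      __builtin_object_size() can then report for compile-time constants:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space allocator; argument 1 is the allocation size. */
__attribute__((malloc, alloc_size(1)))
void *demo_alloc(size_t size)
{
	return malloc(size);
}

/*
 * Returns what the compiler believes the size of the object to be.
 * With optimization enabled this is typically 16; without it the
 * builtin may fall back to (size_t)-1, which means "unknown".
 */
size_t demo_object_size(void)
{
	char *p = demo_alloc(16);
	size_t known = __builtin_object_size(p, 0);

	free(p);
	return known;
}
```

      The result depends on the optimization level, so callers should treat
      (size_t)-1 as "no information" rather than as an error.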
      
      Because GCC sees the frequent use of struct_size() as an allocator size
      argument, and notices it can return SIZE_MAX (the overflow indication),
      it complains about these call sites overflowing (since SIZE_MAX is
      greater than the default -Walloc-size-larger-than=PTRDIFF_MAX).  This
      isn't helpful since we already know a SIZE_MAX will be caught at
      run-time (this was an intentional design).  To deal with this, we must
      disable this check as it is both a false positive and redundant.  (Clang
      does not have this warning option.)
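
      The saturating behaviour that triggers the warning can be sketched in
      user space as follows.  This is an illustration under stated
      assumptions, not the kernel's struct_size() implementation:
      demo_struct_size() and demo_alloc_would_fail() are hypothetical names,
      and the overflow builtins are the GCC/Clang ones:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct_size(): header bytes plus count
 * elements of elem bytes, saturating to SIZE_MAX if the arithmetic
 * would overflow.
 */
size_t demo_struct_size(size_t header, size_t elem, size_t count)
{
	size_t bytes;

	if (__builtin_mul_overflow(elem, count, &bytes) ||
	    __builtin_add_overflow(bytes, header, &bytes))
		return SIZE_MAX;
	return bytes;
}

/* Hypothetical run-time check: sizes above PTRDIFF_MAX are always refused. */
int demo_alloc_would_fail(size_t bytes)
{
	return bytes > (size_t)PTRDIFF_MAX;
}
```

      Because SIZE_MAX is a possible constant result, a compiler that sees
      it flow into an alloc_size argument can flag the call even though the
      run-time check would reject it anyway.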
      
      Unfortunately, just checking the -Wno-alloc-size-larger-than is not
      sufficient to make the __alloc_size attribute behave correctly under
      older GCC versions.  The attribute itself must be disabled in those
      situations too, as there appears to be no way to reliably silence the
      SIZE_MAX constant expression cases for GCC versions less than 9.1:
      
         In file included from ./include/linux/resource_ext.h:11,
                          from ./include/linux/pci.h:40,
                          from drivers/net/ethernet/intel/ixgbe/ixgbe.h:9,
                          from drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c:4:
         In function 'kmalloc_node',
             inlined from 'ixgbe_alloc_q_vector' at ./include/linux/slab.h:743:9:
         ./include/linux/slab.h:618:9: error: argument 1 value '18446744073709551615' exceeds maximum object size 9223372036854775807 [-Werror=alloc-size-larger-than=]
           return __kmalloc_node(size, flags, node);
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         ./include/linux/slab.h: In function 'ixgbe_alloc_q_vector':
         ./include/linux/slab.h:455:7: note: in a call to allocation function '__kmalloc_node' declared here
          void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_slab_alignment __malloc;
                ^~~~~~~~~~~~~~
      
      Specifically:
       '-Wno-alloc-size-larger-than' is not correctly handled by GCC < 9.1
          https://godbolt.org/z/hqsfG7q84 (doesn't disable)
          https://godbolt.org/z/P9jdrPTYh (doesn't admit to not knowing about option)
          https://godbolt.org/z/465TPMWKb (only warns when other warnings appear)
      
       '-Walloc-size-larger-than=18446744073709551615' is not handled by GCC < 8.2
          https://godbolt.org/z/73hh1EPxz (ignores numeric value)
      
      Since anything marked with __alloc_size would also qualify for marking
      with __malloc, just include __malloc along with it to avoid redundant
      markings.  (Suggested by Linus Torvalds.)
      
      Finally, make sure checkpatch.pl doesn't get confused about finding the
      __alloc_size attribute on functions.  (Thanks to Joe Perches.)
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-3-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Tested-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      rapidio: avoid bogus __alloc_size warning · 75da0eba
      Kees Cook committed
      Patch series "Add __alloc_size()", v3.
      
      GCC and Clang both use the "alloc_size" attribute to assist with bounds
      checking around the use of allocation functions.  Add the attribute,
      adjust the Makefile to silence needless warnings, and add the hints to
      the allocators where possible.  These changes have been in use for a
      while now in GrapheneOS.
      
      This patch (of 8):
      
      After adding __alloc_size attributes to the allocators, GCC 9.3 (but not
      later) may incorrectly evaluate the arguments to check_copy_size(),
      getting seemingly confused by the size being returned from array_size().
      Instead, perform the calculation once, which both makes the code more
      readable and avoids the bug in GCC.
      
         In file included from arch/x86/include/asm/preempt.h:7,
                          from include/linux/preempt.h:78,
                          from include/linux/spinlock.h:55,
                          from include/linux/mm_types.h:9,
                          from include/linux/buildid.h:5,
                          from include/linux/module.h:14,
                          from drivers/rapidio/devices/rio_mport_cdev.c:13:
         In function 'check_copy_size',
             inlined from 'copy_from_user' at include/linux/uaccess.h:191:6,
             inlined from 'rio_mport_transfer_ioctl' at drivers/rapidio/devices/rio_mport_cdev.c:983:6:
         include/linux/thread_info.h:213:4: error: call to '__bad_copy_to' declared with attribute error: copy destination size is too small
           213 |    __bad_copy_to();
               |    ^~~~~~~~~~~~~~~
      
      But the allocation size and the copy size are identical:
      
      	transfer = vmalloc(array_size(sizeof(*transfer), transaction.count));
      	if (!transfer)
      		return -ENOMEM;
      
      	if (unlikely(copy_from_user(transfer,
      				    (void __user *)(uintptr_t)transaction.block,
      				    array_size(sizeof(*transfer), transaction.count)))) {
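
      The "perform the calculation once" approach can be sketched in user
      space as below.  This is not the rapidio patch itself: demo_copy_in()
      and demo_array_size() are hypothetical names, with malloc() and
      memcpy() standing in for vmalloc() and copy_from_user():

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical saturating multiply, standing in for array_size(). */
static size_t demo_array_size(size_t a, size_t b)
{
	size_t bytes;

	return __builtin_mul_overflow(a, b, &bytes) ? SIZE_MAX : bytes;
}

/*
 * Copy count elements of elem bytes each from src into a fresh buffer.
 * The byte count is computed once and reused for both the allocation
 * and the copy, so the two can never disagree.
 */
void *demo_copy_in(const void *src, size_t elem, size_t count)
{
	size_t bytes = demo_array_size(elem, count);
	void *dst;

	if (bytes == SIZE_MAX)	/* overflow: refuse rather than truncate */
		return NULL;
	dst = malloc(bytes);
	if (!dst)
		return NULL;
	memcpy(dst, src, bytes);
	return dst;
}
```

      Keeping the size in one local variable also gives the compiler a
      single value to reason about for both calls.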
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-1-keescook@chromium.org
      Link: https://lkml.kernel.org/r/20210930222704.2631604-2-keescook@chromium.org
      Link: https://lore.kernel.org/linux-mm/202109091134.FHnRmRxu-lkp@intel.com/
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      kasan: test: bypass __alloc_size checks · d73dad4e
      Kees Cook committed
      Intentional overflows, as performed by the KASAN tests, are detected at
      compile time[1] (instead of only at run-time) with the addition of
      __alloc_size.  Fix this by forcing the compiler into not being able to
      trust the size used following the kmalloc()s.
      
      [1] https://lore.kernel.org/lkml/20211005184717.65c6d8eb39350395e387b71f@linux-foundation.org
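
      One way to keep the compiler from trusting a pointer's known size is
      an empty asm statement that claims to modify the variable.  The sketch
      below assumes GCC or Clang; DEMO_HIDE_VAR is a hypothetical name that
      mirrors the idea behind the kernel's OPTIMIZER_HIDE_VAR(), and whether
      the patch uses exactly this mechanism is an assumption here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Empty asm that claims to rewrite var.  The value is unchanged at run
 * time, but the compiler can no longer assume anything about it.
 */
#define DEMO_HIDE_VAR(var) __asm__ ("" : "=r" (var) : "0" (var))

/* Returns the object size the compiler can still prove after hiding. */
size_t demo_hidden_object_size(void)
{
	char *p = malloc(16);
	size_t known;

	DEMO_HIDE_VAR(p);
	known = __builtin_object_size(p, 0);
	free(p);
	return known;
}
```

      After the pointer is hidden, the builtin reports the size as unknown,
      so an intentional out-of-bounds access through it is left for the
      run-time checker to catch.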
      
      Link: https://lkml.kernel.org/r/20211006181544.1670992-1-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • G
      mm: debug_vm_pgtable: don't use __P000 directly · 8772716f
      Guo Ren committed
      The __Pxxx/__Sxxx macros are only meant for initializing
      protection_map[].  All other uses in Linux should go through the
      protection_map[] array, because many architectures re-initialize it,
      e.g. x86-mem_encrypt, m68k-motorola, mips, arm, sparc.
      
      Using __P000 directly is therefore not rigorous.
      
      Link: https://lkml.kernel.org/r/20210924060821.1138281-1-guoren@kernel.org
      Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      mm/smaps: simplify shmem handling of pte holes · 23010032
      Peter Xu committed
      Firstly, the check_shmem_swap variable is not actually necessary,
      because it is always set together with the pte_hole hook; checking
      either would work.
      
      Meanwhile, the check within smaps_pte_entry is not easy to follow.
      E.g., the pte_none() check is not needed, as
      "!pte_present && !is_swap_pte" is equivalent.  While at it, use the
      pte_hole() helper rather than duplicating the page cache lookup.
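
      The equivalence can be checked with a small three-state model.  The
      sketch below is a toy, not kernel code: the demo_* names and the enum
      are hypothetical, and is_swap_pte() is modelled as "neither none nor
      present":

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical three-state model of a page table entry. */
enum demo_pte { DEMO_PTE_NONE, DEMO_PTE_PRESENT, DEMO_PTE_SWAP };

static bool demo_pte_none(enum demo_pte pte)
{
	return pte == DEMO_PTE_NONE;
}

static bool demo_pte_present(enum demo_pte pte)
{
	return pte == DEMO_PTE_PRESENT;
}

/* Modelled as: the entry is neither empty nor present. */
static bool demo_is_swap_pte(enum demo_pte pte)
{
	return !demo_pte_none(pte) && !demo_pte_present(pte);
}

/* True if the combined condition agrees with pte_none() for this state. */
bool demo_conditions_agree(enum demo_pte pte)
{
	bool combined = !demo_pte_present(pte) && !demo_is_swap_pte(pte);

	return combined == demo_pte_none(pte);
}
```

      In this model the two conditions agree for all three states, which is
      why a separate pte_none() test adds nothing.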
      
      Still keep the CONFIG_SHMEM part so the code can be optimized to nop for
      !SHMEM.
      
      There will be a very slight functional change in smaps_pte_entry(), that
      for !SHMEM we'll return early for pte_none (before checking page==NULL),
      but that's even nicer.
      
      Link: https://lkml.kernel.org/r/20210917164756.8586-4-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      mm/smaps: use vma->vm_pgoff directly when counting partial swap · 02399c88
      Peter Xu committed
      As it's trying to cover the whole vma anyways, use direct vm_pgoff value
      and vma_pages() rather than linear_page_index.
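
      Why the two forms cover the same file range can be seen from a
      stripped-down model.  struct demo_vma and the demo_* helpers below are
      hypothetical stand-ins (assuming 4 KiB pages and no huge pages), not
      the kernel definitions:

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12

/* Hypothetical, stripped-down stand-in for struct vm_area_struct. */
struct demo_vma {
	unsigned long vm_start;	/* first byte of the mapping */
	unsigned long vm_end;	/* first byte after the mapping */
	unsigned long vm_pgoff;	/* file offset of vm_start, in pages */
};

/* File page index backing addr, in the style of linear_page_index(). */
unsigned long demo_linear_page_index(const struct demo_vma *vma,
				     unsigned long addr)
{
	return ((addr - vma->vm_start) >> DEMO_PAGE_SHIFT) + vma->vm_pgoff;
}

/* Number of pages the mapping spans, in the style of vma_pages(). */
unsigned long demo_vma_pages(const struct demo_vma *vma)
{
	return (vma->vm_end - vma->vm_start) >> DEMO_PAGE_SHIFT;
}
```

      For a whole-vma walk, the index at vm_start is just vm_pgoff and the
      index at vm_end is vm_pgoff plus the page count, so the address
      arithmetic can be skipped.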
      
      Link: https://lkml.kernel.org/r/20210917164756.8586-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      mm/smaps: fix shmem pte hole swap calculation · 10c848c8
      Peter Xu committed
      Patch series "mm/smaps: Fixes and optimizations on shmem swap handling".
      
      This patch (of 3):
      
      The shmem swap calculation on privately writable mappings uses the
      wrong parameters, as spotted by Vlastimil.  Fix them.  This was
      introduced in commit 48131e03 ("mm, proc: reduce cost of
      /proc/pid/smaps for unpopulated shmem mappings"), when shmem_swap_usage
      was reworked to shmem_partial_swap_usage.
      
      Test program:
      
        #define _GNU_SOURCE
        #include <assert.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define SIZE_2M (2UL << 20)

        int main(void)
        {
            char *buffer, *p;
            unsigned long i;
            int fd;

            fd = memfd_create("test", 0);
            assert(fd >= 0);

            /* isize==2M*3, fill in pages, swap them out */
            ftruncate(fd, SIZE_2M * 3);
            buffer = mmap(NULL, SIZE_2M * 3, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            assert(buffer != MAP_FAILED);
            for (i = 0, p = buffer; i < SIZE_2M * 3 / 4096; i++) {
                *p = 1;
                p += 4096;
            }
            madvise(buffer, SIZE_2M * 3, MADV_PAGEOUT);
            munmap(buffer, SIZE_2M * 3);

            /*
             * Remap with a private+writable mapping on part of the inode (<= 2M*3),
             * while the size must also be >= 2M*2 to make sure there's a none pmd so
             * smaps_pte_hole will be triggered.
             */
            buffer = mmap(NULL, SIZE_2M * 2, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
            printf("pid=%d, buffer=%p\n", getpid(), buffer);

            /* Check /proc/$PID/smaps_rollup, should see 4MB swap */
            sleep(1000000);
            return 0;
        }
      
      Before the patch, smaps_rollup shows <4MB swap, and the number varies
      with the alignment of the buffer that mmap() allocated.  After this
      patch, it shows 4MB.
      
      Link: https://lkml.kernel.org/r/20210917164756.8586-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20210917164756.8586-2-peterx@redhat.com
      Fixes: 48131e03 ("mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      kasan: test: add memcpy test that avoids out-of-bounds write · 758cabae
      Peter Collingbourne committed
      With HW tag-based KASAN, error checks are performed implicitly by the
      load and store instructions in the memcpy implementation.  A failed
      check results in tag checks being disabled and execution will keep
      going.  As a result, under HW tag-based KASAN, prior to commit
      1b0668be ("kasan: test: disable kmalloc_memmove_invalid_size for
      HW_TAGS"), this memcpy would end up corrupting memory until it hits an
      inaccessible page and causes a kernel panic.
      
      This is a pre-existing issue that was revealed by commit 28513304
      ("arm64: Import latest memcpy()/memmove() implementation") which changed
      the memcpy implementation from using signed comparisons (incorrectly,
      resulting in the memcpy being terminated early for negative sizes) to
      using unsigned comparisons.
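
      The difference between the two comparison styles can be illustrated
      with a toy model.  The functions below are hypothetical and do not
      reflect the arm64 assembly; they only show how the same "negative"
      size is seen by a signed versus an unsigned length check:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bytes a copy loop would touch if it treats the size as signed:
 * a "negative" size fails the check and the copy stops at once.
 */
size_t demo_bytes_signed(size_t size)
{
	return (ptrdiff_t)size > 0 ? size : 0;
}

/*
 * Bytes a copy loop would touch if it treats the size as unsigned:
 * the same bit pattern is an enormous positive length.
 */
size_t demo_bytes_unsigned(size_t size)
{
	return size;
}
```

      A size of -2 converted to size_t is SIZE_MAX - 1, which is why the
      unsigned variant keeps writing until it faults.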
      
      It is unclear how this could be handled by memcpy itself in a reasonable
      way.  One possibility would be to add an exception handler that would
      force memcpy to return if a tag check fault is detected -- this would
      make the behavior roughly similar to generic and SW tag-based KASAN.
      However, this wouldn't solve the problem for asynchronous mode and also
      makes memcpy behavior inconsistent with manually copying data.
      
      This test was added as a part of a series that taught KASAN to detect
      negative sizes in memory operations, see commit 8cceeff4 ("kasan:
      detect negative size in memory operation function").  Therefore we
      should keep testing for negative sizes with generic and SW tag-based
      KASAN.  But there is some value in testing small memcpy overflows, so
      let's add another test with memcpy that does not destabilize the kernel
      by performing out-of-bounds writes, and run it in all modes.
      
      Link: https://linux-review.googlesource.com/id/I048d1e6a9aff766c4a53f989fb0c83de68923882
      Link: https://lkml.kernel.org/r/20210910211356.3603758-1-pcc@google.com
      Signed-off-by: Peter Collingbourne <pcc@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Acked-by: Marco Elver <elver@google.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      758cabae
    • M
      kasan: fix tag for large allocations when using CONFIG_SLAB · 820a1e6e
      Matthew Wilcox (Oracle) committed
      If an object is allocated on a tail page of a multi-page slab, kasan
      will get the wrong tag because page->s_mem is NULL for tail pages.  I'm
      not quite sure what the user-visible effect of this might be.
      
      Link: https://lkml.kernel.org/r/20211001024105.3217339-1-willy@infradead.org
      Fixes: 7f94ffbc ("kasan: add hooks implementation for tag-based mode")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Marco Elver <elver@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      820a1e6e
    • M
      workqueue, kasan: avoid alloc_pages() when recording stack · f70da745
      Marco Elver committed
      Shuah Khan reported:
      
       | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
       | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
       | it tries to allocate memory attempting to acquire spinlock in page
       | allocation code while holding workqueue pool raw_spinlock.
       |
       | There are several instances of this problem when block layer tries
       | to __queue_work(). Call trace from one of these instances is below:
       |
       |     kblockd_mod_delayed_work_on()
       |       mod_delayed_work_on()
       |         __queue_delayed_work()
       |           __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
       |             insert_work()
       |               kasan_record_aux_stack()
       |                 kasan_save_stack()
       |                   stack_depot_save()
       |                     alloc_pages()
       |                       __alloc_pages()
       |                         get_page_from_freelist()
       |                           rm_queue()
       |                             rm_queue_pcplist()
       |                               local_lock_irqsave(&pagesets.lock, flags);
       |                               [ BUG: Invalid wait context triggered ]
      
      The default kasan_record_aux_stack() calls stack_depot_save() with
      GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...).
      In general, however, it is not even possible to use either GFP_ATOMIC
      or GFP_NOWAIT in certain non-preemptive contexts, including
      raw_spin_locks (see gfp.h and commit ab00db21).
      
      Fix it by instructing stackdepot to not expand stack storage via
      alloc_pages() in case it runs out by using
      kasan_record_aux_stack_noalloc().
      
      While there is an increased risk of failing to insert the stack trace,
      this is typically unlikely, especially if the same insertion had already
      succeeded previously (stack depot hit).
      
      For frequent calls from the same location, it therefore becomes
      extremely unlikely that kasan_record_aux_stack_noalloc() fails.
      
      Link: https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
      Link: https://lkml.kernel.org/r/20210913112609.2651084-7-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reported-by: Shuah Khan <skhan@linuxfoundation.org>
      Tested-by: Shuah Khan <skhan@linuxfoundation.org>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Walter Wu <walter-zh.wu@mediatek.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f70da745
    • M
      kasan: generic: introduce kasan_record_aux_stack_noalloc() · 7cb3007c
      Marco Elver committed
      Introduce a variant of kasan_record_aux_stack() that does not do any
      memory allocation through stackdepot.  This will permit using it in
      contexts that cannot allocate any memory.
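
      The shape of such a variant can be sketched in user-space C as two
      thin wrappers around one shared helper.  This is only an illustration:
      every toy_* name is hypothetical, and the helper merely records the
      flag instead of saving a stack trace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the stack depot: remembers whether allocation was allowed. */
static bool toy_last_can_alloc;

static void toy_save_stack(const void *obj, bool can_alloc)
{
	assert(obj != NULL);
	toy_last_can_alloc = can_alloc;
}

/* Default variant: the depot may allocate if it runs out of space. */
void toy_record_aux_stack(const void *obj)
{
	toy_save_stack(obj, true);
}

/* Variant for contexts that must not allocate any memory. */
void toy_record_aux_stack_noalloc(const void *obj)
{
	toy_save_stack(obj, false);
}

/* Test hook: reports the flag seen by the most recent call. */
bool toy_last_call_could_alloc(void)
{
	return toy_last_can_alloc;
}
```

      Callers pick the wrapper that matches their context; the only thing
      that differs between the two paths is the boolean handed down.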
      
      Link: https://lkml.kernel.org/r/20210913112609.2651084-6-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Tested-by: Shuah Khan <skhan@linuxfoundation.org>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Walter Wu <walter-zh.wu@mediatek.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cb3007c
    • M
      kasan: common: provide can_alloc in kasan_save_stack() · 7594b347
      Marco Elver committed
      Add another argument, can_alloc, to kasan_save_stack() which is passed
      as-is to __stack_depot_save().
      
      No functional change intended.
      
      Link: https://lkml.kernel.org/r/20210913112609.2651084-5-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Tested-by: Shuah Khan <skhan@linuxfoundation.org>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Walter Wu <walter-zh.wu@mediatek.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7594b347
    • M
      lib/stackdepot: introduce __stack_depot_save() · 11ac25c6
      Marco Elver committed
      Add __stack_depot_save(), which provides more fine-grained control over
      stackdepot's memory allocation behaviour, in case stackdepot runs out of
      "stack slabs".
      
      Normally stackdepot uses alloc_pages() in case it runs out of space;
      passing can_alloc==false to __stack_depot_save() prohibits this, at the
      cost of more likely failure to record a stack trace.
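
      The trade-off can be illustrated with a toy depot in user-space C.
      This is a sketch under stated assumptions, not the stackdepot code:
      the toy_* names, the slab size, and the use of a plain number in place
      of a stack trace are all invented for the example, and growing the
      depot is simulated by bumping a counter rather than calling
      alloc_pages().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLAB_ENTRIES 2   /* entries per "stack slab" (made-up size) */
#define TOY_MAX_SLABS    4   /* hard cap so the toy needs no malloc() */

struct toy_depot {
	unsigned long traces[TOY_SLAB_ENTRIES * TOY_MAX_SLABS];
	size_t used;    /* entries stored so far */
	size_t slabs;   /* slabs currently backing the depot */
};

/* Returns a 1-based handle, or 0 if the trace could not be recorded. */
unsigned int toy_depot_save(struct toy_depot *d, unsigned long trace_id,
			    bool can_alloc)
{
	assert(d != NULL);

	/* Depot hit: a trace that is already stored needs no new memory. */
	for (size_t i = 0; i < d->used; i++)
		if (d->traces[i] == trace_id)
			return (unsigned int)(i + 1);

	/* Depot miss with no room left: grow only if the caller allows it. */
	if (d->used == d->slabs * TOY_SLAB_ENTRIES) {
		if (!can_alloc || d->slabs == TOY_MAX_SLABS)
			return 0;
		d->slabs++;   /* stands in for alloc_pages() */
	}

	d->traces[d->used] = trace_id;
	return (unsigned int)++d->used;
}
```

      A save with can_alloc == false fails only when the depot is full and
      the trace has not been seen before; a repeated trace still succeeds.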
      
      Link: https://lkml.kernel.org/r/20210913112609.2651084-4-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Tested-by: Shuah Khan <skhan@linuxfoundation.org>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Walter Wu <walter-zh.wu@mediatek.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11ac25c6
    • M
      lib/stackdepot: remove unused function argument · 7f2b8818
      Marco Elver committed
      alloc_flags in depot_alloc_stack() is no longer used; remove it.
      
      Link: https://lkml.kernel.org/r/20210913112609.2651084-3-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Tested-by: Shuah Khan <skhan@linuxfoundation.org>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Walter Wu <walter-zh.wu@mediatek.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f2b8818
    • M
      lib/stackdepot: include gfp.h · 7857ccdf
      Marco Elver committed
      Patch series "stackdepot, kasan, workqueue: Avoid expanding stackdepot
      slabs when holding raw_spin_lock", v2.
      
      Shuah Khan reported [1]:
      
       | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
       | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
       | it tries to allocate memory attempting to acquire spinlock in page
       | allocation code while holding workqueue pool raw_spinlock.
       |
       | There are several instances of this problem when block layer tries
       | to __queue_work(). Call trace from one of these instances is below:
       |
       |     kblockd_mod_delayed_work_on()
       |       mod_delayed_work_on()
       |         __queue_delayed_work()
       |           __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
       |             insert_work()
       |               kasan_record_aux_stack()
       |                 kasan_save_stack()
       |                   stack_depot_save()
       |                     alloc_pages()
       |                       __alloc_pages()
       |                         get_page_from_freelist()
       |                           rm_queue()
       |                             rm_queue_pcplist()
       |                               local_lock_irqsave(&pagesets.lock, flags);
       |                               [ BUG: Invalid wait context triggered ]
      
      PROVE_RAW_LOCK_NESTING is pointing out that (on RT kernels) the locking
      rules are being violated.  More generally, memory is being allocated
      from a non-preemptive context (a raw_spin_lock'd critical section) where
      it is not allowed.
      
      To properly fix this, we must prevent stackdepot from replenishing its
      "stack slab" pool if memory allocations cannot be done in the current
      context: it's a bug to use either GFP_ATOMIC or GFP_NOWAIT in certain
      non-preemptive contexts, including raw_spin_locks (see gfp.h and commit
      ab00db21).
      
      The only downside is that saving a stack trace may fail if: stackdepot
      runs out of space AND the same stack trace has not been recorded before.
      I expect this to be unlikely, and a simple experiment (boot the kernel)
      didn't result in any failure to record stack trace from insert_work().
      
      The series includes a few minor fixes to stackdepot that I noticed in
      preparing the series.  It then introduces __stack_depot_save(), which
      exposes the option to force stackdepot to not allocate any memory.
      Finally, KASAN is changed to use the new stackdepot interface and
      provide kasan_record_aux_stack_noalloc(), which is then used by
      workqueue code.
      
      [1] https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
      
      This patch (of 6):
      
      <linux/stackdepot.h> refers to gfp_t, but doesn't include gfp.h.
      
      Fix it by including <linux/gfp.h>.
      
      Link: https://lkml.kernel.org/r/20210913112609.2651084-1-elver@google.com
      Link: https://lkml.kernel.org/r/20210913112609.2651084-2-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Tested-by: Shuah Khan <skhan@linuxfoundation.org>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Walter Wu <walter-zh.wu@mediatek.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
      Cc: Taras Madan <tarasmadan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7857ccdf
    • C
      mm: don't include <linux/dax.h> in <linux/mempolicy.h> · 96c84dde
      Christoph Hellwig committed
      Not required at all, and having this causes a huge kernel rebuild as
      soon as something in dax.h changes.
      
      Link: https://lkml.kernel.org/r/20210921082253.1859794-1-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96c84dde
    • S
      mm: disable NUMA_BALANCING_DEFAULT_ENABLED and TRANSPARENT_HUGEPAGE on PREEMPT_RT · 554b0f3c
      Sebastian Andrzej Siewior committed
      TRANSPARENT_HUGEPAGE:
        There are potential non-deterministic delays to an RT thread if a
        critical memory region is not THP-aligned and a non-RT buffer is
        located in the same hugepage-aligned region. It's also possible for an
        unrelated thread to migrate pages belonging to an RT task incurring
        unexpected page faults due to memory defragmentation even if
        khugepaged is disabled.
      
      Regular HUGEPAGEs are not affected by this and can be used.
      
      NUMA_BALANCING:
        There is a non-deterministic delay to mark PTEs PROT_NONE to gather
        NUMA fault samples, increased page faults of regions even if mlocked
        and non-deterministic delays when migrating pages.
      
      [Mel Gorman worded 99% of the commit description].
      
      Link: https://lore.kernel.org/all/20200304091159.GN3818@techsingularity.net/
      Link: https://lore.kernel.org/all/20211026165100.ahz5bkx44lrrw5pt@linutronix.de/
      Link: https://lkml.kernel.org/r/20211028143327.hfbxjze7palrpfgp@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      554b0f3c
    • H
      mm, slub: use prefetchw instead of prefetch · 04b4b006
      Hyeonggon Yoo committed
      Commit 0ad9500e ("slub: prefetch next freelist pointer in
      slab_alloc()") introduced prefetch_freepointer() because when other
      cpu(s) freed objects into a page that current cpu owns, the freelist
      link is hot on cpu(s) which freed objects and possibly very cold on
      current cpu.
      
      But if the freelist link chain is hot on the cpu(s) which freed the
      objects, it's better to invalidate that chain there, because those
      cpu(s) are not going to access it again within a short time.
      
      So use prefetchw instead of prefetch.  On supported architectures like
      x86 and arm, it invalidates other copied instances of a cache line when
      prefetching it.
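
      As a rough user-space illustration of prefetching with write intent,
      the sketch below pops an object from a singly linked freelist and
      hints that the new head will be written soon.  It assumes GCC or
      Clang, whose __builtin_prefetch() takes a read/write flag as its
      second argument; using it as a stand-in for the kernel's prefetchw()
      is an assumption of this example, and the toy_* names are made up.

```c
#include <assert.h>
#include <stddef.h>

struct toy_object {
	struct toy_object *next;   /* freelist link stored inside the object */
};

/* Pop the head of the freelist, or return NULL if the list is empty. */
struct toy_object *toy_freelist_pop(struct toy_object **head)
{
	assert(head != NULL);

	struct toy_object *obj = *head;
	if (!obj)
		return NULL;

	*head = obj->next;
	if (*head)
		/* rw = 1 requests the line for writing, locality = 3 keeps it. */
		__builtin_prefetch(*head, 1, 3);
	return obj;
}
```

      The prefetch is only a hint, so the function behaves identically with
      or without it; any benefit shows up in cache traffic, not in results.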
      
      Before:
      
      Time: 91.677
      
       Performance counter stats for 'hackbench -g 100 -l 10000':
              1462938.07 msec cpu-clock                 #   15.908 CPUs utilized
                18072550      context-switches          #   12.354 K/sec
                 1018814      cpu-migrations            #  696.416 /sec
                  104558      page-faults               #   71.471 /sec
           1580035699271      cycles                    #    1.080 GHz                      (54.51%)
           2003670016013      instructions              #    1.27  insn per cycle           (54.31%)
              5702204863      branch-misses                                                 (54.28%)
            643368500985      cache-references          #  439.778 M/sec                    (54.26%)
             18475582235      cache-misses              #    2.872 % of all cache refs      (54.28%)
            642206796636      L1-dcache-loads           #  438.984 M/sec                    (46.87%)
             18215813147      L1-dcache-load-misses     #    2.84% of all L1-dcache accesses  (46.83%)
            653842996501      dTLB-loads                #  446.938 M/sec                    (46.63%)
              3227179675      dTLB-load-misses          #    0.49% of all dTLB cache accesses  (46.85%)
            537531951350      iTLB-loads                #  367.433 M/sec                    (54.33%)
               114750630      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.37%)
            630135543177      L1-icache-loads           #  430.733 M/sec                    (46.80%)
             22923237620      L1-icache-load-misses     #    3.64% of all L1-icache accesses  (46.76%)
      
            91.964452802 seconds time elapsed
      
            43.416742000 seconds user
          1422.441123000 seconds sys
      
      After:
      
      Time: 90.220
      
       Performance counter stats for 'hackbench -g 100 -l 10000':
              1437418.48 msec cpu-clock                 #   15.880 CPUs utilized
                17694068      context-switches          #   12.310 K/sec
                  958257      cpu-migrations            #  666.651 /sec
                  100604      page-faults               #   69.989 /sec
           1583259429428      cycles                    #    1.101 GHz                      (54.57%)
           2004002484935      instructions              #    1.27  insn per cycle           (54.37%)
              5594202389      branch-misses                                                 (54.36%)
            643113574524      cache-references          #  447.409 M/sec                    (54.39%)
             18233791870      cache-misses              #    2.835 % of all cache refs      (54.37%)
            640205852062      L1-dcache-loads           #  445.386 M/sec                    (46.75%)
             17968160377      L1-dcache-load-misses     #    2.81% of all L1-dcache accesses  (46.79%)
            651747432274      dTLB-loads                #  453.415 M/sec                    (46.59%)
              3127124271      dTLB-load-misses          #    0.48% of all dTLB cache accesses  (46.75%)
            535395273064      iTLB-loads                #  372.470 M/sec                    (54.38%)
               113500056      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.35%)
            628871845924      L1-icache-loads           #  437.501 M/sec                    (46.80%)
             22585641203      L1-icache-load-misses     #    3.59% of all L1-icache accesses  (46.79%)
      
            90.514819303 seconds time elapsed
      
            43.877656000 seconds user
          1397.176001000 seconds sys
      
      Link: https://lkml.org/lkml/2021/10/8/598
      Link: https://lkml.kernel.org/r/20211011144331.70084-1-42.hyeyoo@gmail.com
      Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04b4b006
    • V
      mm/slub: increase default cpu partial list sizes · 23e98ad1
      Vlastimil Babka committed
      The defaults are determined based on object size and can go up to 30 for
      objects smaller than 256 bytes.  Before the previous patch changed the
      accounting, this could have made cpu partial list contain up to 30
      pages.  After that patch, only up to 2 pages with default allocation
      order.
      
      Very short lists limit the usefulness of the whole concept of cpu
      partial lists, so this patch aims at a more reasonable default under the
      new accounting.  The defaults are quadrupled, except for object size >=
      PAGE_SIZE where it's doubled.  This makes the lists grow up to 10 pages
      in practice.
      
      A quick test of booting a kernel under virtme with 4GB RAM and 8 vcpus
      shows the following slab memory usage after boot:
      
      Before previous patch (using page->pobjects):
        Slab:              36732 kB
        SReclaimable:      14836 kB
        SUnreclaim:        21896 kB
      
      After previous patch (using page->pages):
        Slab:              34720 kB
        SReclaimable:      13716 kB
        SUnreclaim:        21004 kB
      
      After this patch (using page->pages, higher defaults):
        Slab:              35252 kB
        SReclaimable:      13944 kB
        SUnreclaim:        21308 kB
      
      In the same setup, I also ran 5 times:
      
          hackbench -l 16000 -g 16
      
      Differences in time were in the noise, but we can compare slub stats as
      given by slabinfo -r skbuff_head_cache (the other cache heavily used by
      hackbench, kmalloc-cg-512, looks similar).  Negligible stats are left out
      for brevity.
      
      Before previous patch (using page->pobjects):
      
        Objects: 1408, Memory Total:  401408 Used :  304128
      
        Slab Perf Counter       Alloc     Free %Al %Fr
        --------------------------------------------------
        Fastpath             469952498  5946606  91   1
        Slowpath             42053573 506059465   8  98
        Page Alloc              41093    41044   0   0
        Add partial                18 21229327   0   4
        Remove partial       20039522    36051   3   0
        Cpu partial list      4686640 24767229   0   4
        RemoteObj/SlabFrozen       16 124027841   0  24
        Total                512006071 512006071
        Flushes       18
      
        Slab Deactivation             Occurrences %
        -------------------------------------------------
        Slab empty                       4993    0%
        Deactivation bypass           24767229   99%
        Refilled from foreign frees   21972674   88%
      
      After previous patch (using page->pages):
      
        Objects: 480, Memory Total:  131072 Used :  103680
      
        Slab Perf Counter       Alloc     Free %Al %Fr
        --------------------------------------------------
        Fastpath             473016294  5405653  92   1
        Slowpath             38989777 506600418   7  98
        Page Alloc              32717    32701   0   0
        Add partial                 3 22749164   0   4
        Remove partial       11371127    32474   2   0
        Cpu partial list     11686226 23090059   2   4
        RemoteObj/SlabFrozen        2 67541803   0  13
        Total                512006071 512006071
        Flushes        3
      
        Slab Deactivation             Occurrences %
        -------------------------------------------------
        Slab empty                        227    0%
        Deactivation bypass           23090059   99%
        Refilled from foreign frees   27585695  119%
      
      After this patch (using page->pages, higher defaults):
      
        Objects: 896, Memory Total:  229376 Used :  193536
      
        Slab Perf Counter       Alloc     Free %Al %Fr
        --------------------------------------------------
        Fastpath             473799295  4980278  92   0
        Slowpath             38206776 507025793   7  99
        Page Alloc              32295    32267   0   0
        Add partial                11 23291143   0   4
        Remove partial        5815764    31278   1   0
        Cpu partial list     18119280 23967320   3   4
        RemoteObj/SlabFrozen       10 76974794   0  15
        Total                512006071 512006071
        Flushes       11
      
        Slab Deactivation             Occurrences %
        -------------------------------------------------
        Slab empty                        989    0%
        Deactivation bypass           23967320   99%
        Refilled from foreign frees   32358473  135%
      
      As expected, memory usage dropped significantly with the change of
      accounting; increasing the defaults raised it again, but not as much.  The
      number of page allocations/frees dropped significantly with the new
      accounting, but didn't increase with the higher defaults.
      Interestingly, the number of fastpath allocations increased, as well as
      allocations from the cpu partial list, even though it's shorter.
      
      Link: https://lkml.kernel.org/r/20211012134651.11258-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23e98ad1
    • V
      mm, slub: change percpu partial accounting from objects to pages · b47291ef
      Vlastimil Babka committed
      With CONFIG_SLUB_CPU_PARTIAL enabled, SLUB keeps a percpu list of
      partial slabs that can be promoted to cpu slab when the previous one is
      depleted, without accessing the shared partial list.  A slab can be
      added to this list by 1) refill of an empty list from get_partial_node()
      - once we really have to access the shared partial list, we acquire
      multiple slabs to amortize the cost of locking, and 2) first free to a
      previously full slab - instead of putting the slab on a shared partial
      list, we can more cheaply freeze it and put it on the per-cpu list.
      
      To control how large a percpu partial list can grow for a kmem cache,
      set_cpu_partial() calculates a target number of free objects on each
      cpu's percpu partial list, and this can be also set by the sysfs file
      cpu_partial.
      
      However, the tracking of actual number of objects is imprecise, in order
      to limit overhead from cpu X freeing an object to a slab on percpu
      partial list of cpu Y.  Basically, the percpu partial slabs form a
      single linked list, and when we add a new slab to the list with current
      head "oldpage", we set in the struct page of the slab we're adding:
      
          page->pages = oldpage->pages + 1; // this is precise
          page->pobjects = oldpage->pobjects + (page->objects - page->inuse);
          page->next = oldpage;
      
      Thus the real number of free objects in the slab (objects - inuse) is
      only determined at the moment of adding the slab to the percpu partial
      list, and further freeing doesn't update the pobjects counter nor
      propagate it to the current list head.  As Jann reports [1], this can
      easily lead to large inaccuracies, where the target number of objects
      (up to 30 by default) can translate to the same number of (empty) slab
      pages on the list.  In case 2) above, we put a slab with 1 free object
      on the list, thus only increase page->pobjects by 1, even if there are
      subsequent frees on the same slab.  Jann has noticed this in practice
      and so did we [2] when investigating significant increase of kmemcg
      usage after switching from SLAB to SLUB.
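
      The drift can be reproduced with a small user-space model of the three
      assignments quoted above.  The struct and function names are
      hypothetical, and the fields are plain ints rather than the real
      struct page layout; the point is only that pobjects is computed once,
      at insertion time.

```c
#include <assert.h>
#include <stddef.h>

struct toy_page {
	int objects;            /* total objects in the slab */
	int inuse;              /* objects currently allocated */
	int pages;              /* slabs on the list, counted from this head */
	int pobjects;           /* free-object estimate, frozen at insertion */
	struct toy_page *next;
};

/* Push a slab onto the list headed by oldpage (NULL for an empty list). */
void toy_add_partial(struct toy_page *page, struct toy_page *oldpage)
{
	assert(page != NULL);
	page->pages = (oldpage ? oldpage->pages : 0) + 1;
	page->pobjects = (oldpage ? oldpage->pobjects : 0) +
			 (page->objects - page->inuse);
	page->next = oldpage;
}

/* Walk the list and count the objects that are really free right now. */
int toy_real_free_objects(const struct toy_page *head)
{
	int total = 0;

	for (; head; head = head->next)
		total += head->objects - head->inuse;
	return total;
}
```

      If a slab is added with one free object and every other object in it
      is freed afterwards, pobjects still reads 1 while the walk reports the
      whole slab as free; the pages counter stays exact throughout.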
      
      While this is no longer a problem in kmemcg context thanks to the
      accounting rewrite in 5.9, the memory waste is still not ideal and it's
      questionable whether it makes sense to perform free object count based
      control when object counts can easily become so inaccurate.  So
      this patch converts the accounting to be based on number of pages only
      (which is precise) and removes the page->pobjects field completely.
      This is also ultimately simpler.
      
      To retain the existing set_cpu_partial() heuristic, first calculate the
      target number of objects as previously, but then convert it to target
      number of pages by assuming the pages will be half-filled on average.
      This assumption might obviously also be inaccurate in practice, but
      cannot degrade to actual number of pages being equal to the target
      number of objects.
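
      The conversion can be written as a one-line helper.  This is a sketch
      of the idea only: the function name is hypothetical, and rounding up
      to a whole page is an assumption of the example rather than something
      stated in this message.

```c
#include <assert.h>

/*
 * Convert a target number of free objects into a target number of pages,
 * assuming each slab on the percpu partial list is half full on average.
 */
unsigned int toy_partial_pages(unsigned int target_objects,
			       unsigned int objects_per_slab)
{
	assert(objects_per_slab > 0);
	/* ceil(target_objects / (objects_per_slab / 2)) in integer math */
	return (2 * target_objects + objects_per_slab - 1) / objects_per_slab;
}
```

      For example, a target of 30 objects with a hypothetical 32 objects per
      slab gives 2 pages, so the page limit stays far below the object
      target even when every slab on the list turns out to be empty.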
      
      We could also skip the intermediate step with target number of objects
      and rewrite the heuristic in terms of pages.  However we still have the
      sysfs file cpu_partial which uses number of objects and could break
      existing users if it suddenly becomes number of pages, so this patch
      doesn't do that.
      
      In practice, after this patch the heuristics limit the size of percpu
      partial list up to 2 pages.  In case of a reported regression (which
      would mean some workload has benefited from the previous imprecise
      object based counting), we can tune the heuristics to get a better
      compromise within the new scheme, while still avoiding the unexpectedly
      long percpu partial lists.
      
      [1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/
      [2] https://lore.kernel.org/all/2f0f46e8-2535-410a-1859-e9cfa4e57c18@suse.cz/
      
      ==========
      Evaluation
      ==========
      
      Mel was kind enough to run v1 through the mmtests machinery for netperf
      (localhost) and hackbench; the most significant results are shown below.
      So there are some apparent regressions, especially with hackbench, which
      I think ultimately boils down to having shorter percpu partial lists on
      average and some benchmarks benefiting from longer ones.  Monitoring
      slab usage also indicated less memory usage by slab.  Based on that, the
      following patch will bump the defaults to allow longer percpu partial
      lists than after this patch.
      
      However, the goal is certainly not to limit the percpu partial lists to
      30 pages just because a specific alloc/free pattern could previously
      turn the limit of 30 objects into a limit of 30 pages - that would make
      little sense.  This is a correctness patch, and if a workload benefits
      from larger lists, the sysfs tuning knobs are still there to allow
      that.
      
      Netperf
      
        2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads per socket), 384GB RAM
        TCP-RR:
          hmean before 127045.79 after 121092.94 (-4.69%, worse)
          stddev before  2634.37 after   1254.08
        UDP-RR:
          hmean before 166985.45 after 160668.94 ( -3.78%, worse)
          stddev before 4059.69 after 1943.63
      
        2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads per socket), 512GB RAM
        TCP-RR:
          hmean before 84173.25 after 76914.72 ( -8.62%, worse)
        UDP-RR:
          hmean before 93571.12 after 96428.69 ( 3.05%, better)
          stddev before 23118.54 after 16828.14
      
        2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads per socket), 64GB RAM
        TCP-RR:
          hmean before 49984.92 after 48922.27 ( -2.13%, worse)
          stddev before 6248.15 after 4740.51
        UDP-RR:
          hmean before 61854.31 after 68761.81 ( 11.17%, better)
          stddev before 4093.54 after 5898.91
      
        other machines - within 2%
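
The percentages in these results appear to be plain relative changes. A minimal sketch of that arithmetic follows, assuming throughput-style metrics (higher is better) for netperf and elapsed-time metrics (lower is better) for hackbench; the function names are hypothetical.

```c
#include <assert.h>

/* Relative change in percent; positive means the value went up. */
static double pct_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}

/*
 * For elapsed-time metrics a larger value is worse, so flip the sign to
 * keep the convention that a negative percentage means a regression.
 */
static double pct_gain_time(double before, double after)
{
	return -pct_change(before, after);
}
```

For instance, the TCP-RR hmean going from 127045.79 to 121092.94 gives roughly -4.69 with `pct_change()`. Small last-digit differences against the published figures are possible because the inputs are already rounded.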
      
      Hackbench
      
        (results before and after the patch, negative % means worse)
      
        2-socket AMD EPYC 7713 (64 cores, 128 threads per socket), 256GB RAM
        hackbench-process-sockets
        Amean 	1 	0.5380	0.5583	( -3.78%)
        Amean 	4 	0.7510	0.8150	( -8.52%)
        Amean 	7 	0.7930	0.9533	( -20.22%)
        Amean 	12 	0.7853	1.1313	( -44.06%)
        Amean 	21 	1.1520	1.4993	( -30.15%)
        Amean 	30 	1.6223	1.9237	( -18.57%)
        Amean 	48 	2.6767	2.9903	( -11.72%)
        Amean 	79 	4.0257	5.1150	( -27.06%)
        Amean 	110	5.5193	7.4720	( -35.38%)
        Amean 	141	7.2207	9.9840	( -38.27%)
        Amean 	172	8.4770	12.1963	( -43.88%)
        Amean 	203	9.6473	14.3137	( -48.37%)
        Amean 	234	11.3960	18.7917	( -64.90%)
        Amean 	265	13.9627	22.4607	( -60.86%)
        Amean 	296	14.9163	26.0483	( -74.63%)
      
        hackbench-thread-sockets
        Amean 	1 	0.5597	0.5877	( -5.00%)
        Amean 	4 	0.7913	0.8960	( -13.23%)
        Amean 	7 	0.8190	1.0017	( -22.30%)
        Amean 	12 	0.9560	1.1727	( -22.66%)
        Amean 	21 	1.7587	1.5660	( 10.96%)
        Amean 	30 	2.4477	1.9807	( 19.08%)
        Amean 	48 	3.4573	3.0630	( 11.41%)
        Amean 	79 	4.7903	5.1733	( -8.00%)
        Amean 	110	6.1370	7.4220	( -20.94%)
        Amean 	141	7.5777	9.2617	( -22.22%)
        Amean 	172	9.2280	11.0907	( -20.18%)
        Amean 	203	10.2793	13.3470	( -29.84%)
        Amean 	234	11.2410	17.1070	( -52.18%)
        Amean 	265	12.5970	23.3323	( -85.22%)
        Amean 	296	17.1540	24.2857	( -41.57%)
      
        2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads
        per socket), 384GB RAM
        hackbench-process-sockets
        Amean 	1 	0.5760	0.4793	( 16.78%)
        Amean 	4 	0.9430	0.9707	( -2.93%)
        Amean 	7 	1.5517	1.8843	( -21.44%)
        Amean 	12 	2.4903	2.7267	( -9.49%)
        Amean 	21 	3.9560	4.2877	( -8.38%)
        Amean 	30 	5.4613	5.8343	( -6.83%)
        Amean 	48 	8.5337	9.2937	( -8.91%)
        Amean 	79 	14.0670	15.2630	( -8.50%)
        Amean 	110	19.2253	21.2467	( -10.51%)
        Amean 	141	23.7557	25.8550	( -8.84%)
        Amean 	172	28.4407	29.7603	( -4.64%)
        Amean 	203	33.3407	33.9927	( -1.96%)
        Amean 	234	38.3633	39.1150	( -1.96%)
        Amean 	265	43.4420	43.8470	( -0.93%)
        Amean 	296	48.3680	48.9300	( -1.16%)
      
        hackbench-thread-sockets
        Amean 	1 	0.6080	0.6493	( -6.80%)
        Amean 	4 	1.0000	1.0513	( -5.13%)
        Amean 	7 	1.6607	2.0260	( -22.00%)
        Amean 	12 	2.7637	2.9273	( -5.92%)
        Amean 	21 	5.0613	4.5153	( 10.79%)
        Amean 	30 	6.3340	6.1140	( 3.47%)
        Amean 	48 	9.0567	9.5577	( -5.53%)
        Amean 	79 	14.5657	15.7983	( -8.46%)
        Amean 	110	19.6213	21.6333	( -10.25%)
        Amean 	141	24.1563	26.2697	( -8.75%)
        Amean 	172	28.9687	30.2187	( -4.32%)
        Amean 	203	33.9763	34.6970	( -2.12%)
        Amean 	234	38.8647	39.3207	( -1.17%)
        Amean 	265	44.0813	44.1507	( -0.16%)
        Amean 	296	49.2040	49.4330	( -0.47%)
      
        2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads
        per socket), 512GB RAM
        hackbench-process-sockets
        Amean 	1 	0.5027	0.5017	( 0.20%)
        Amean 	4 	1.1053	1.2033	( -8.87%)
        Amean 	7 	1.8760	2.1820	( -16.31%)
        Amean 	12 	2.9053	3.1810	( -9.49%)
        Amean 	21 	4.6777	4.9920	( -6.72%)
        Amean 	30 	6.5180	6.7827	( -4.06%)
        Amean 	48 	10.0710	10.5227	( -4.48%)
        Amean 	79 	16.4250	17.5053	( -6.58%)
        Amean 	110	22.6203	24.4617	( -8.14%)
        Amean 	141	28.0967	31.0363	( -10.46%)
        Amean 	172	34.4030	36.9233	( -7.33%)
        Amean 	203	40.5933	43.0850	( -6.14%)
        Amean 	234	46.6477	48.7220	( -4.45%)
        Amean 	265	53.0530	53.9597	( -1.71%)
        Amean 	296	59.2760	59.9213	( -1.09%)
      
        hackbench-thread-sockets
        Amean 	1 	0.5363	0.5330	( 0.62%)
        Amean 	4 	1.1647	1.2157	( -4.38%)
        Amean 	7 	1.9237	2.2833	( -18.70%)
        Amean 	12 	2.9943	3.3110	( -10.58%)
        Amean 	21 	4.9987	5.1880	( -3.79%)
        Amean 	30 	6.7583	7.0043	( -3.64%)
        Amean 	48 	10.4547	10.8353	( -3.64%)
        Amean 	79 	16.6707	17.6790	( -6.05%)
        Amean 	110	22.8207	24.4403	( -7.10%)
        Amean 	141	28.7090	31.0533	( -8.17%)
        Amean 	172	34.9387	36.8260	( -5.40%)
        Amean 	203	41.1567	43.0450	( -4.59%)
        Amean 	234	47.3790	48.5307	( -2.43%)
        Amean 	265	53.9543	54.6987	( -1.38%)
        Amean 	296	60.0820	60.2163	( -0.22%)
      
        1-socket Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz (4 cores, 8 threads),
        32 GB RAM
        hackbench-process-sockets
        Amean 	1 	1.4760	1.5773	( -6.87%)
        Amean 	3 	3.9370	4.0910	( -3.91%)
        Amean 	5 	6.6797	6.9357	( -3.83%)
        Amean 	7 	9.3367	9.7150	( -4.05%)
        Amean 	12	15.7627	16.1400	( -2.39%)
        Amean 	18	23.5360	23.6890	( -0.65%)
        Amean 	24	31.0663	31.3137	( -0.80%)
        Amean 	30	38.7283	39.0037	( -0.71%)
        Amean 	32	41.3417	41.6097	( -0.65%)
      
        hackbench-thread-sockets
        Amean 	1 	1.5250	1.6043	( -5.20%)
        Amean 	3 	4.0897	4.2603	( -4.17%)
        Amean 	5 	6.7760	7.0933	( -4.68%)
        Amean 	7 	9.4817	9.9157	( -4.58%)
        Amean 	12	15.9610	16.3937	( -2.71%)
        Amean 	18	23.9543	24.3417	( -1.62%)
        Amean 	24	31.4400	31.7217	( -0.90%)
        Amean 	30	39.2457	39.5467	( -0.77%)
        Amean 	32	41.8267	42.1230	( -0.71%)
      
        2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads
        per socket), 64GB RAM
        hackbench-process-sockets
        Amean 	1 	1.0347	1.0880	( -5.15%)
        Amean 	4 	1.7267	1.8527	( -7.30%)
        Amean 	7 	2.6707	2.8110	( -5.25%)
        Amean 	12 	4.1617	4.3383	( -4.25%)
        Amean 	21 	7.0070	7.2600	( -3.61%)
        Amean 	30 	9.9187	10.2397	( -3.24%)
        Amean 	48 	15.6710	16.3923	( -4.60%)
        Amean 	79 	24.7743	26.1247	( -5.45%)
        Amean 	110	34.3000	35.9307	( -4.75%)
        Amean 	141	44.2043	44.8010	( -1.35%)
        Amean 	172	54.2430	54.7260	( -0.89%)
        Amean 	192	60.6557	60.9777	( -0.53%)
      
        hackbench-thread-sockets
        Amean 	1 	1.0610	1.1353	( -7.01%)
        Amean 	4 	1.7543	1.9140	( -9.10%)
        Amean 	7 	2.7840	2.9573	( -6.23%)
        Amean 	12 	4.3813	4.4937	( -2.56%)
        Amean 	21 	7.3460	7.5350	( -2.57%)
        Amean 	30 	10.2313	10.5190	( -2.81%)
        Amean 	48 	15.9700	16.5940	( -3.91%)
        Amean 	79 	25.3973	26.6637	( -4.99%)
        Amean 	110	35.1087	36.4797	( -3.91%)
        Amean 	141	45.8220	46.3053	( -1.05%)
        Amean 	172	55.4917	55.7320	( -0.43%)
        Amean 	192	62.7490	62.5410	( 0.33%)
      
      Link: https://lkml.kernel.org/r/20211012134651.11258-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Jann Horn <jannh@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b47291ef
    • K
      slub: add back check for free nonslab objects · d0fe47c6
      Kefeng Wang committed
      After commit f227f0fa ("slub: fix unreclaimable slab stat for bulk
      free"), the check for freeing a nonslab page was replaced by
      VM_BUG_ON_PAGE, which only checks when CONFIG_DEBUG_VM is enabled.
      That config may impact performance, so it is meant only for debugging.
      
      Commit 0937502a ("slub: Add check for kfree() of non slab objects.")
      added the check, which is needed in every config to catch invalid
      frees.  Such frees can indicate real problems, e.g. memory corruption,
      use after free or double free, so replace VM_BUG_ON_PAGE with
      WARN_ON_ONCE and print the object address to help debug the issue.
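
As a rough userspace illustration of the warn-once idea (not the kernel's WARN_ON_ONCE itself), the sketch below reports only the first invalid free and includes the object address. The names `warn_once()`, `checked_free()` and `warnings_printed` are hypothetical, and refusing the free is an assumption made for this sketch rather than what the patch does.

```c
#include <stdbool.h>
#include <stdio.h>

static int warnings_printed; /* number of warnings emitted so far */

/*
 * Stand-in for a warn-once check: returns the condition so the caller
 * can branch on it, but prints a report only the first time it is true.
 */
static bool warn_once(bool cond, const void *object)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		warnings_printed++;
		fprintf(stderr, "invalid free, object pointer: %p\n", object);
	}
	return cond;
}

/*
 * Hypothetical free path. is_slab_or_compound models the page-type check;
 * this sketch simply refuses the bogus free (an assumption).
 */
static bool checked_free(const void *object, bool is_slab_or_compound)
{
	if (warn_once(!is_slab_or_compound, object))
		return false;
	return true;
}
```

The point of the pattern is that the check runs in every build, while the log is not flooded when the same bug triggers repeatedly.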
      
      Link: https://lkml.kernel.org/r/20210930070214.61499-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0fe47c6
    • S
      mm/slab.c: remove useless lines in enable_cpucache() · ffc95a46
      Shi Lei committed
      These lines are useless, so remove them.
      
      Link: https://lkml.kernel.org/r/20210930034845.2539-1-shi_lei@massclouds.com
      Fixes: 10befea9 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
      Signed-off-by: Shi Lei <shi_lei@massclouds.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ffc95a46
    • M
      mm: move kvmalloc-related functions to slab.h · 8587ca6f
      Matthew Wilcox (Oracle) committed
      Not all files in the kernel should include mm.h.  Migrating callers from
      kmalloc to kvmalloc is easier if the kvmalloc functions are in slab.h.
      
      [akpm@linux-foundation.org: move the new kvrealloc() also]
      [akpm@linux-foundation.org: drivers/hwmon/occ/p9_sbe.c needs slab.h]
      
      Link: https://lkml.kernel.org/r/20210622215757.3525604-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8587ca6f
    • J
      d_path: fix Kernel doc validator complaining · d41b6035
      Jia He committed
      Kernel doc validator complains:
        Function parameter or member 'p' not described in 'prepend_name'
        Excess function parameter 'buffer' description in 'prepend_name'
      
      Link: https://lkml.kernel.org/r/20211011005614.26189-1-justin.he@arm.com
      Fixes: ad08ae58 ("d_path: introduce struct prepend_buffer")
      Signed-off-by: Jia He <justin.he@arm.com>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d41b6035
    • A
      fs/posix_acl.c: avoid -Wempty-body warning · d1cef29a
      Arnd Bergmann committed
      The fallthrough comment for an ignored cmpxchg() return value produces a
      harmless warning with 'make W=1':
      
        fs/posix_acl.c: In function 'get_acl':
        fs/posix_acl.c:127:36: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
          127 |                 /* fall through */ ;
              |                                    ^
      
      Simplify it as a step towards a clean W=1 build.  As all architectures
      define cmpxchg() as a statement expression these days, it is no longer
      necessary to evaluate its return code, and the if() can just be dropped.
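
The shape of this change can be sketched in userspace with C11 atomics standing in for the kernel's cmpxchg(). The constants `NOT_CACHED` and `SENTINEL` and the function `claim_slot()` are hypothetical names chosen for the illustration.

```c
#include <stdatomic.h>

enum { NOT_CACHED = -1, SENTINEL = -2 }; /* hypothetical marker values */

/*
 * Try to move the slot from NOT_CACHED to SENTINEL. Whether the exchange
 * succeeded does not matter to the caller, so the result is discarded
 * instead of being tested in an if () with an empty body, which is the
 * construct that -Wempty-body complains about.
 */
static void claim_slot(atomic_int *slot)
{
	int expected = NOT_CACHED;

	(void)atomic_compare_exchange_strong(slot, &expected, SENTINEL);
}
```

If the slot already holds another value, the call leaves it untouched, which is the same "ignore the failure" behaviour the empty if() body used to express.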
      
      Link: https://lkml.kernel.org/r/20210927102410.1863853-1-arnd@kernel.org
      Link: https://lore.kernel.org/all/20210322132103.qiun2rjilnlgztxe@wittgenstein/
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: James Morris <jamorris@linux.microsoft.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d1cef29a
    • J
      ocfs2: do not zero pages beyond i_size · c7c14a36
      Jan Kara committed
      ocfs2_zero_range_for_truncate() can try to zero pages beyond current
      inode size despite the fact that underlying blocks should be already
      zeroed out and writeback will skip writing such pages anyway.  Avoid the
      pointless work.
      
      Link: https://lkml.kernel.org/r/20211025151332.11301-2-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c7c14a36
    • J
      ocfs2: fix data corruption on truncate · 839b6386
      Jan Kara committed
      Patch series "ocfs2: Truncate data corruption fix".
      
      As further testing has shown, commit 5314454e ("ocfs2: fix data
      corruption after conversion from inline format") didn't fix all the data
      corruption issues the customer started observing after 6dbf7bb5
      ("fs: Don't invalidate page buffers in block_write_full_page()").  This
      time I have tracked them down to two bugs in ocfs2 truncation code.
      
      One bug (truncating page cache before clearing tail cluster and setting
      i_size) could cause data corruption even before 6dbf7bb5, but before
      that commit it needed a race with page fault.  After 6dbf7bb5 it
      started to be pretty deterministic.
      
      Another bug (zeroing pages beyond old i_size) used to be a harmless
      inefficiency before commit 6dbf7bb5.  But after commit 6dbf7bb5,
      in combination with the first bug, it resulted in deterministic data
      corruption.
      
      Although fixing only the first problem is needed to stop data
      corruption, I've fixed both issues to make the code more robust.
      
      This patch (of 2):
      
      ocfs2_truncate_file() unmapped and invalidated page cache pages before
      zeroing the partial tail cluster and setting i_size.  Thus some pages
      could be left (and likely were left if the cluster zeroing happened) in
      the page cache beyond i_size after truncate finished, letting the user
      possibly see stale data once the file was extended again.  Also, the
      tail cluster zeroing was not guaranteed to finish before truncate
      finished, causing possible stale data exposure.  The problem became
      particularly easy to hit after commit 6dbf7bb5 ("fs: Don't invalidate
      page buffers in block_write_full_page()") stopped invalidating pages
      beyond i_size from the page writeback path.
      
      Fix these problems by unmapping and invalidating pages in the page cache
      after the i_size is reduced and tail cluster is zeroed out.
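
The effect of the reordering can be illustrated with a toy userspace model. This is not ocfs2 code: `struct toy_file`, the page and cluster sizes, and all function names are hypothetical, and the model assumes that zeroing the tail cluster instantiates page cache pages for that cluster.

```c
#include <stdbool.h>

#define NPAGES 8
#define PAGES_PER_CLUSTER 4

struct toy_file {
	unsigned int i_size_pages; /* file size, in pages */
	bool cached[NPAGES];       /* is the page in the page cache? */
};

/* Drop every cached page at or beyond new_size. */
static void invalidate_beyond(struct toy_file *f, unsigned int new_size)
{
	for (unsigned int i = new_size; i < NPAGES; i++)
		f->cached[i] = false;
}

/* Zeroing the rest of the tail cluster pulls its pages into the cache. */
static void zero_tail_cluster(struct toy_file *f, unsigned int new_size)
{
	unsigned int end = (new_size + PAGES_PER_CLUSTER - 1) /
			   PAGES_PER_CLUSTER * PAGES_PER_CLUSTER;

	for (unsigned int i = new_size; i < end && i < NPAGES; i++)
		f->cached[i] = true;
}

/* Old order: invalidate first, so zeroing re-creates pages past i_size. */
static void truncate_old(struct toy_file *f, unsigned int new_size)
{
	invalidate_beyond(f, new_size);
	zero_tail_cluster(f, new_size);
	f->i_size_pages = new_size;
}

/* Fixed order: reduce i_size and zero the tail, then invalidate. */
static void truncate_fixed(struct toy_file *f, unsigned int new_size)
{
	f->i_size_pages = new_size;
	zero_tail_cluster(f, new_size);
	invalidate_beyond(f, new_size);
}

/* Count cached pages that lie beyond i_size. */
static unsigned int stale_pages(const struct toy_file *f)
{
	unsigned int n = 0;

	for (unsigned int i = f->i_size_pages; i < NPAGES; i++)
		if (f->cached[i])
			n++;
	return n;
}
```

In this model, truncating to 2 pages with the old order leaves two cached pages beyond i_size, while the fixed order leaves none.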
      
      Link: https://lkml.kernel.org/r/20211025150008.29002-1-jack@suse.cz
      Link: https://lkml.kernel.org/r/20211025151332.11301-1-jack@suse.cz
      Fixes: ccd979bd ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      839b6386
    • C
      ocfs2/dlm: remove redundant assignment of variable ret · 848be75d
      Colin Ian King committed
      The variable ret is being assigned a value that is never read; it is
      updated later on with a different value.  The assignment is redundant
      and can be removed.
      
      Addresses-Coverity: ("Unused value")
      Link: https://lkml.kernel.org/r/20211007233452.30815-1-colin.king@canonical.com
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      848be75d
    • V
      ocfs2: cleanup journal init and shutdown · da5e7c87
      Valentin Vidic committed
      Allocate and free struct ocfs2_journal in ocfs2_journal_init and
      ocfs2_journal_shutdown.  Init and release of system inodes reference
      the journal, so reorder the calls to make sure they work correctly.
      
      Link: https://lkml.kernel.org/r/20211009145006.3478-1-vvidic@valentin-vidic.from.hr
      Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da5e7c87