1. 07 Nov 2021, 23 commits
    • K
      kasan: arm64: fix pcpu_page_first_chunk crash with KASAN_VMALLOC · 3252b1d8
      Authored by Kefeng Wang
      With KASAN_VMALLOC and NEED_PER_CPU_PAGE_FIRST_CHUNK the kernel crashes:
      
        Unable to handle kernel paging request at virtual address ffff7000028f2000
        ...
        swapper pgtable: 64k pages, 48-bit VAs, pgdp=0000000042440000
        [ffff7000028f2000] pgd=000000063e7c0003, p4d=000000063e7c0003, pud=000000063e7c0003, pmd=000000063e7b0003, pte=0000000000000000
        Internal error: Oops: 96000007 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 5.13.0-rc4-00003-gc6e6e28f3f30-dirty #62
        Hardware name: linux,dummy-virt (DT)
        pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO BTYPE=--)
        pc : kasan_check_range+0x90/0x1a0
        lr : memcpy+0x88/0xf4
        sp : ffff80001378fe20
        ...
        Call trace:
         kasan_check_range+0x90/0x1a0
         pcpu_page_first_chunk+0x3f0/0x568
         setup_per_cpu_areas+0xb8/0x184
         start_kernel+0x8c/0x328
      
      The vm area used in vm_area_register_early() has no KASAN shadow
      memory.  Add a new kasan_populate_early_vm_area_shadow() function to
      populate the vm area shadow memory and fix the issue, roughly as
      sketched below.
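
      A minimal sketch of the hook (the weak default plus the arm64
      override; condensed here, so details may differ from the final
      patch):

          void __init __weak kasan_populate_early_vm_area_shadow(void *start,
                                                                 unsigned long size)
          {
          }

          /* arch/arm64/mm/kasan_init.c */
          void __init kasan_populate_early_vm_area_shadow(void *start,
                                                          unsigned long size)
          {
                  kasan_map_populate((unsigned long)kasan_mem_to_shadow(start),
                                     (unsigned long)kasan_mem_to_shadow(start + size),
                                     early_pfn_to_nid(virt_to_pfn(start)));
          }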
      
      [wangkefeng.wang@huawei.com: fix redefinition of 'kasan_populate_early_vm_area_shadow']
        Link: https://lkml.kernel.org/r/20211011123211.3936196-1-wangkefeng.wang@huawei.com
      
      Link: https://lkml.kernel.org/r/20210910053354.26721-4-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: Marco Elver <elver@google.com>		[KASAN]
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>	[KASAN]
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3252b1d8
    • P
      mm/vmalloc: don't allow VM_NO_GUARD on vmap() · bd1a8fb2
      Authored by Peter Zijlstra
      The vmalloc guard pages are added on top of each allocation, thereby
      isolating any two allocations from one another.  The top guard of the
      lower allocation is the bottom guard of the higher allocation, and so
      on.
      
      Therefore VM_NO_GUARD is dangerous; it breaks the basic premise of
      isolating separate allocations.
      
      There are only two in-tree users of this flag, neither of which uses
      it through the exported interface.  Ensure it stays this way.
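
      A sketch of the enforcement in the exported interface (illustrative;
      the exact handling in the patch may differ):

          /* mm/vmalloc.c, in vmap() */
          if (WARN_ON_ONCE(flags & VM_NO_GUARD))
                  flags &= ~VM_NO_GUARD;  /* never hand out unguarded mappings */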
      
      Link: https://lkml.kernel.org/r/YUMfdA36fuyZ+/xt@hirez.programming.kicks-ass.net
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Will Deacon <will@kernel.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd1a8fb2
    • L
      include/linux/io-mapping.h: remove fallback for writecombine · 2e86f78b
      Authored by Lucas De Marchi
      The fallback was introduced in commit 80c33624 ("io-mapping: Fixup
      for different names of writecombine") to fix the build on microblaze.
      
      Five years later, it seems all architectures now provide
      pgprot_writecombine(), so just remove the remaining fallbacks.  For
      microblaze, pgprot_writecombine() is available since commit 97ccedd7
      ("microblaze: Provide pgprot_device/writecombine macros for nommu").
      
      This is build-tested on microblaze with a hack to always build
      mm/io-mapping.o, and without DIYing on an x86-only macro
      (_PAGE_CACHE_MASK).
      
      Link: https://lkml.kernel.org/r/20211020204838.1142908-1-lucas.demarchi@intel.com
      Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e86f78b
    • L
      memory: remove unused CONFIG_MEM_BLOCK_SIZE · e26e0cc3
      Authored by Lukas Bulwahn
      Commit 3947be19 ("[PATCH] memory hotplug: sysfs and add/remove
      functions") defines CONFIG_MEM_BLOCK_SIZE, but this has never been
      utilized anywhere.
      
      It is good practice to reserve the CONFIG_* namespace exclusively for
      the Kbuild system.  So, drop this unused definition.
      
      This issue was noticed due to running ./scripts/checkkconfigsymbols.py.
      
      Link: https://lkml.kernel.org/r/20211006120354.7468-1-lukas.bulwahn@gmail.com
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e26e0cc3
    • P
      mm: add zap_skip_check_mapping() helper · 91b61ef3
      Authored by Peter Xu
      Use the helper for the checks.  Rename "check_mapping" to
      "zap_mapping", because "check_mapping" looks like a bool when in fact
      it stores the mapping itself.  When it's set, we check the mapping
      (it must be non-NULL); when it's cleared, we skip the check, as
      before.
      
      Move the duplicated comments to the helper too.
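
      A sketch of the helper as described (simplified; the real check may
      treat special mappings differently):

          /*
           * We set details->zap_mapping when we want to unmap shared but
           * keep private pages: skip any page whose mapping doesn't match.
           */
          static inline bool
          zap_skip_check_mapping(struct zap_details *details, struct page *page)
          {
                  if (!details || !page)
                          return false;

                  return details->zap_mapping &&
                         details->zap_mapping != page->mapping;
          }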
      
      Link: https://lkml.kernel.org/r/20210915181538.11288-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91b61ef3
    • P
      mm: drop first_index/last_index in zap_details · 232a6a1c
      Authored by Peter Xu
      The first_index/last_index parameters in zap_details are actually
      only used in unmap_mapping_range_tree().  Meanwhile, that function is
      only called once, by unmap_mapping_pages().
      
      Instead of passing these two variables through the whole stack of page
      zapping code, remove them from zap_details and let them simply be
      parameters of unmap_mapping_range_tree(), which is inlined.
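
      The resulting signature, as described (a sketch):

          static void unmap_mapping_range_tree(struct rb_root_cached *root,
                                               pgoff_t first_index,
                                               pgoff_t last_index,
                                               struct zap_details *details);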
      
      Link: https://lkml.kernel.org/r/20210915181535.11238-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      232a6a1c
    • M
      mm: memcontrol: remove the kmem states · e80216d9
      Authored by Muchun Song
      Now the kmem state is only used to indicate whether kmem is offline.
      However, we can set ->kmemcg_id to -1 to indicate that instead.  So
      remove the kmem state to simplify the code.
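
      The idea, sketched (the helper name is hypothetical; the patch may
      simply open-code the comparison at the call sites):

          /* hypothetical helper illustrating the new convention */
          static inline bool memcg_kmem_is_online(struct mem_cgroup *memcg)
          {
                  return memcg->kmemcg_id >= 0;   /* -1 means kmem is offline */
          }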
      
      Link: https://lkml.kernel.org/r/20211025125259.56624-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e80216d9
    • C
      mm: simplify bdi refcounting · efee1713
      Authored by Christoph Hellwig
      Move grabbing and releasing the bdi refcount out of the common
      wb_init/wb_exit helpers into code that is only used for the non-default
      memcg driven bdi_writeback structures.
      
      [hch@lst.de: add comment]
        Link: https://lkml.kernel.org/r/20211027074207.GA12793@lst.de
      [akpm@linux-foundation.org: fix typo]
      
      Link: https://lkml.kernel.org/r/20211021124441.668816-6-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Vignesh Raghavendra <vigneshr@ti.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      efee1713
    • C
      fs: explicitly unregister per-superblock BDIs · 0b3ea092
      Authored by Christoph Hellwig
      Add a new SB_I_ flag to mark superblocks that have an ephemeral bdi
      associated with them, and unregister it when the superblock is shut
      down.
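
      Roughly (a sketch; the flag value is illustrative):

          /* include/linux/fs.h */
          #define SB_I_PERSB_BDI  0x00000200      /* has a per-sb bdi */

          /* fs/super.c, in the superblock shutdown path */
          if (sb->s_iflags & SB_I_PERSB_BDI)
                  bdi_unregister(sb->s_bdi);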
      
      Link: https://lkml.kernel.org/r/20211021124441.668816-4-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Vignesh Raghavendra <vigneshr@ti.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b3ea092
    • K
      percpu: add __alloc_size attributes for better bounds checking · 17197dd4
      Authored by Kees Cook
      As already done in GrapheneOS, add the __alloc_size attribute for
      appropriate percpu allocator interfaces, to provide additional hinting
      for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
      compiler optimizations.
      
      Note that due to the implementation of the percpu API, this is unlikely
      to ever actually provide compile-time checking beyond very simple
      non-SMP builds.  But, since they are technically allocators, mark them
      as such.
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-9-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Co-developed-by: Daniel Micay <danielmicay@gmail.com>
      Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17197dd4
    • K
      mm/page_alloc: add __alloc_size attributes for better bounds checking · abd58f38
      Authored by Kees Cook
      As already done in GrapheneOS, add the __alloc_size attribute for
      appropriate page allocator interfaces, to provide additional hinting for
      better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
      compiler optimizations.
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-8-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Co-developed-by: Daniel Micay <danielmicay@gmail.com>
      Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abd58f38
    • K
      mm/vmalloc: add __alloc_size attributes for better bounds checking · 894f24bb
      Authored by Kees Cook
      As already done in GrapheneOS, add the __alloc_size attribute for
      appropriate vmalloc allocator interfaces, to provide additional hinting
      for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
      compiler optimizations.
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-7-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Co-developed-by: Daniel Micay <danielmicay@gmail.com>
      Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      894f24bb
    • K
      mm/kvmalloc: add __alloc_size attributes for better bounds checking · 56bcf40f
      Authored by Kees Cook
      As already done in GrapheneOS, add the __alloc_size attribute for
      regular kvmalloc interfaces, to provide additional hinting for better
      bounds checking, assisting CONFIG_FORTIFY_SOURCE and other compiler
      optimizations.
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-6-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Co-developed-by: Daniel Micay <danielmicay@gmail.com>
      Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56bcf40f
    • K
      slab: add __alloc_size attributes for better bounds checking · c37495d6
      Authored by Kees Cook
      As already done in GrapheneOS, add the __alloc_size attribute for
      regular kmalloc interfaces, to provide additional hinting for better
      bounds checking, assisting CONFIG_FORTIFY_SOURCE and other compiler
      optimizations.
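
      For example, the sized entry points gain annotations along these
      lines (a sketch; the actual patch covers the whole kmalloc family):

          void *kmalloc(size_t size, gfp_t flags) __alloc_size(1);
          void *kmalloc_node(size_t size, gfp_t flags, int node) __alloc_size(1);
          void *kmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
          void *krealloc(const void *p, size_t new_size, gfp_t flags) __alloc_size(2);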
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-5-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Co-developed-by: Daniel Micay <danielmicay@gmail.com>
      Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c37495d6
    • K
      slab: clean up function prototypes · 72d67229
      Authored by Kees Cook
      Based on feedback from Joe Perches and Linus Torvalds, regularize the
      slab function prototypes before making attribute changes.
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-4-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72d67229
    • K
      Compiler Attributes: add __alloc_size() for better bounds checking · 86cffecd
      Authored by Kees Cook
      GCC and Clang can use the "alloc_size" attribute to better inform the
      results of __builtin_object_size() (for compile-time constant values).
      Clang can additionally use alloc_size to inform the results of
      __builtin_dynamic_object_size() (for run-time values).
      
      Because GCC sees the frequent use of struct_size() as an allocator size
      argument, and notices it can return SIZE_MAX (the overflow indication),
      it complains about these call sites overflowing (since SIZE_MAX is
      greater than the default -Walloc-size-larger-than=PTRDIFF_MAX).  This
      isn't helpful since we already know a SIZE_MAX will be caught at
      run-time (this was an intentional design).  To deal with this, we must
      disable this check as it is both a false positive and redundant.  (Clang
      does not have this warning option.)
      
      Unfortunately, just checking the -Wno-alloc-size-larger-than is not
      sufficient to make the __alloc_size attribute behave correctly under
      older GCC versions.  The attribute itself must be disabled in those
      situations too, as there appears to be no way to reliably silence the
      SIZE_MAX constant expression cases for GCC versions less than 9.1:
      
         In file included from ./include/linux/resource_ext.h:11,
                          from ./include/linux/pci.h:40,
                          from drivers/net/ethernet/intel/ixgbe/ixgbe.h:9,
                          from drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c:4:
         In function 'kmalloc_node',
             inlined from 'ixgbe_alloc_q_vector' at ./include/linux/slab.h:743:9:
         ./include/linux/slab.h:618:9: error: argument 1 value '18446744073709551615' exceeds maximum object size 9223372036854775807 [-Werror=alloc-size-larger-than=]
           return __kmalloc_node(size, flags, node);
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         ./include/linux/slab.h: In function 'ixgbe_alloc_q_vector':
         ./include/linux/slab.h:455:7: note: in a call to allocation function '__kmalloc_node' declared here
          void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_slab_alignment __malloc;
                ^~~~~~~~~~~~~~
      
      Specifically:
       '-Wno-alloc-size-larger-than' is not correctly handled by GCC < 9.1
          https://godbolt.org/z/hqsfG7q84 (doesn't disable)
          https://godbolt.org/z/P9jdrPTYh (doesn't admit to not knowing about option)
          https://godbolt.org/z/465TPMWKb (only warns when other warnings appear)
      
       '-Walloc-size-larger-than=18446744073709551615' is not handled by GCC < 8.2
          https://godbolt.org/z/73hh1EPxz (ignores numeric value)
      
      Since anything marked with __alloc_size would also qualify for marking
      with __malloc, just include __malloc along with it to avoid redundant
      markings.  (Suggested by Linus Torvalds.)
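
      Putting the pieces together, the result is roughly (a sketch; the
      real code splits this between compiler_attributes.h and
      compiler_types.h and gates on the GCC version as discussed above):

          /* include/linux/compiler_attributes.h */
          #define __alloc_size__(x, ...) \
                  __attribute__((__alloc_size__(x, ## __VA_ARGS__)))

          /* include/linux/compiler_types.h */
          #ifdef __alloc_size__
          # define __alloc_size(x, ...)   __alloc_size__(x, ## __VA_ARGS__) __malloc
          #else
          # define __alloc_size(x, ...)   __malloc
          #endif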
      
      Finally, make sure checkpatch.pl doesn't get confused about finding the
      __alloc_size attribute on functions.  (Thanks to Joe Perches.)
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-3-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Tested-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86cffecd
    • M
      kasan: generic: introduce kasan_record_aux_stack_noalloc() · 7cb3007c
      Authored by Marco Elver
      Introduce a variant of kasan_record_aux_stack() that does not do any
      memory allocation through stackdepot.  This will permit using it in
      contexts that cannot allocate any memory.
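
      The shape of the new API (a sketch based on the description):

          static void __kasan_record_aux_stack(void *addr, bool can_alloc)
          {
                  struct kasan_alloc_meta *alloc_meta;

                  /* ... look up alloc_meta for addr ... */
                  alloc_meta->aux_stack[1] = alloc_meta->aux_stack[0];
                  alloc_meta->aux_stack[0] =
                          kasan_save_stack(GFP_NOWAIT, can_alloc);
          }

          void kasan_record_aux_stack(void *addr)
          {
                  return __kasan_record_aux_stack(addr, true);
          }

          void kasan_record_aux_stack_noalloc(void *addr)
          {
                  return __kasan_record_aux_stack(addr, false);
          }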
      
      Link: https://lkml.kernel.org/r/20210913112609.2651084-6-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Tested-by: Shuah Khan <skhan@linuxfoundation.org>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Walter Wu <walter-zh.wu@mediatek.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cb3007c
    • M
      lib/stackdepot: introduce __stack_depot_save() · 11ac25c6
      Authored by Marco Elver
      Add __stack_depot_save(), which provides more fine-grained control over
      stackdepot's memory allocation behaviour, in case stackdepot runs out of
      "stack slabs".
      
      Normally stackdepot uses alloc_pages() in case it runs out of space;
      passing can_alloc==false to __stack_depot_save() prohibits this, at the
      cost of more likely failure to record a stack trace.
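
      The new interface, with the old one becoming a wrapper (a sketch):

          depot_stack_handle_t __stack_depot_save(unsigned long *entries,
                                                  unsigned int nr_entries,
                                                  gfp_t alloc_flags,
                                                  bool can_alloc);

          depot_stack_handle_t stack_depot_save(unsigned long *entries,
                                                unsigned int nr_entries,
                                                gfp_t alloc_flags)
          {
                  return __stack_depot_save(entries, nr_entries, alloc_flags, true);
          }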
      
      Link: https://lkml.kernel.org/r/20210913112609.2651084-4-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Tested-by: Shuah Khan <skhan@linuxfoundation.org>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Walter Wu <walter-zh.wu@mediatek.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11ac25c6
    • M
      lib/stackdepot: include gfp.h · 7857ccdf
      Authored by Marco Elver
      Patch series "stackdepot, kasan, workqueue: Avoid expanding stackdepot
      slabs when holding raw_spin_lock", v2.
      
      Shuah Khan reported [1]:
      
       | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
       | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
       | it tries to allocate memory attempting to acquire spinlock in page
       | allocation code while holding workqueue pool raw_spinlock.
       |
       | There are several instances of this problem when block layer tries
       | to __queue_work(). Call trace from one of these instances is below:
       |
       |     kblockd_mod_delayed_work_on()
       |       mod_delayed_work_on()
       |         __queue_delayed_work()
       |           __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
       |             insert_work()
       |               kasan_record_aux_stack()
       |                 kasan_save_stack()
       |                   stack_depot_save()
       |                     alloc_pages()
       |                       __alloc_pages()
       |                         get_page_from_freelist()
       |                           rmqueue()
       |                             rmqueue_pcplist()
       |                               local_lock_irqsave(&pagesets.lock, flags);
       |                               [ BUG: Invalid wait context triggered ]
      
      PROVE_RAW_LOCK_NESTING is pointing out that (on RT kernels) the
      locking rules are being violated.  More generally, memory is being
      allocated from a non-preemptible context (a raw_spin_lock'd critical
      section) where it is not allowed.
      
      To properly fix this, we must prevent stackdepot from replenishing
      its "stack slab" pool if memory allocations cannot be done in the
      current context: it's a bug to use GFP_ATOMIC or GFP_NOWAIT in
      certain non-preemptible contexts, including under raw_spin_locks (see
      gfp.h and commit ab00db21).
      
      The only downside is that saving a stack trace may fail if stackdepot
      runs out of space AND the same stack trace has not been recorded
      before.  I expect this to be unlikely, and a simple experiment
      (booting the kernel) didn't result in any failure to record a stack
      trace from insert_work().
      
      The series includes a few minor fixes to stackdepot that I noticed in
      preparing the series.  It then introduces __stack_depot_save(), which
      exposes the option to force stackdepot to not allocate any memory.
      Finally, KASAN is changed to use the new stackdepot interface and
      provide kasan_record_aux_stack_noalloc(), which is then used by
      workqueue code.
      
      [1] https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
      
      This patch (of 6):
      
      <linux/stackdepot.h> refers to gfp_t, but doesn't include gfp.h.
      
      Fix it by including <linux/gfp.h>.
      
      Link: https://lkml.kernel.org/r/20210913112609.2651084-1-elver@google.com
      Link: https://lkml.kernel.org/r/20210913112609.2651084-2-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Tested-by: Shuah Khan <skhan@linuxfoundation.org>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Walter Wu <walter-zh.wu@mediatek.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
      Cc: Taras Madan <tarasmadan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7857ccdf
    • C
      mm: don't include <linux/dax.h> in <linux/mempolicy.h> · 96c84dde
      Authored by Christoph Hellwig
      Not required at all, and having this causes a huge kernel rebuild as
      soon as something in dax.h changes.
      
      Link: https://lkml.kernel.org/r/20210921082253.1859794-1-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96c84dde
    • V
      mm, slub: change percpu partial accounting from objects to pages · b47291ef
      Authored by Vlastimil Babka
      With CONFIG_SLUB_CPU_PARTIAL enabled, SLUB keeps a percpu list of
      partial slabs that can be promoted to cpu slab when the previous one is
      depleted, without accessing the shared partial list.  A slab can be
      added to this list by 1) refill of an empty list from get_partial_node()
      - once we really have to access the shared partial list, we acquire
      multiple slabs to amortize the cost of locking, and 2) first free to a
      previously full slab - instead of putting the slab on a shared partial
      list, we can more cheaply freeze it and put it on the per-cpu list.
      
      To control how large a percpu partial list can grow for a kmem cache,
      set_cpu_partial() calculates a target number of free objects on each
      cpu's percpu partial list, and this can be also set by the sysfs file
      cpu_partial.
      
      However, the tracking of the actual number of objects is imprecise,
      in order to limit the overhead of cpu X freeing objects to a slab on
      the percpu partial list of cpu Y.  Basically, the percpu partial slabs form a
      single linked list, and when we add a new slab to the list with current
      head "oldpage", we set in the struct page of the slab we're adding:
      
          page->pages = oldpage->pages + 1; // this is precise
          page->pobjects = oldpage->pobjects + (page->objects - page->inuse);
          page->next = oldpage;
      
      Thus the real number of free objects in the slab (objects - inuse) is
      only determined at the moment of adding the slab to the percpu partial
      list, and further freeing doesn't update the pobjects counter nor
      propagate it to the current list head.  As Jann reports [1], this can
      easily lead to large inaccuracies, where the target number of objects
      (up to 30 by default) can translate to the same number of (empty) slab
      pages on the list.  In case 2) above, we put a slab with 1 free object
      on the list, thus only increase page->pobjects by 1, even if there are
      subsequent frees on the same slab.  Jann has noticed this in practice
      and so did we [2] when investigating significant increase of kmemcg
      usage after switching from SLAB to SLUB.
      
      While this is no longer a problem in kmemcg context thanks to the
      accounting rewrite in 5.9, the memory waste is still not ideal and it's
      questionable whether it makes sense to perform free object count based
      control when object counts can easily become so much inaccurate.  So
      this patch converts the accounting to be based on number of pages only
      (which is precise) and removes the page->pobjects field completely.
      This is also ultimately simpler.
      
      To retain the existing set_cpu_partial() heuristic, first calculate the
      target number of objects as previously, but then convert it to target
      number of pages by assuming the pages will be half-filled on average.
      This assumption might obviously also be inaccurate in practice, but
      cannot degrade to actual number of pages being equal to the target
      number of objects.
      
      We could also skip the intermediate step with target number of objects
      and rewrite the heuristic in terms of pages.  However we still have the
      sysfs file cpu_partial which uses number of objects and could break
      existing users if it suddenly becomes number of pages, so this patch
      doesn't do that.
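
      The objects-to-pages conversion then looks roughly like this (a
      sketch; assumes the half-filled-pages heuristic described above):

          /* in set_cpu_partial(), nr_objects computed as before */
          nr_pages = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
          s->cpu_partial_pages = nr_pages;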
      
      In practice, after this patch the heuristics limit the size of percpu
      partial list up to 2 pages.  In case of a reported regression (which
      would mean some workload has benefited from the previous imprecise
      object based counting), we can tune the heuristics to get a better
      compromise within the new scheme, while still avoid the unexpectedly
      long percpu partial lists.
      
      [1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/
      [2] https://lore.kernel.org/all/2f0f46e8-2535-410a-1859-e9cfa4e57c18@suse.cz/
      
      ==========
      Evaluation
      ==========
      
      Mel was kind enough to run v1 through mmtests machinery for netperf
      (localhost) and hackbench and, for most significant results see below.
      So there are some apparent regressions, especially with hackbench, which
      I think ultimately boils down to having shorter percpu partial lists on
      average and some benchmarks benefiting from longer ones.  Monitoring
      slab usage also indicated less memory usage by slab.  Based on that, the
      following patch will bump the defaults to allow longer percpu partial
      lists than after this patch.
      
      However the goal is certainly not such that we would limit the percpu
      partial lists to 30 pages just because previously a specific alloc/free
      pattern could lead to the limit of 30 objects translate to a limit to 30
      pages - that would make little sense.  This is a correctness patch, and
      if a workload benefits from larger lists, the sysfs tuning knobs are
      still there to allow that.
      
      Netperf
      
        2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads per socket), 384GB RAM
        TCP-RR:
          hmean before 127045.79 after 121092.94 (-4.69%, worse)
          stddev before  2634.37 after   1254.08
        UDP-RR:
          hmean before 166985.45 after 160668.94 ( -3.78%, worse)
          stddev before 4059.69 after 1943.63
      
        2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads per socket), 512GB RAM
        TCP-RR:
          hmean before 84173.25 after 76914.72 ( -8.62%, worse)
        UDP-RR:
          hmean before 93571.12 after 96428.69 ( 3.05%, better)
          stddev before 23118.54 after 16828.14
      
        2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads per socket), 64GB RAM
        TCP-RR:
          hmean before 49984.92 after 48922.27 ( -2.13%, worse)
          stddev before 6248.15 after 4740.51
        UDP-RR:
          hmean before 61854.31 after 68761.81 ( 11.17%, better)
          stddev before 4093.54 after 5898.91
      
        other machines - within 2%
      
      Hackbench
      
        (results before and after the patch, negative % means worse)
      
        2-socket AMD EPYC 7713 (64 cores, 128 threads per socket), 256GB RAM
        hackbench-process-sockets
        Amean 	1 	0.5380	0.5583	( -3.78%)
        Amean 	4 	0.7510	0.8150	( -8.52%)
        Amean 	7 	0.7930	0.9533	( -20.22%)
        Amean 	12 	0.7853	1.1313	( -44.06%)
        Amean 	21 	1.1520	1.4993	( -30.15%)
        Amean 	30 	1.6223	1.9237	( -18.57%)
        Amean 	48 	2.6767	2.9903	( -11.72%)
        Amean 	79 	4.0257	5.1150	( -27.06%)
        Amean 	110	5.5193	7.4720	( -35.38%)
        Amean 	141	7.2207	9.9840	( -38.27%)
        Amean 	172	8.4770	12.1963	( -43.88%)
        Amean 	203	9.6473	14.3137	( -48.37%)
        Amean 	234	11.3960	18.7917	( -64.90%)
        Amean 	265	13.9627	22.4607	( -60.86%)
        Amean 	296	14.9163	26.0483	( -74.63%)
      
        hackbench-thread-sockets
        Amean 	1 	0.5597	0.5877	( -5.00%)
        Amean 	4 	0.7913	0.8960	( -13.23%)
        Amean 	7 	0.8190	1.0017	( -22.30%)
        Amean 	12 	0.9560	1.1727	( -22.66%)
        Amean 	21 	1.7587	1.5660	( 10.96%)
        Amean 	30 	2.4477	1.9807	( 19.08%)
        Amean 	48 	3.4573	3.0630	( 11.41%)
        Amean 	79 	4.7903	5.1733	( -8.00%)
        Amean 	110	6.1370	7.4220	( -20.94%)
        Amean 	141	7.5777	9.2617	( -22.22%)
        Amean 	172	9.2280	11.0907	( -20.18%)
        Amean 	203	10.2793	13.3470	( -29.84%)
        Amean 	234	11.2410	17.1070	( -52.18%)
        Amean 	265	12.5970	23.3323	( -85.22%)
        Amean 	296	17.1540	24.2857	( -41.57%)
      
        2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads
        per socket), 384GB RAM
        hackbench-process-sockets
        Amean 	1 	0.5760	0.4793	( 16.78%)
        Amean 	4 	0.9430	0.9707	( -2.93%)
        Amean 	7 	1.5517	1.8843	( -21.44%)
        Amean 	12 	2.4903	2.7267	( -9.49%)
        Amean 	21 	3.9560	4.2877	( -8.38%)
        Amean 	30 	5.4613	5.8343	( -6.83%)
        Amean 	48 	8.5337	9.2937	( -8.91%)
        Amean 	79 	14.0670	15.2630	( -8.50%)
        Amean 	110	19.2253	21.2467	( -10.51%)
        Amean 	141	23.7557	25.8550	( -8.84%)
        Amean 	172	28.4407	29.7603	( -4.64%)
        Amean 	203	33.3407	33.9927	( -1.96%)
        Amean 	234	38.3633	39.1150	( -1.96%)
        Amean 	265	43.4420	43.8470	( -0.93%)
        Amean 	296	48.3680	48.9300	( -1.16%)
      
        hackbench-thread-sockets
        Amean 	1 	0.6080	0.6493	( -6.80%)
        Amean 	4 	1.0000	1.0513	( -5.13%)
        Amean 	7 	1.6607	2.0260	( -22.00%)
        Amean 	12 	2.7637	2.9273	( -5.92%)
        Amean 	21 	5.0613	4.5153	( 10.79%)
        Amean 	30 	6.3340	6.1140	( 3.47%)
        Amean 	48 	9.0567	9.5577	( -5.53%)
        Amean 	79 	14.5657	15.7983	( -8.46%)
        Amean 	110	19.6213	21.6333	( -10.25%)
        Amean 	141	24.1563	26.2697	( -8.75%)
        Amean 	172	28.9687	30.2187	( -4.32%)
        Amean 	203	33.9763	34.6970	( -2.12%)
        Amean 	234	38.8647	39.3207	( -1.17%)
        Amean 	265	44.0813	44.1507	( -0.16%)
        Amean 	296	49.2040	49.4330	( -0.47%)
      
        2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads
        per socket), 512GB RAM
        hackbench-process-sockets
        Amean 	1 	0.5027	0.5017	( 0.20%)
        Amean 	4 	1.1053	1.2033	( -8.87%)
        Amean 	7 	1.8760	2.1820	( -16.31%)
        Amean 	12 	2.9053	3.1810	( -9.49%)
        Amean 	21 	4.6777	4.9920	( -6.72%)
        Amean 	30 	6.5180	6.7827	( -4.06%)
        Amean 	48 	10.0710	10.5227	( -4.48%)
        Amean 	79 	16.4250	17.5053	( -6.58%)
        Amean 	110	22.6203	24.4617	( -8.14%)
        Amean 	141	28.0967	31.0363	( -10.46%)
        Amean 	172	34.4030	36.9233	( -7.33%)
        Amean 	203	40.5933	43.0850	( -6.14%)
        Amean 	234	46.6477	48.7220	( -4.45%)
        Amean 	265	53.0530	53.9597	( -1.71%)
        Amean 	296	59.2760	59.9213	( -1.09%)
      
        hackbench-thread-sockets
        Amean 	1 	0.5363	0.5330	( 0.62%)
        Amean 	4 	1.1647	1.2157	( -4.38%)
        Amean 	7 	1.9237	2.2833	( -18.70%)
        Amean 	12 	2.9943	3.3110	( -10.58%)
        Amean 	21 	4.9987	5.1880	( -3.79%)
        Amean 	30 	6.7583	7.0043	( -3.64%)
        Amean 	48 	10.4547	10.8353	( -3.64%)
        Amean 	79 	16.6707	17.6790	( -6.05%)
        Amean 	110	22.8207	24.4403	( -7.10%)
        Amean 	141	28.7090	31.0533	( -8.17%)
        Amean 	172	34.9387	36.8260	( -5.40%)
        Amean 	203	41.1567	43.0450	( -4.59%)
        Amean 	234	47.3790	48.5307	( -2.43%)
        Amean 	265	53.9543	54.6987	( -1.38%)
        Amean 	296	60.0820	60.2163	( -0.22%)
      
        1-socket Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz (4 cores, 8 threads),
        32 GB RAM
        hackbench-process-sockets
        Amean 	1 	1.4760	1.5773	( -6.87%)
        Amean 	3 	3.9370	4.0910	( -3.91%)
        Amean 	5 	6.6797	6.9357	( -3.83%)
        Amean 	7 	9.3367	9.7150	( -4.05%)
        Amean 	12	15.7627	16.1400	( -2.39%)
        Amean 	18	23.5360	23.6890	( -0.65%)
        Amean 	24	31.0663	31.3137	( -0.80%)
        Amean 	30	38.7283	39.0037	( -0.71%)
        Amean 	32	41.3417	41.6097	( -0.65%)
      
        hackbench-thread-sockets
        Amean 	1 	1.5250	1.6043	( -5.20%)
        Amean 	3 	4.0897	4.2603	( -4.17%)
        Amean 	5 	6.7760	7.0933	( -4.68%)
        Amean 	7 	9.4817	9.9157	( -4.58%)
        Amean 	12	15.9610	16.3937	( -2.71%)
        Amean 	18	23.9543	24.3417	( -1.62%)
        Amean 	24	31.4400	31.7217	( -0.90%)
        Amean 	30	39.2457	39.5467	( -0.77%)
        Amean 	32	41.8267	42.1230	( -0.71%)
      
        2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads
        per socket), 64GB RAM
        hackbench-process-sockets
        Amean 	1 	1.0347	1.0880	( -5.15%)
        Amean 	4 	1.7267	1.8527	( -7.30%)
        Amean 	7 	2.6707	2.8110	( -5.25%)
        Amean 	12 	4.1617	4.3383	( -4.25%)
        Amean 	21 	7.0070	7.2600	( -3.61%)
        Amean 	30 	9.9187	10.2397	( -3.24%)
        Amean 	48 	15.6710	16.3923	( -4.60%)
        Amean 	79 	24.7743	26.1247	( -5.45%)
        Amean 	110	34.3000	35.9307	( -4.75%)
        Amean 	141	44.2043	44.8010	( -1.35%)
        Amean 	172	54.2430	54.7260	( -0.89%)
        Amean 	192	60.6557	60.9777	( -0.53%)
      
        hackbench-thread-sockets
        Amean 	1 	1.0610	1.1353	( -7.01%)
        Amean 	4 	1.7543	1.9140	( -9.10%)
        Amean 	7 	2.7840	2.9573	( -6.23%)
        Amean 	12 	4.3813	4.4937	( -2.56%)
        Amean 	21 	7.3460	7.5350	( -2.57%)
        Amean 	30 	10.2313	10.5190	( -2.81%)
        Amean 	48 	15.9700	16.5940	( -3.91%)
        Amean 	79 	25.3973	26.6637	( -4.99%)
        Amean 	110	35.1087	36.4797	( -3.91%)
        Amean 	141	45.8220	46.3053	( -1.05%)
        Amean 	172	55.4917	55.7320	( -0.43%)
        Amean 	192	62.7490	62.5410	( 0.33%)
      
      Link: https://lkml.kernel.org/r/20211012134651.11258-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Jann Horn <jannh@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b47291ef
    • M
      mm: move kvmalloc-related functions to slab.h · 8587ca6f
      Authored by Matthew Wilcox (Oracle)
      Not all files in the kernel should include mm.h.  Migrating callers from
      kmalloc to kvmalloc is easier if the kvmalloc functions are in slab.h.
      
      [akpm@linux-foundation.org: move the new kvrealloc() also]
      [akpm@linux-foundation.org: drivers/hwmon/occ/p9_sbe.c needs slab.h]
      
      Link: https://lkml.kernel.org/r/20210622215757.3525604-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8587ca6f
  2. 29 Oct 2021, 1 commit
    • Y
      mm: filemap: check if THP has hwpoisoned subpage for PMD page fault · eac96c3e
      Authored by Yang Shi
      When handling a shmem page fault, a THP with a corrupted subpage can
      be PMD mapped if certain conditions are satisfied.  But the kernel is
      supposed to send a SIGBUS when trying to map a hwpoisoned page.
      
      There are two paths which may do PMD map: fault around and regular
      fault.
      
      Before commit f9ce0be7 ("mm: Cleanup faultaround and finish_fault()
      codepaths") things were even worse in the fault-around path: the THP
      could be PMD mapped as long as the VMA fit, regardless of which
      subpage was accessed and corrupted.  After that commit, the THP can
      be PMD mapped as long as the head page is not corrupted.
      
      In the regular fault path the THP could be PMD mapped as long as the
      corrupted page is not accessed and the VMA fits.
      
      This loophole could be fixed by iterating every subpage to check if any
      of them is hwpoisoned or not, but it is somewhat costly in page fault
      path.
      
      So introduce a new page flag called HasHWPoisoned on the first tail
      page.  It indicates the THP has hwpoisoned subpage(s).  It is set if any
      subpage of THP is found hwpoisoned by memory failure and after the
      refcount is bumped successfully, then cleared when the THP is freed or
      split.
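
      The flag lives on the first tail page and is tested before
      PMD-mapping the THP, roughly (a sketch):

          /* include/linux/page-flags.h */
          PAGEFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)

          /* in the PMD fault path, before installing the mapping */
          if (unlikely(PageHasHWPoisoned(compound_head(page))))
                  return VM_FAULT_FALLBACK;  /* map by PTEs; poisoned PTE => SIGBUS */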
      
      The soft offline path doesn't need this, since the soft offline
      handler just marks a subpage hwpoisoned when the subpage is migrated
      successfully.  But shmem THPs are not split and migrated in that path
      at all.
      
      Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
      Fixes: 800d8c63 ("shmem: add huge pages support")
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eac96c3e
  3. 27 Oct 2021, 3 commits
  4. 23 Oct 2021, 1 commit
  5. 21 Oct 2021, 1 commit
  6. 19 Oct 2021, 6 commits
    • S
      mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem() · 79f9bc58
      Authored by Sean Christopherson
      Check for a NULL page->mapping before dereferencing the mapping in
      page_is_secretmem(), as the page's mapping can be nullified while gup()
      is running, e.g.  by reclaim or truncation.
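
      The fix amounts to snapshotting and NULL-checking the mapping before
      the dereference, roughly (a sketch):

          static inline bool page_is_secretmem(struct page *page)
          {
                  struct address_space *mapping;

                  if (PageCompound(page) || !PageLRU(page))
                          return false;

                  mapping = (struct address_space *)
                          ((unsigned long)READ_ONCE(page->mapping) &
                           ~PAGE_MAPPING_FLAGS);
                  if (!mapping)   /* nullified by reclaim or truncation */
                          return false;

                  return mapping->a_ops == &secretmem_aops;
          }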
      
        BUG: kernel NULL pointer dereference, address: 0000000000000068
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 6 PID: 4173897 Comm: CPU 3/KVM Tainted: G        W
        RIP: 0010:internal_get_user_pages_fast+0x621/0x9d0
        Code: <48> 81 7a 68 80 08 04 bc 0f 85 21 ff ff 8 89 c7 be
        RSP: 0018:ffffaa90087679b0 EFLAGS: 00010046
        RAX: ffffe3f37905b900 RBX: 00007f2dd561e000 RCX: ffffe3f37905b934
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffe3f37905b900
        ...
        CR2: 0000000000000068 CR3: 00000004c5898003 CR4: 00000000001726e0
        Call Trace:
         get_user_pages_fast_only+0x13/0x20
         hva_to_pfn+0xa9/0x3e0
         try_async_pf+0xa1/0x270
         direct_page_fault+0x113/0xad0
         kvm_mmu_page_fault+0x69/0x680
         vmx_handle_exit+0xe1/0x5d0
         kvm_arch_vcpu_ioctl_run+0xd81/0x1c70
         kvm_vcpu_ioctl+0x267/0x670
         __x64_sys_ioctl+0x83/0xa0
         do_syscall_64+0x56/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Link: https://lkml.kernel.org/r/20211007231502.3552715-1-seanjc@google.com
      Fixes: 1507f512 ("mm: introduce memfd_secret system call to create "secret" memory areas")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reported-by: Darrick J. Wong <djwong@kernel.org>
      Reported-by: Stephen <stephenackerman16@gmail.com>
      Tested-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79f9bc58
    • L
      elfcore: correct reference to CONFIG_UML · b0e90128
      Authored by Lukas Bulwahn
      Commit 6e7b64b9 ("elfcore: fix building with clang") introduces
      special handling for two architectures, ia64 and User Mode Linux.
      However, it uses the wrong name, CONFIG_UM, for the intended Kconfig
      symbol for User Mode Linux.
      
      Although the directory for User Mode Linux is ./arch/um, the Kconfig
      symbol for this architecture is called CONFIG_UML.
      
      Luckily, ./scripts/checkkconfigsymbols.py warns on non-existing configs:
      
        UM
        Referencing files: include/linux/elfcore.h
        Similar symbols: UML, NUMA
      
      Correct the name of the config to the intended one.
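
      The fix itself is a one-word change in the guard (illustrative; the
      exact condition in elfcore.h may differ):

          -#if defined(CONFIG_UM) || defined(CONFIG_IA64)
          +#if defined(CONFIG_UML) || defined(CONFIG_IA64)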
      
      [akpm@linux-foundation.org: fix um/x86_64, per Catalin]
        Link: https://lkml.kernel.org/r/20211006181119.2851441-1-catalin.marinas@arm.com
        Link: https://lkml.kernel.org/r/YV6pejGzLy5ppEpt@arm.com
      
      Link: https://lkml.kernel.org/r/20211006082209.417-1-lukas.bulwahn@gmail.com
      Fixes: 6e7b64b9 ("elfcore: fix building with clang")
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Barret Rhoden <brho@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0e90128
    • H
      mm/migrate: fix CPUHP state to update node demotion order · a6a0251c
      Authored by Huang Ying
      The node demotion order needs to be updated during CPU hotplug,
      because whether a NUMA node has CPUs may influence the order.  The
      update function should be called during CPU online/offline after the
      node_states[N_CPU] has been updated.  That is done in
      CPUHP_AP_ONLINE_DYN during CPU online and in CPUHP_MM_VMSTAT_DEAD during
      CPU offline.  But in commit 884a6e5d ("mm/migrate: update node
      demotion order on hotplug events"), the function to update node demotion
      order is called in CPUHP_AP_ONLINE_DYN during CPU online/offline.  This
      doesn't satisfy the order requirement.
      
      For example, there are 4 CPUs (P0, P1, P2, P3) in 2 sockets (P0, P1 in S0
      and P2, P3 in S1), the demotion order is
      
       - S0 -> NUMA_NO_NODE
       - S1 -> NUMA_NO_NODE
      
      After P2 and P3 are offlined, S1 has no CPU, so the demotion order
      should have been changed to
      
       - S0 -> S1
       - S1 -> NUMA_NO_NODE
      
      but it isn't changed, because the order updating callback for CPU
      hotplug doesn't see the new nodemask.  After that, if P1 is offlined,
      the demotion order is changed to the expected order as above.
      
      So in this patch, we added CPUHP_AP_MM_DEMOTION_ONLINE and
      CPUHP_MM_DEMOTION_DEAD to be called after CPUHP_AP_ONLINE_DYN and
      CPUHP_MM_VMSTAT_DEAD during CPU online and offline, and register the
      update function on them.
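
      Registration then looks roughly like this (a sketch; the callback
      names are as implied by the description):

          /* mm/migrate.c, during init */
          cpuhp_setup_state_nocalls(CPUHP_MM_DEMOTION_DEAD, "mm/demotion:offline",
                                    NULL, migration_offline_cpu);
          cpuhp_setup_state(CPUHP_AP_MM_DEMOTION_ONLINE, "mm/demotion:online",
                            migration_online_cpu, NULL);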
      
      Link: https://lkml.kernel.org/r/20210929060351.7293-1-ying.huang@intel.com
      Fixes: 884a6e5d ("mm/migrate: update node demotion order on hotplug events")
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6a0251c
    • D
      mm/migrate: add CPU hotplug to demotion #ifdef · 76af6a05
Dave Hansen committed
      Once upon a time, the node demotion updates were driven solely by memory
      hotplug events.  But now, there are handlers for both CPU and memory
      hotplug.
      
      However, the #ifdef around the code checks only memory hotplug.  A
      system that has HOTPLUG_CPU=y but MEMORY_HOTPLUG=n would miss CPU
      hotplug events.
      
      Update the #ifdef around the common code.  Add memory and CPU-specific
      #ifdefs for their handlers.  These memory/CPU #ifdefs avoid unused
      function warnings when their Kconfig option is off.
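A rough sketch of the resulting structure (simplified, assuming the guard
symbols from the description rather than the exact upstream hunk):

  #ifdef CONFIG_HOTPLUG_CPU
  /* common demotion-order update code, e.g. set_migration_target_nodes() */

  #ifdef CONFIG_MEMORY_HOTPLUG
  /* memory hotplug notifier, only built when memory hotplug is enabled */
  static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
                                                   unsigned long action,
                                                   void *arg);
  #endif

  /* CPU hotplug callbacks */
  static int migration_online_cpu(unsigned int cpu);
  static int migration_offline_cpu(unsigned int cpu);
  #endif /* CONFIG_HOTPLUG_CPU */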
      
      [arnd@arndb.de: rework hotplug_memory_notifier() stub]
        Link: https://lkml.kernel.org/r/20211013144029.2154629-1-arnd@kernel.org
      
      Link: https://lkml.kernel.org/r/20210924161255.E5FE8F7E@davehans-spike.ostc.intel.com
      Fixes: 884a6e5d ("mm/migrate: update node demotion order on hotplug events")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76af6a05
    • S
      tracing: Have all levels of checks prevent recursion · ed65df63
Steven Rostedt (VMware) committed
While writing an email explaining the "bit = 0" logic for a discussion on
making ftrace_test_recursion_trylock() disable preemption, I discovered a
path that makes the "not do the logic if bit is zero" trick unsafe.
      
The recursion logic is done in hot paths like the function tracer, so
any code executed there causes noticeable overhead, and tricks are used
to limit the amount of code executed.  This includes the recursion
testing logic.
      
      Having recursion testing is important, as there are many paths that can
      end up in an infinite recursion cycle when tracing every function in the
      kernel. Thus protection is needed to prevent that from happening.
      
Because it is OK to recurse due to different running context levels (e.g.
an interrupt preempts a trace, and then a trace occurs in the interrupt
handler), a set of bits is used to know which context one is in (normal,
softirq, irq and NMI).  If a recursion occurs at the same level, it is
prevented*.
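As a standalone illustration of the per-context scheme (a simplified,
compilable model with invented names, not the kernel's implementation):

  #include <stdio.h>

  enum ctx { CTX_NORMAL, CTX_SOFTIRQ, CTX_IRQ, CTX_NMI };

  static unsigned long recursion;          /* per-task in the kernel */

  static int recursion_trylock(enum ctx c)
  {
          if (recursion & (1UL << c))
                  return -1;               /* same level: recursion, refuse */
          recursion |= 1UL << c;
          return c;
  }

  static void recursion_unlock(int bit)
  {
          recursion &= ~(1UL << bit);
  }

  int main(void)
  {
          int a = recursion_trylock(CTX_NORMAL);  /*  0: OK            */
          int b = recursion_trylock(CTX_IRQ);     /*  2: OK, new level */
          int c = recursion_trylock(CTX_IRQ);     /* -1: refused       */

          printf("%d %d %d\n", a, b, c);
          recursion_unlock(b);
          recursion_unlock(a);
          return 0;
  }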
      
Then there are infrastructure levels of recursion as well.  When more than
one callback is attached to the same traced function, a loop function is
called to iterate over all the callbacks.  Both the callbacks and the
loop function have recursion protection.  The callbacks use
ftrace_test_recursion_trylock(), which has a "function" set of context
bits to test, and the loop function calls the internal
trace_test_and_set_recursion() directly, with an "internal" set of bits.
      
If an architecture does not implement all the features supported by
ftrace, then the callbacks are never called directly; the loop function
is called instead, and it implements the features of ftrace.
      
Since both the loop function and the callbacks do recursion protection, it
seemed unnecessary to do it in both locations.  Thus, a trick was made
to have the internal set of recursion bits at a more significant bit
location than the function bits.  Then, if any of the higher bits were set,
the logic of the function bits could be skipped, as any new recursion
would first have to go through the loop function.
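A standalone model of that shortcut (simplified; the kernel's bit numbering
and helpers differ): put the function bits at bits 1-4, one per context, and
the internal bits at bits 5-8.  The now-removed skip then looks like:

  /* Returns the claimed bit, 0 when the test was skipped because an
   * internal (more significant) bit was already set, or -1 on
   * same-level recursion. */
  static int trylock_with_shortcut(unsigned long *val, int ctx /* 1..4 */)
  {
          if (*val >> 5)
                  return 0;        /* the unsafe "bit == 0" shortcut */
          if (*val & (1UL << ctx))
                  return -1;
          *val |= 1UL << ctx;
          return ctx;
  }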
      
This is true for architectures that do not support all the ftrace
features, because all functions being traced must first go through the
loop function before reaching the callbacks.  But it is not true for
architectures that support all the ftrace features: the loop function
may be entered because two callbacks are attached to the same function,
yet a callback may then hit a function that no other callback traces,
and that function will call the callback directly.
      
      i.e.
      
       traced_function_1: [ more than one callback tracing it ]
         call loop_func
      
       loop_func:
         trace_recursion set internal bit
         call callback
      
       callback:
         trace_recursion [ skipped because internal bit is set, return 0 ]
         call traced_function_2
      
       traced_function_2: [ only traced by above callback ]
         call callback
      
       callback:
         trace_recursion [ skipped because internal bit is set, return 0 ]
         call traced_function_2
      
       [ wash, rinse, repeat, BOOM! out of shampoo! ]
      
      Thus, the "bit == 0 skip" trick is not safe, unless the loop function is
      call for all functions.
      
Since we want to encourage architectures to implement all ftrace features,
having the remaining architectures slow down due to this extra logic may
encourage their maintainers to update to the latest ftrace features.  And
because this logic is only safe for them anyway, remove it completely.
      
[*] There is one layer of recursion that is allowed, and that is to allow
    for the transition between interrupt contexts (normal -> softirq ->
    irq -> NMI), because a trace may occur before the context update is
    visible to the trace recursion logic.
      
      Link: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d994@linux.alibaba.com/
      Link: https://lkml.kernel.org/r/20211018154412.09fcad3c@gandalf.local.home
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Jisheng Zhang <jszhang@kernel.org>
Cc: 王贇 <yun.wang@linux.alibaba.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: edc15caf ("tracing: Avoid unnecessary multiple recursion checks")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      ed65df63
    • E
      ucounts: Fix signal ucount refcounting · 15bc01ef
Eric W. Biederman committed
In commit fda31c50 ("signal: avoid double atomic counter
increments for user accounting") Linus made a clever optimization to
how rlimits and struct user_struct accounting interact.  Unfortunately
that optimization does not work in the obvious way when moved to nested
rlimits.  The problem is that the last decrement of the per-user-namespace,
per-user sigpending counter might also be the last decrement of the
sigpending counter in the parent user namespace.  Which means that
simply freeing the leaf ucount in __free_sigqueue is not enough.
      
      Maintain the optimization and handle the tricky cases by introducing
      inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.
      
By moving the entire optimization into functions that perform all of
the work, it becomes possible to ensure that every level is handled
properly.
      
The new function inc_rlimit_get_ucounts returns 0 on failure to
increment the ucount.  This is different from inc_rlimit_ucounts, which
increments the ucounts and returns LONG_MAX if the ucount counter has
exceeded its maximum or wrapped (to indicate the counter needs to be
decremented).
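A hedged sketch of how a caller pairs the new helpers, modeled on the
sigqueue paths (details simplified):

  struct ucounts *ucounts = task_ucounts(t);

  /* Bumps the counter at every namespace level and takes a reference,
   * or returns 0 (taking nothing) if any level is over its limit. */
  if (!inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING))
          return NULL;

  /* ... queue and later deliver the signal ... */

  /* Undoes every level and drops the reference last, so the final
   * decrement in a parent namespace is also handled safely. */
  dec_rlimit_put_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);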
      
I wish we had a single user to account all pending signals to across
all of the threads of a process, so that this complexity was not
necessary.
      
      Cc: stable@vger.kernel.org
      Fixes: d6469690 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133
Link: https://lkml.kernel.org/r/87fssytizw.fsf_-_@disp2133
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Tested-by: Rune Kleveland <rune.kleveland@infomedia.dk>
Tested-by: Yu Zhao <yuzhao@google.com>
Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      15bc01ef
  7. 16 Oct, 2021 1 commit
  8. 14 Oct, 2021 1 commit
    • M
      spi: Fix deadlock when adding SPI controllers on SPI buses · 6098475d
Mark Brown committed
      Currently we have a global spi_add_lock which we take when adding new
      devices so that we can check that we're not trying to reuse a chip
select that's already controlled.  This means that if the SPI device is
itself a SPI controller and triggers the instantiation of further SPI
devices, we trigger a deadlock as we try to register and instantiate
those devices while in the process of doing so for the parent controller,
and hence while already holding the global spi_add_lock.  Since we only
care about concurrency within a single SPI bus, move the lock to be per
controller, avoiding the deadlock.
      
      This can be easily triggered in the case of spi-mux.
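A hedged sketch of the shape of the fix (simplified from the actual
patch; the lock would be initialized in spi_alloc_controller()):

  struct spi_controller {
          /* ... */
          struct mutex add_lock;  /* per bus, replaces global spi_add_lock */
  };

  static int spi_add_device(struct spi_device *spi)
  {
          struct spi_controller *ctlr = spi->controller;
          int status;

          /* Only this controller is locked, so a child SPI controller
           * instantiated from here takes its own lock: no deadlock. */
          mutex_lock(&ctlr->add_lock);
          status = __spi_add_device(spi);  /* chip-select reuse check etc. */
          mutex_unlock(&ctlr->add_lock);
          return status;
  }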
Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Mark Brown <broonie@kernel.org>
      6098475d
  9. 13 Oct, 2021 3 commits
    • V
      net: dsa: tag_ocelot_8021q: break circular dependency with ocelot switch lib · 49f885b2
Vladimir Oltean committed
      Michael reported that when using the "ocelot-8021q" tagging protocol,
      the switch driver module must be manually loaded before the tagging
      protocol can be loaded/is available.
      
      This appears to be the same problem described here:
      https://lore.kernel.org/netdev/20210908220834.d7gmtnwrorhharna@skbuf/
      where due to the fact that DSA tagging protocols make use of symbols
      exported by the switch drivers, circular dependencies appear and this
      breaks module autoloading.
      
The ocelot_8021q driver needs the ocelot_can_inject() and
ocelot_port_inject_frame() functions from the switch library.  Previously
the wrong approach was taken to solve that dependency: shims were
provided for the case where the ocelot switch library was compiled out,
but that turns out to be insufficient, because the dependency when the
switch lib _is_ compiled in is problematic too.
      
      We cannot declare ocelot_can_inject() and ocelot_port_inject_frame() as
      static inline functions, because these access I/O functions like
      __ocelot_write_ix() which is called by ocelot_write_rix(). Making those
      static inline basically means exposing the whole guts of the ocelot
      switch library, not ideal...
      
We already have one tagging protocol driver which calls into the switch
driver during xmit without using any exported symbol: sja1105_defer_xmit.
We can do the same thing here: create a kthread worker and one work item
per skb, and let the switch driver itself do the register accesses to
send the skb and then consume it.
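A hedged sketch of that pattern (names are illustrative, not the exact
upstream ones; the worker would be created when the tagger connects to
the switch driver):

  static struct kthread_worker *xmit_worker;

  struct deferred_xmit_work {
          struct dsa_port *dp;
          struct sk_buff *skb;
          struct kthread_work work;
  };

  static void xmit_work_fn(struct kthread_work *work)
  {
          struct deferred_xmit_work *w =
                  container_of(work, struct deferred_xmit_work, work);

          /* The switch driver performs the register accesses to inject
           * the frame, then consumes the skb; the tagger needs no
           * exported symbol for this. */
  }

  static struct sk_buff *ocelot_defer_xmit(struct dsa_port *dp,
                                           struct sk_buff *skb)
  {
          struct deferred_xmit_work *w = kzalloc(sizeof(*w), GFP_ATOMIC);

          if (!w) {
                  kfree_skb(skb);
                  return NULL;
          }
          w->dp = dp;
          w->skb = skb;
          kthread_init_work(&w->work, xmit_work_fn);
          kthread_queue_work(xmit_worker, &w->work);

          return NULL;    /* the skb is consumed by the work item */
  }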
      
      Fixes: 0a6f17c6 ("net: dsa: tag_ocelot_8021q: add support for PTP timestamping")
Reported-by: Michael Walle <michael@walle.cc>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      49f885b2
    • V
      net: dsa: tag_ocelot: break circular dependency with ocelot switch lib driver · deab6b1c
Vladimir Oltean committed
      As explained here:
      https://lore.kernel.org/netdev/20210908220834.d7gmtnwrorhharna@skbuf/
      DSA tagging protocol drivers cannot depend on symbols exported by switch
      drivers, because this creates a circular dependency that breaks module
      autoloading.
      
      The tag_ocelot.c file depends on the ocelot_ptp_rew_op() function
      exported by the common ocelot switch lib. This function looks at
      OCELOT_SKB_CB(skb) and computes how to populate the REW_OP field of the
      DSA tag, for PTP timestamping (the command: one-step/two-step, and the
      TX timestamp identifier).
      
None of that requires deep insight into the driver; it is quite
stateless, as it only depends upon skb->cb.  So let's make it a
static inline function and put it in include/linux/dsa/ocelot.h, a
file that, despite its name, is used by the ocelot switch driver for
populating the injection header too - since commit 40d3f295 ("net:
mscc: ocelot: use common tag parsing code with DSA").
      
      With that function declared as static inline, its body is expanded
      inside each call site, so the dependency is broken and the DSA tagger
      can be built without the switch library, upon which the felix driver
      depends.
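The resulting helper, close to (but treat it as a sketch of) the upstream
body in include/linux/dsa/ocelot.h:

  static inline u32 ocelot_ptp_rew_op(struct sk_buff *skb)
  {
          struct sk_buff *clone = OCELOT_SKB_CB(skb)->clone;
          u8 ptp_cmd = OCELOT_SKB_CB(skb)->ptp_cmd;
          u32 rew_op = 0;

          /* Stateless: everything needed lives in skb->cb, so no
           * switch driver symbol is required at the call site. */
          if (ptp_cmd == IFH_REW_OP_TWO_STEP_PTP && clone)
                  rew_op = ptp_cmd | (OCELOT_SKB_CB(clone)->ts_id << 3);
          else if (ptp_cmd == IFH_REW_OP_ORIGIN_PTP)
                  rew_op = ptp_cmd;

          return rew_op;
  }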
      
      Fixes: 39e5308b ("net: mscc: ocelot: support PTP Sync one-step timestamping")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      deab6b1c
    • V
      net: dsa: sja1105: break dependency between dsa_port_is_sja1105 and switch driver · 4ac0567e
Vladimir Oltean committed
It's nice to be able to test a tagging protocol with dsa_loop, but not
at the cost of losing the ability to build the tagging protocol and
switch driver as modules, because as things stand, there is a circular
dependency between the two.  Tagging protocol drivers cannot depend on
switch drivers; that is a hard fact.
      
      The reasoning behind the blamed patch was that accessing dp->priv should
      first make sure that the structure behind that pointer is what we really
      think it is.
      
      Currently the "sja1105" and "sja1110" tagging protocols only operate
      with the sja1105 switch driver, just like any other tagging protocol and
      switch combination. The only way to mix and match them is by modifying
      the code, and this applies to dsa_loop as well (by default that uses
      DSA_TAG_PROTO_NONE). So while in principle there is an issue, in
      practice there isn't one.
      
      Until we extend dsa_loop to allow user space configuration, treat the
      problem as a non-issue and just say that DSA ports found by tag_sja1105
      are always sja1105 ports, which is in fact true. But keep the
      dsa_port_is_sja1105 function so that it's easy to patch it during
      testing, and rely on dead code elimination.
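A sketch of the shape of the fix: the helper stays as an easy patch point,
but it no longer needs any symbol from the switch driver:

  static inline bool dsa_port_is_sja1105(struct dsa_port *dp)
  {
          /* Always true for ports found by tag_sja1105; patch this
           * by hand when experimenting with dsa_loop. */
          return true;
  }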
      
      Fixes: 994d2cbb ("net: dsa: tag_sja1105: be dsa_loop-safe")
Link: https://lore.kernel.org/netdev/20210908220834.d7gmtnwrorhharna@skbuf/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      4ac0567e