1. 05 May 2020, 4 commits
  2. 22 April 2020, 1 commit
  3. 21 April 2020, 5 commits
  4. 11 April 2020, 12 commits
    • change email address for Pali Rohár · 149ed3d4
      Pali Rohár authored
      For security reasons I stopped using my gmail account, and my kernel.org
      address is now an up-to-date alias to my personal address.
      
      People periodically send me email at addresses they found in the source
      code of drivers, so this change reflects where people can actually
      contact me.
      
      [ Added .mailmap entry as per Joe Perches - Linus ]
      Signed-off-by: Pali Rohár <pali@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joe Perches <joe@perches.com>
      Link: http://lkml.kernel.org/r/20200307104237.8199-1-pali@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: add pgprot_t to mhp_params · bfeb022f
      Logan Gunthorpe authored
      devm_memremap_pages() is currently used by the PCI P2PDMA code to create
      struct page mappings for IO memory.  At present, these mappings are
      created with PAGE_KERNEL, which implies setting the PAT bits to WB.
      However, on x86, an MTRR register will typically override this and force
      the cache type to UC-.  In case firmware doesn't set this register, the
      mapping is effectively WB, and accessing it will typically result in a
      machine check exception.
      
      Other architectures are currently unlikely to function correctly, since
      they have no MTRR registers to fall back on.
      
      To solve this, provide a way to specify the pgprot value explicitly to
      arch_add_memory().
      
      Of the architectures that support MEMORY_HOTPLUG, x86_64 and arm64 need
      only a simple change to pass the pgprot_t down to their respective
      functions which set up the page tables.  For x86_32, set the page tables
      explicitly using _set_memory_prot(), since they are already mapped.
      
      For ia64, s390 and sh, reject anything but PAGE_KERNEL settings -- this
      should be fine, for now, since these architectures don't support
      ZONE_DEVICE.
      
      A check is also added in __add_pages() to ensure the pgprot parameter
      was set for all architectures.
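      
      A minimal C sketch of the resulting interface, following the description
      above (the field layout and the exact form of the check are assumptions,
      not the verbatim kernel diff):
      
        /* Extended hotplug parameters now carry an explicit pgprot. */
        struct mhp_params {
                struct vmem_altmap *altmap;
                pgprot_t pgprot;        /* caching mode for the new mapping */
        };
        
        int __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
                        struct mhp_params *params)
        {
                /* Reject callers that did not set a pgprot. */
                if (WARN_ON_ONCE(!pgprot_val(params->pgprot)))
                        return -EINVAL;
        
                /* ... create the mappings using params->pgprot ... */
                return 0;
        }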
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: rename mhp_restrictions to mhp_params · f5637d3b
      Logan Gunthorpe authored
      The mhp_restrictions struct no longer specifies anything resembling a
      restriction, so rename it to mhp_params, since it is a list of extended
      parameters.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-3-logang@deltatee.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: drop the flags field from struct mhp_restrictions · 96c6b598
      Logan Gunthorpe authored
      Patch series "Allow setting caching mode in arch_add_memory() for
      P2PDMA", v4.
      
      Currently, the page tables created using memremap_pages() are always
      created with the PAGE_KERNEL caching mode.  However, the P2PDMA code is
      creating pages for PCI BAR memory, which should never be accessed
      through the cache and should instead use either WC or UC.  This still
      works in most cases, on x86, because the MTRR registers typically
      override the caching settings in the page tables for all of the IO
      memory, making it UC-.  However, this tends not to work so well on
      other architectures or on some rare x86 machines whose firmware does
      not set up the MTRR registers this way.
      
      Instead, this series changes arch_add_memory() to take the pgprot
      required by the mapping, which allows us to explicitly set page-table
      entries for P2PDMA memory to UC.
      
      This change is pretty routine for most of the architectures: x86_64,
      arm64 and powerpc simply need to thread the pgprot through to where the
      page tables are set up.  x86_32 unfortunately sets up its page tables
      at boot, so it must use _set_memory_prot() to change their caching
      mode.  ia64, s390 and sh don't appear to have an easy way to change the
      page tables, so, for now at least, we just return -EINVAL on such
      mappings; they will not support P2PDMA memory until that work is done.
      This should be fine, as they don't yet support ZONE_DEVICE.
      
      This patch (of 7):
      
      The flags field is not used anywhere, so remove it from the structure.
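      
      As a sketch, the struct before this patch (the altmap member is taken
      from the surrounding series context and is an assumption here):
      
        struct mhp_restrictions {
                unsigned long flags;            /* never read; dropped here */
                struct vmem_altmap *altmap;     /* remaining member */
        };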
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-2-logang@deltatee.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/special: create generic fallbacks for pte_special() and pte_mkspecial() · 78e7c5af
      Anshuman Khandual authored
      Currently there are many platforms that don't enable
      ARCH_HAS_PTE_SPECIAL but are required to define quite similar fallback
      stubs for special page table entry helpers such as pte_special() and
      pte_mkspecial(), because these get built into generic MM without a
      config check.  This creates two generic fallback stub definitions for
      these helpers, eliminating much code duplication.
      
      The mips platform has a special case where the visibility of
      pte_special() and pte_mkspecial() is wider than what
      ARCH_HAS_PTE_SPECIAL enablement requires.  This restricts that symbol
      visibility in order to avoid the redefinitions, and the subsequent
      build failure, that the new generic stubs would otherwise expose.  The
      arm platform's set_pte_at() definition needs to be moved into a C file
      just to prevent a build failure.
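      
      The generic fallbacks are, in essence (a sketch consistent with the
      description above; their exact placement in the headers is an
      assumption):
      
        #ifndef CONFIG_ARCH_HAS_PTE_SPECIAL
        static inline int pte_special(pte_t pte)
        {
                return 0;       /* no PTE is "special" on these platforms */
        }
        
        static inline pte_t pte_mkspecial(pte_t pte)
        {
                return pte;     /* nothing to mark; return the PTE unchanged */
        }
        #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */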
      
      [anshuman.khandual@arm.com: use defined(CONFIG_ARCH_HAS_PTE_SPECIAL) in mips per Thomas]
        Link: http://lkml.kernel.org/r/1583851924-21603-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Guo Ren <guoren@kernel.org>			[csky]
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Acked-by: Stafford Horne <shorne@gmail.com>		[openrisc]
      Acked-by: Helge Deller <deller@gmx.de>			[parisc]
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Sam Creasey <sammy@sammy.net>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Link: http://lkml.kernel.org/r/1583802551-15406-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vma: introduce VM_ACCESS_FLAGS · 6cb4d9a2
      Anshuman Khandual authored
      There are many places where all basic VMA access flags (read, write,
      exec) are initialized or checked against as a group.  One such example
      is during page faults.  The existing vma_is_accessible() wrapper
      already creates the notion of VMA accessibility as a group of access
      permissions.
      
      Hence, let's just create VM_ACCESS_FLAGS (VM_READ|VM_WRITE|VM_EXEC),
      which will not only reduce code duplication but also extend the VMA
      accessibility concept in general.
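      
      The macro and a typical use, as a sketch (the fault-path check is
      illustrative rather than a specific call site):
      
        #define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC)
        
        /* e.g. in an architecture's page fault handler: */
        if (!(vma->vm_flags & VM_ACCESS_FLAGS))
                goto bad_area;          /* VMA permits no access at all */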
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rob Springer <rspringer@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Link: http://lkml.kernel.org/r/1583391014-8170-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vma: define a default value for VM_DATA_DEFAULT_FLAGS · c62da0c3
      Anshuman Khandual authored
      There are many platforms with exactly the same value for
      VM_DATA_DEFAULT_FLAGS.  This creates a default value for
      VM_DATA_DEFAULT_FLAGS, in line with the existing
      VM_STACK_DEFAULT_FLAGS.  While here, also define some more macros with
      standard VMA access flag combinations that are used frequently across
      many platforms.  Apart from simplification, this reduces code
      duplication as well.
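      
      A sketch of the shape this takes (the combination-macro name follows
      the description above, but its exact definition here is an assumption):
      
        /* A standard data-segment access flag combination. */
        #define VM_DATA_FLAGS_TSK_EXEC  (VM_READ | VM_WRITE | TASK_EXEC | \
                                         VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
        
        /* Generic default; an architecture may still override it. */
        #ifndef VM_DATA_DEFAULT_FLAGS
        #define VM_DATA_DEFAULT_FLAGS   VM_DATA_FLAGS_TSK_EXEC
        #endif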
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Chris Zankel <chris@zankel.net>
      Link: http://lkml.kernel.org/r/1583391014-8170-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory.c: add vm_insert_pages() · 8cd3984d
      Arjun Roy authored
      Add the ability to insert multiple pages at once into a user VM with
      fewer PTE spinlock operations.
      
      The intention of this patch set is to reduce atomic ops for TCP
      zerocopy receives, which normally hit the same spinlock multiple times
      consecutively.
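      
      The new entry point, sketched from the changelog (the exact signature
      is an assumption):
      
        /*
         * Insert up to *num pages, contiguously, into the user VMA starting
         * at addr, taking the PTE lock once per page-table page rather than
         * once per page.  On return, *num holds the number of pages that
         * could not be inserted.
         */
        int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
                            struct page **pages, unsigned long *num);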
      
      [akpm@linux-foundation.org: pte_alloc() no longer takes the `addr' argument]
      [arjunroy@google.com: add missing page_count() check to vm_insert_pages()]
        Link: http://lkml.kernel.org/r/20200214005929.104481-1-arjunroy.kdev@gmail.com
      [arjunroy@google.com: vm_insert_pages() checks if pte_index defined]
        Link: http://lkml.kernel.org/r/20200228054714.204424-2-arjunroy.kdev@gmail.com
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200128025958.43490-2-arjunroy.kdev@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: optionally allocate gigantic hugepages using cma · cf11e85f
      Roman Gushchin authored
      Commit 944d9fec ("hugetlb: add support for gigantic page allocation
      at runtime") added run-time allocation of gigantic pages.
      
      However, it actually works only at early stages of system boot, when
      the majority of memory is free.  After some time the memory gets
      fragmented by non-movable pages, so the chances of finding a contiguous
      1GB block get close to zero.  Even dropping caches manually doesn't
      help much.
      
      At large scale, rebooting servers in order to allocate gigantic
      hugepages is quite expensive and complex.  At the same time, keeping
      some constant percentage of memory in reserved hugepages even when the
      workload isn't using it is a big waste: not all workloads can benefit
      from 1 GB pages.
      
      The following solution can solve the problem:
      1) At boot time a dedicated cma area* is reserved. The size is passed
         as a kernel argument.
      2) Run-time allocations of gigantic hugepages are performed using the
         cma allocator and the dedicated cma area.
      
      In this case gigantic hugepages can be allocated successfully with a
      high probability, yet the memory isn't completely wasted if nobody is
      using 1GB hugepages: it can be used for pagecache, anon memory, THPs,
      etc.
      
      * On a multi-node machine a per-node cma area is allocated on each
        node.  Subsequent gigantic hugetlb allocations use the first
        available NUMA node if the mask isn't specified by the user.
      
      Usage:
      1) configure the kernel to allocate a cma area for hugetlb allocations:
         pass hugetlb_cma=10G as a kernel argument
      
      2) allocate hugetlb pages as usual, e.g.
         echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
      
      If the option isn't enabled, or if allocation of the cma area failed,
      the current behavior of the system is preserved.
      
      x86 and arm64 are covered by this patch; other architectures can be
      trivially added later.
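      
      As a sketch of the boot-time hook (hugetlb_cma_reserve() and the
      node-aware CMA call are named per this series; the body below is a
      simplified assumption):
      
        /* Called from arch setup code once memblock is ready. */
        void __init hugetlb_cma_reserve(int order)
        {
                int nid;
        
                for_each_node_state(nid, N_ONLINE) {
                        /* Carve this node's share of the hugetlb_cma= size
                         * out of movable memory via the CMA allocator. */
                        cma_declare_contiguous_nid(0, size_per_node, 0,
                                                   PAGE_SIZE << order, 0,
                                                   false, "hugetlb",
                                                   &hugetlb_cma[nid], nid);
                }
        }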
      
      The patch contains clean-ups and fixes proposed and implemented by Aslan
      Bakirov and Randy Dunlap.  It also contains ideas and suggestions
      proposed by Rik van Riel, Michal Hocko and Mike Kravetz.  Thanks!
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Andreas Schaufler <andreas.schaufler@gmx.de>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Cc: Aslan Bakirov <aslan@fb.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Link: http://lkml.kernel.org/r/20200407163840.92263-3-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: cma: NUMA node interface · 8676af1f
      Aslan Bakirov authored
      I've noticed that there is no interface exposed by CMA which would let
      me declare contiguous memory on a particular NUMA node.
      
      This patchset adds the ability to try to allocate contiguous memory on
      a specific node.  It will fall back to other nodes if the specified one
      doesn't work.
      
      Implement a new method for declaring contiguous memory on a particular
      node and keep cma_declare_contiguous() as a wrapper.
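      
      A sketch of the resulting interface (the wrapper arrangement follows
      the description above; the exact parameter list is an assumption):
      
        int __init cma_declare_contiguous_nid(phys_addr_t base,
                        phys_addr_t size, phys_addr_t limit,
                        phys_addr_t alignment, unsigned int order_per_bit,
                        bool fixed, const char *name, struct cma **res_cma,
                        int nid);
        
        /* Existing callers keep working: the old name simply forwards. */
        static inline int __init cma_declare_contiguous(phys_addr_t base,
                        phys_addr_t size, phys_addr_t limit,
                        phys_addr_t alignment, unsigned int order_per_bit,
                        bool fixed, const char *name, struct cma **res_cma)
        {
                return cma_declare_contiguous_nid(base, size, limit,
                                alignment, order_per_bit, fixed, name,
                                res_cma, NUMA_NO_NODE);
        }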
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Aslan Bakirov <aslan@fb.com>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Cc: Andreas Schaufler <andreas.schaufler@gmx.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Link: http://lkml.kernel.org/r/20200407163840.92263-2-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • docs: mm: slab.h: fix a broken cross-reference · 2370ae4b
      Mauro Carvalho Chehab authored
      There is a typo in the cross-reference link, causing this warning:
      
        include/linux/slab.h:11: WARNING: undefined label: memory-allocation (if the link has no caption the label must precede a section header)
      Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/0aeac24235d356ebd935d11e147dcc6edbb6465c.1586359676.git.mchehab+huawei@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk: queue wake_up_klogd irq_work only if per-CPU areas are ready · ab6f762f
      Sergey Senozhatsky authored
      printk_deferred(), similarly to printk_safe/printk_nmi, does not
      immediately attempt to print a new message on the consoles, avoiding
      calls into non-reentrant kernel paths, e.g. scheduler or timekeeping,
      which could potentially deadlock the system.
      
      Those printk() flavors instead rely on per-CPU flush irq_work to print
      messages from safer contexts.  For the same reasons (recursive
      scheduler or timekeeping calls) printk() uses per-CPU irq_work in order
      to wake up user-space syslog/kmsg readers.
      
      However, only printk_safe/printk_nmi make sure that per-CPU areas
      have been initialised and that it's safe to modify per-CPU irq_work.
      This means that, for instance, should printk_deferred() be invoked "too
      early", that is, before per-CPU areas are initialised,
      printk_deferred() will perform an illegal per-CPU access.
      
      Lech Perczak [0] reports that after commit 1b710b1b ("char/random:
      silence a lockdep splat with printk()") user-space syslog/kmsg readers
      are not able to read new kernel messages.
      
      The reason is printk_deferred() being called too early (as was pointed
      out by Petr and John).
      
      Fix printk_deferred() and do not queue per-CPU irq_work before per-CPU
      areas are initialized.
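      
      A sketch of the guard this adds (the helper name follows the patch
      description; the details below are assumptions):
      
        static bool __printk_percpu_data_ready;
        
        /* Set from printk's early init once per-CPU areas exist. */
        static bool printk_percpu_data_ready(void)
        {
                return __printk_percpu_data_ready;
        }
        
        void wake_up_klogd(void)
        {
                if (!printk_percpu_data_ready())
                        return;         /* too early for per-CPU irq_work */
        
                preempt_disable();
                if (waitqueue_active(&log_wait)) {
                        this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP);
                        irq_work_queue(this_cpu_ptr(&wake_up_klogd_work));
                }
                preempt_enable();
        }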
      
      Link: https://lore.kernel.org/lkml/aa0732c6-5c4e-8a8b-a1c1-75ebe3dca05b@camlintechnologies.com/
      Reported-by: Lech Perczak <l.perczak@camlintechnologies.com>
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Tested-by: Jann Horn <jannh@google.com>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: John Ogness <john.ogness@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 10 April 2020, 1 commit
    • proc: Use a dedicated lock in struct pid · 63f818f4
      Eric W. Biederman authored
      syzbot wrote:
      > ========================================================
      > WARNING: possible irq lock inversion dependency detected
      > 5.6.0-syzkaller #0 Not tainted
      > --------------------------------------------------------
      > swapper/1/0 just changed the state of lock:
      > ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
      > but this lock took another, SOFTIRQ-unsafe lock in the past:
      >  (&pid->wait_pidfd){+.+.}-{2:2}
      >
      >
      > and interrupts could create inverse lock ordering between them.
      >
      >
      > other info that might help us debug this:
      >  Possible interrupt unsafe locking scenario:
      >
      >        CPU0                    CPU1
      >        ----                    ----
      >   lock(&pid->wait_pidfd);
      >                                local_irq_disable();
      >                                lock(tasklist_lock);
      >                                lock(&pid->wait_pidfd);
      >   <Interrupt>
      >     lock(tasklist_lock);
      >
      >  *** DEADLOCK ***
      >
      > 4 locks held by swapper/1/0:
      
      The problem is that wait_pidfd.lock is taken under the tasklist lock.
      It must therefore always be taken with irqs disabled, because
      tasklist_lock can be taken from interrupt context, and if
      wait_pidfd.lock were already held, this would create a lock order
      inversion.
      
      Oleg suggested just disabling irqs where I have added extra calls to
      wait_pidfd.lock.  That should be safe, and I think the code will
      eventually do that.  It was rightly pointed out by Christian that
      sharing the wait_pidfd.lock was a premature optimization.
      
      It is also true that my pre-merge-window testing was insufficient.  So
      remove the premature optimization and give struct pid a dedicated lock
      of its own for struct pid things.  I have verified that lockdep sees
      all 3 paths where we take the new pid->lock, and lockdep does not
      complain.
      
      It is my current daydream that one day pid->lock can be used to guard
      the task lists as well, and then tasklist_lock won't need to be held to
      deliver signals.  That will require taking pid->lock with irqs
      disabled.
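      
      A sketch of the change's shape (members other than the new lock are
      taken from context and elided or assumed):
      
        struct pid {
                refcount_t count;
                /* New: a lock of struct pid's own, guarding its internals
                 * (e.g. the inodes list) instead of sharing
                 * wait_pidfd.lock. */
                spinlock_t lock;
                /* ... */
                wait_queue_head_t wait_pidfd;
                struct hlist_head inodes;
                /* ... */
        };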
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
      Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
      Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
      Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
      Fixes: 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  6. 08 April 2020, 17 commits