1. 22 May, 2020 (1 commit)
  2. 11 Apr, 2020 (12 commits)
    • change email address for Pali Rohár · 149ed3d4
      Committed by Pali Rohár
      For security reasons I stopped using my gmail account, and my kernel
      address is now an up-to-date alias for my personal address.
      
      People periodically send me emails at the address they found in the
      source code of drivers, so this change reflects where people can
      contact me.
      
      [ Added .mailmap entry as per Joe Perches  - Linus ]
      Signed-off-by: Pali Rohár <pali@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joe Perches <joe@perches.com>
      Link: http://lkml.kernel.org/r/20200307104237.8199-1-pali@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: add pgprot_t to mhp_params · bfeb022f
      Committed by Logan Gunthorpe
      devm_memremap_pages() is currently used by the PCI P2PDMA code to create
      struct page mappings for IO memory.  At present, these mappings are
      created with PAGE_KERNEL which implies setting the PAT bits to be WB.
      However, on x86, an MTRR register will typically override this and force
      the cache type to be UC-.  If firmware doesn't set this register, the
      mapping is effectively WB and will typically result in a machine check
      exception when it's accessed.
      
      Other arches are not currently likely to function correctly seeing they
      don't have any MTRR registers to fall back on.
      
      To solve this, provide a way to specify the pgprot value explicitly to
      arch_add_memory().
      
      Of the arches that support MEMORY_HOTPLUG: x86_64 and arm64 need a
      simple change to pass the pgprot_t down to their respective functions
      which set up the page tables.  For x86_32, set the page tables
      explicitly using _set_memory_prot() (seeing they are already mapped).
      
      For ia64, s390 and sh, reject anything but PAGE_KERNEL settings -- this
      should be fine, for now, seeing these architectures don't support
      ZONE_DEVICE.
      
      A check in __add_pages() is also added to ensure the pgprot parameter
      was set for all arches.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
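A minimal user-space sketch of the idea in this commit, assuming simplified types: the caller states the page protection explicitly, a generic check rejects an unset value, and an architecture without ZONE_DEVICE support rejects anything but the default. The names `pgprot_sketch_t`, `SKETCH_PAGE_KERNEL*`, `mhp_params_sketch`, and both functions are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the arch-defined pgprot_t. */
typedef struct { uint64_t pgprot; } pgprot_sketch_t;

/* Made-up encodings; real values are architecture specific. */
#define SKETCH_PAGE_KERNEL         UINT64_C(0x1)
#define SKETCH_PAGE_KERNEL_NOCACHE UINT64_C(0x2)

/* Models struct mhp_params carrying the requested protection. */
struct mhp_params_sketch {
	pgprot_sketch_t pgprot;
};

/* Models the check added to __add_pages(): the caller must set pgprot. */
static int add_pages_sketch(const struct mhp_params_sketch *params)
{
	if (!params->pgprot.pgprot)
		return -EINVAL;
	return 0;
}

/* Models an arch without ZONE_DEVICE support: only PAGE_KERNEL is accepted. */
static int arch_add_memory_strict_sketch(const struct mhp_params_sketch *params)
{
	if (params->pgprot.pgprot != SKETCH_PAGE_KERNEL)
		return -EINVAL;
	return add_pages_sketch(params);
}
```

In this model a P2PDMA-style caller would pass the uncached value to an architecture that threads the protection through to its page tables, while the strict variant corresponds to the ia64, s390 and sh behaviour described in the commit message.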
    • mm/memory_hotplug: rename mhp_restrictions to mhp_params · f5637d3b
      Committed by Logan Gunthorpe
      The mhp_restrictions struct really doesn't specify anything resembling a
      restriction anymore so rename it to be mhp_params as it is a list of
      extended parameters.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-3-logang@deltatee.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: drop the flags field from struct mhp_restrictions · 96c6b598
      Committed by Logan Gunthorpe
      Patch series "Allow setting caching mode in arch_add_memory() for
      P2PDMA", v4.
      
      Currently, the page tables created using memremap_pages() are always
      created with the PAGE_KERNEL caching mode.  However, the P2PDMA code is
      creating pages for PCI BAR memory which should never be accessed through
      the cache and instead use either WC or UC.  This still works in most
      cases, on x86, because the MTRR registers typically override the caching
      settings in the page tables for all of the IO memory to be UC-.
      However, this tends not to work so well on other arches or some rare x86
      machines that have firmware which does not set up the MTRR registers in
      this way.
      
      Instead of this, this series proposes a change to arch_add_memory() to
      take the pgprot required by the mapping which allows us to explicitly
      set pagetable entries for P2PDMA memory to UC.
      
      This change is pretty routine for most of the arches: x86_64, arm64 and
      powerpc simply need to thread the pgprot through to where the page
      tables are set up.  x86_32 unfortunately sets up the page tables at boot
      so must use _set_memory_prot() to change their caching mode.  ia64, s390
      and sh don't appear to have an easy way to change the page tables so,
      for now at least, we just return -EINVAL on such mappings and thus they
      will not support P2PDMA memory until the work for this is done.  This
      should be fine as they don't yet support ZONE_DEVICE.
      
      This patch (of 7):
      
      This variable is not used anywhere and should therefore be removed from
      the structure.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-2-logang@deltatee.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/special: create generic fallbacks for pte_special() and pte_mkspecial() · 78e7c5af
      Committed by Anshuman Khandual
      Currently there are many platforms that don't enable ARCH_HAS_PTE_SPECIAL
      but are required to define quite similar fallback stubs for special page
      table entry helpers such as pte_special() and pte_mkspecial(), as they
      get built into generic MM without a config check.  This creates two
      generic fallback stub definitions for these helpers, eliminating much
      code duplication.
      
      The mips platform has a special case where the visibility of
      pte_special() and pte_mkspecial() is wider than what
      ARCH_HAS_PTE_SPECIAL enablement requires.  This restricts the visibility
      of those symbols in order to avoid redefinitions, which are now exposed
      through the new generic stubs, and the subsequent build failure.  The arm
      platform's set_pte_at() definition needs to be moved into a C file just
      to prevent a build failure.
      
      [anshuman.khandual@arm.com: use defined(CONFIG_ARCH_HAS_PTE_SPECIAL) in mips per Thomas]
        Link: http://lkml.kernel.org/r/1583851924-21603-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Guo Ren <guoren@kernel.org>			[csky]
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Acked-by: Stafford Horne <shorne@gmail.com>		[openrisc]
      Acked-by: Helge Deller <deller@gmx.de>			[parisc]
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Sam Creasey <sammy@sammy.net>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Link: http://lkml.kernel.org/r/1583802551-15406-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
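The generic-fallback pattern from this commit can be sketched in plain C as follows. This is an illustration only: `pte_sketch_t`, `SKETCH_ARCH_HAS_PTE_SPECIAL`, and the `_sketch` helpers are hypothetical names standing in for the arch-defined `pte_t`, the Kconfig symbol, and the real helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the arch-defined pte_t. */
typedef struct { uint64_t val; } pte_sketch_t;

/*
 * An arch that supports special PTEs would define
 * SKETCH_ARCH_HAS_PTE_SPECIAL and provide its own helpers.
 * Every other arch shares the single fallback pair below
 * instead of carrying its own copy.
 */
#ifndef SKETCH_ARCH_HAS_PTE_SPECIAL
static inline int pte_special_sketch(pte_sketch_t pte)
{
	(void)pte;
	return 0;	/* no PTE is ever reported as special */
}

static inline pte_sketch_t pte_mkspecial_sketch(pte_sketch_t pte)
{
	return pte;	/* marking a PTE special is a no-op */
}
#endif
```

Guarding the stubs with the config symbol is also what makes a wider-than-needed arch definition (the mips case in the message) show up as a redefinition.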
    • mm/vma: introduce VM_ACCESS_FLAGS · 6cb4d9a2
      Committed by Anshuman Khandual
      There are many places where all basic VMA access flags (read, write,
      exec) are initialized or checked against as a group.  One such example
      is during page fault.  Existing vma_is_accessible() wrapper already
      creates the notion of VMA accessibility as a group of access permissions.
      
      Hence let's just create VM_ACCESS_FLAGS (VM_READ|VM_WRITE|VM_EXEC) which
      will not only reduce code duplication but also extend the VMA
      accessibility concept in general.
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rob Springer <rspringer@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Link: http://lkml.kernel.org/r/1583391014-8170-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
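As a rough illustration of grouping the three access bits, the sketch below defines the mask and a helper modelled on vma_is_accessible(). The struct `vma_sketch` and the `_sketch` helper are hypothetical; the flag values mirror the usual kernel encodings but should be treated as assumptions here.

```c
#include <assert.h>
#include <stdbool.h>

/* Basic VMA permission bits (assumed values for this sketch). */
#define VM_READ   0x00000001UL
#define VM_WRITE  0x00000002UL
#define VM_EXEC   0x00000004UL
#define VM_SHARED 0x00000008UL

/* The grouped mask introduced by the commit. */
#define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC)

/* Hypothetical, heavily reduced stand-in for struct vm_area_struct. */
struct vma_sketch {
	unsigned long vm_flags;
};

/* A VMA is accessible if at least one of read, write, exec is allowed. */
static inline bool vma_is_accessible_sketch(const struct vma_sketch *vma)
{
	return (vma->vm_flags & VM_ACCESS_FLAGS) != 0;
}
```

A PROT_NONE-style mapping (no access bits set, possibly with other flags such as VM_SHARED) is then reported as inaccessible with a single mask test.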
    • mm/vma: define a default value for VM_DATA_DEFAULT_FLAGS · c62da0c3
      Committed by Anshuman Khandual
      There are many platforms with the exact same value for
      VM_DATA_DEFAULT_FLAGS.  This creates a default value for
      VM_DATA_DEFAULT_FLAGS in line with the
      existing VM_STACK_DEFAULT_FLAGS.  While here, also define some more
      macros with standard VMA access flag combinations that are used
      frequently across many platforms.  Apart from simplification, this
      reduces code duplication as well.
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Chris Zankel <chris@zankel.net>
      Link: http://lkml.kernel.org/r/1583391014-8170-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory.c: add vm_insert_pages() · 8cd3984d
      Committed by Arjun Roy
      Add the ability to insert multiple pages at once to a user VM with fewer
      PTE spinlock operations.
      
      The intention of this patch-set is to reduce atomic ops for tcp zerocopy
      receives, which normally hit the same spinlock multiple times
      consecutively.
      
      [akpm@linux-foundation.org: pte_alloc() no longer takes the `addr' argument]
      [arjunroy@google.com: add missing page_count() check to vm_insert_pages()]
        Link: http://lkml.kernel.org/r/20200214005929.104481-1-arjunroy.kdev@gmail.com
      [arjunroy@google.com: vm_insert_pages() checks if pte_index defined]
        Link: http://lkml.kernel.org/r/20200228054714.204424-2-arjunroy.kdev@gmail.com
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200128025958.43490-2-arjunroy.kdev@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
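The saving described in this commit comes from taking the page-table lock once per batch instead of once per page. The toy model below only counts lock acquisitions to show the difference; every name in it (`sketch_table`, `sketch_pte_lock()`, `insert_page_sketch()`, `insert_pages_sketch()`) is hypothetical and none of it reflects the real vm_insert_pages() implementation.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_SLOTS 16

static int sketch_table[SKETCH_NR_SLOTS];	/* stands in for one page table */
static unsigned long sketch_lock_ops;		/* counts lock acquisitions */

static void sketch_pte_lock(void)   { sketch_lock_ops++; }
static void sketch_pte_unlock(void) { }

/* One page per call: one lock round trip per page. */
static int insert_page_sketch(size_t slot, int page)
{
	if (slot >= SKETCH_NR_SLOTS)
		return -1;
	sketch_pte_lock();
	sketch_table[slot] = page;
	sketch_pte_unlock();
	return 0;
}

/* Batched variant: one lock round trip for the whole run of pages. */
static int insert_pages_sketch(size_t first, const int *pages, size_t n)
{
	if (first > SKETCH_NR_SLOTS || n > SKETCH_NR_SLOTS - first)
		return -1;
	sketch_pte_lock();
	for (size_t i = 0; i < n; i++)
		sketch_table[first + i] = pages[i];
	sketch_pte_unlock();
	return 0;
}
```

Inserting four pages one at a time costs four lock acquisitions in this model, while the batched call costs one.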
    • mm: hugetlb: optionally allocate gigantic hugepages using cma · cf11e85f
      Committed by Roman Gushchin
      Commit 944d9fec ("hugetlb: add support for gigantic page allocation
      at runtime") has added the run-time allocation of gigantic pages.
      
      However it actually works only at early stages of the system loading,
      when the majority of memory is free.  After some time the memory gets
      fragmented by non-movable pages, so the chances to find a contiguous 1GB
      block are getting close to zero.  Even dropping caches manually doesn't
      help a lot.
      
      At large scale rebooting servers in order to allocate gigantic hugepages
      is quite expensive and complex.  At the same time keeping some constant
      percentage of memory in reserved hugepages even if the workload isn't
      using it is a big waste: not all workloads can benefit from using 1 GB
      pages.
      
      The following solution can solve the problem:
      1) At boot time a dedicated cma area* is reserved. The size is passed
         as a kernel argument.
      2) Run-time allocations of gigantic hugepages are performed using the
         cma allocator and the dedicated cma area
      
      In this case gigantic hugepages can be allocated successfully with a
      high probability, however the memory isn't completely wasted if nobody
      is using 1GB hugepages: it can be used for pagecache, anon memory, THPs,
      etc.
      
      * On a multi-node machine a per-node cma area is allocated on each node.
        Subsequent gigantic hugetlb allocations use the first available
        numa node if the mask isn't specified by a user.
      
      Usage:
      1) configure the kernel to allocate a cma area for hugetlb allocations:
         pass hugetlb_cma=10G as a kernel argument
      
      2) allocate hugetlb pages as usual, e.g.
         echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
      
      If the option isn't enabled or the allocation of the cma area failed,
      the current behavior of the system is preserved.
      
      x86 and arm64 are covered by this patch, other architectures can be
      trivially added later.
      
      The patch contains clean-ups and fixes proposed and implemented by Aslan
      Bakirov and Randy Dunlap.  It also contains ideas and suggestions
      proposed by Rik van Riel, Michal Hocko and Mike Kravetz.  Thanks!
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Andreas Schaufler <andreas.schaufler@gmx.de>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Cc: Aslan Bakirov <aslan@fb.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Link: http://lkml.kernel.org/r/20200407163840.92263-3-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: cma: NUMA node interface · 8676af1f
      Committed by Aslan Bakirov
      I've noticed that there is no interface exposed by CMA which would let
      me declare contiguous memory on a particular NUMA node.
      
      This patchset adds the ability to try to allocate contiguous memory on a
      specific node.  It will fall back to other nodes if the specified one
      doesn't work.
      
      Implement a new method for declaring contiguous memory on a particular
      node and keep cma_declare_contiguous() as a wrapper.
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Aslan Bakirov <aslan@fb.com>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Cc: Andreas Schaufler <andreas.schaufler@gmx.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Link: http://lkml.kernel.org/r/20200407163840.92263-2-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
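The "node-aware function plus thin wrapper" shape described in this commit can be modelled as below. This is a toy allocator under heavy assumptions: the per-node free counters, `SKETCH_NUMA_NO_NODE`, and both functions are hypothetical and far simpler than the real CMA interface, which takes many more parameters.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_NUMA_NO_NODE (-1)
#define SKETCH_NR_NODES     2

/* Free contiguous memory per node, in arbitrary units (made-up numbers). */
static unsigned long sketch_node_free[SKETCH_NR_NODES] = { 64, 512 };

/*
 * Try the requested node first, then fall back to any node that fits.
 * Returns the node used, or -ENOMEM if no node can satisfy the request.
 */
static int declare_contiguous_nid_sketch(unsigned long size, int nid)
{
	if (nid >= 0 && nid < SKETCH_NR_NODES && sketch_node_free[nid] >= size) {
		sketch_node_free[nid] -= size;
		return nid;
	}
	for (int n = 0; n < SKETCH_NR_NODES; n++) {
		if (sketch_node_free[n] >= size) {
			sketch_node_free[n] -= size;
			return n;
		}
	}
	return -ENOMEM;
}

/* The old entry point survives as a wrapper with no node preference. */
static int declare_contiguous_sketch(unsigned long size)
{
	return declare_contiguous_nid_sketch(size, SKETCH_NUMA_NO_NODE);
}
```

Keeping the original function as a wrapper means existing callers need no changes, while new callers can state a preferred node.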
    • docs: mm: slab.h: fix a broken cross-reference · 2370ae4b
      Committed by Mauro Carvalho Chehab
      There is a typo at the cross-reference link, causing this warning:
      
        include/linux/slab.h:11: WARNING: undefined label: memory-allocation (if the link has no caption the label must precede a section header)
      Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/0aeac24235d356ebd935d11e147dcc6edbb6465c.1586359676.git.mchehab+huawei@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk: queue wake_up_klogd irq_work only if per-CPU areas are ready · ab6f762f
      Committed by Sergey Senozhatsky
      printk_deferred(), similarly to printk_safe/printk_nmi, does not
      immediately attempt to print a new message on the consoles, avoiding
      calls into non-reentrant kernel paths, e.g. scheduler or timekeeping,
      which potentially can deadlock the system.
      
      Those printk() flavors, instead, rely on per-CPU flush irq_work to print
      messages from safer contexts.  For the same reasons (recursive scheduler or
      timekeeping calls) printk() uses per-CPU irq_work in order to wake up
      user space syslog/kmsg readers.
      
      However, only printk_safe/printk_nmi do make sure that per-CPU areas
      have been initialised and that it's safe to modify per-CPU irq_work.
      This means that, for instance, should printk_deferred() be invoked "too
      early", that is before per-CPU areas are initialised, printk_deferred()
      will perform illegal per-CPU access.
      
      Lech Perczak [0] reports that after commit 1b710b1b ("char/random:
      silence a lockdep splat with printk()") user-space syslog/kmsg readers
      are not able to read new kernel messages.
      
      The reason is printk_deferred() being called too early (as was pointed
      out by Petr and John).
      
      Fix printk_deferred() and do not queue per-CPU irq_work before per-CPU
      areas are initialized.
      
      Link: https://lore.kernel.org/lkml/aa0732c6-5c4e-8a8b-a1c1-75ebe3dca05b@camlintechnologies.com/
      Reported-by: Lech Perczak <l.perczak@camlintechnologies.com>
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Tested-by: Jann Horn <jannh@google.com>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: John Ogness <john.ogness@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
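The guard this fix adds can be pictured as a readiness flag checked before any per-CPU work is queued. The sketch below is a single-threaded toy model; the flag, the counter, and all function names are hypothetical and do not mirror the kernel's printk internals.

```c
#include <assert.h>
#include <stdbool.h>

/* Set once per-CPU areas are initialised (hypothetical flag). */
static bool sketch_percpu_data_ready;

/* Counts how many times deferred work was queued. */
static int sketch_queued_irq_work;

/* Called once early boot has set up the per-CPU areas. */
static void sketch_set_percpu_data_ready(void)
{
	sketch_percpu_data_ready = true;
}

/* Deferred-print path: touch per-CPU state only when it is safe. */
static void defer_console_output_sketch(void)
{
	if (!sketch_percpu_data_ready)
		return;		/* too early: skip the per-CPU access */
	sketch_queued_irq_work++;
}
```

A call made "too early" is dropped silently in this model instead of performing an illegal per-CPU access, and later calls queue work as usual.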
  3. 10 Apr, 2020 (1 commit)
    • proc: Use a dedicated lock in struct pid · 63f818f4
      Committed by Eric W. Biederman
      syzbot wrote:
      > ========================================================
      > WARNING: possible irq lock inversion dependency detected
      > 5.6.0-syzkaller #0 Not tainted
      > --------------------------------------------------------
      > swapper/1/0 just changed the state of lock:
      > ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
      > but this lock took another, SOFTIRQ-unsafe lock in the past:
      >  (&pid->wait_pidfd){+.+.}-{2:2}
      >
      >
      > and interrupts could create inverse lock ordering between them.
      >
      >
      > other info that might help us debug this:
      >  Possible interrupt unsafe locking scenario:
      >
      >        CPU0                    CPU1
      >        ----                    ----
      >   lock(&pid->wait_pidfd);
      >                                local_irq_disable();
      >                                lock(tasklist_lock);
      >                                lock(&pid->wait_pidfd);
      >   <Interrupt>
      >     lock(tasklist_lock);
      >
      >  *** DEADLOCK ***
      >
      > 4 locks held by swapper/1/0:
      
      The problem is that because wait_pidfd.lock is taken under the tasklist
      lock, it must always be taken with irqs disabled, as tasklist_lock can be
      taken from interrupt context and if wait_pidfd.lock was already taken this
      would create a lock order inversion.
      
      Oleg suggested just disabling irqs where I have added extra calls to
      wait_pidfd.lock.  That should be safe and I think the code will eventually
      do that.  It was rightly pointed out by Christian that sharing the
      wait_pidfd.lock was a premature optimization.
      
      It is also true that my pre-merge window testing was insufficient.  So
      remove the premature optimization and give struct pid a dedicated lock of
      its own for struct pid things.  I have verified that lockdep sees all 3
      paths where we take the new pid->lock and lockdep does not complain.
      
      It is my current day dream that one day pid->lock can be used to guard the
      task lists as well and then the tasklist_lock won't need to be held to
      deliver signals.  That will require taking pid->lock with irqs disabled.
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
      Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
      Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
      Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
      Fixes: 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  4. 08 Apr, 2020 (26 commits)
    • locking/refcount: Document interaction with PID_MAX_LIMIT · a13f58a0
      Committed by Jann Horn
      Document the circumstances under which refcount_t's saturation mechanism
      works deterministically.
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Will Deacon <will@kernel.org>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20200303105427.260620-1-jannh@google.com
    • linux/bits.h: add compile time sanity check of GENMASK inputs · 295bcca8
      Committed by Rikard Falkeborn
      GENMASK() and GENMASK_ULL() are supposed to be called with the high bit as
      the first argument and the low bit as the second argument.  Mixing them
      will return a mask with zero bits set.
      
      Recent commits show getting this wrong is not uncommon, see e.g.  commit
      aa4c0c90 ("net: stmmac: Fix misuses of GENMASK macro") and commit
      9bdd7bb3 ("clocksource/drivers/npcm: Fix misuse of GENMASK macro").
      
      To prevent such mistakes from appearing again, add compile time sanity
      checking to the arguments of GENMASK() and GENMASK_ULL().  If both
      arguments are known at compile time, and the low bit is higher than the
      high bit, break the build to detect the mistake immediately.
      
      Since GENMASK() is used in declarations, BUILD_BUG_ON_ZERO() must be used
      instead of BUILD_BUG_ON().
      
      __builtin_constant_p does not evaluate its argument, it only checks if it
      is a constant or not at compile time, and __builtin_choose_expr does not
      evaluate the expression that is not chosen.  Therefore, GENMASK(x++, 0)
      evaluates x++ only once.
      
      Commit 95b980d6 ("linux/bits.h: make BIT(), GENMASK(), and friends
      available in assembly") made the macros in linux/bits.h available in
      assembly.  Since BUILD_BUG_ON_ZERO() is not asm compatible, disable the
      checks if the file is included in an asm file.
      
      Due to bugs in GCC versions before 4.9 [0], disable the check if building
      with a too old GCC compiler.
      
      [0]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=19449
      Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Haren Myneni <haren@us.ibm.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: lkml <linux-kernel@vger.kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20200308193954.2372399-1-rikard.falkeborn@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
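A user-space approximation of the check described in this commit is sketched below. It assumes a GNU C compatible compiler (GCC or Clang), since it relies on `__builtin_constant_p`, `__builtin_choose_expr`, and a struct containing only an unnamed bit-field. All macro names carry a `_SKETCH` suffix to mark them as illustrative rather than the kernel's own definitions.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG_SKETCH ((int)(sizeof(unsigned long) * CHAR_BIT))

/*
 * Evaluates to 0 when e is false.  When e is true the unnamed
 * bit-field gets a negative width, which is a compile error.
 */
#define BUILD_BUG_ON_ZERO_SKETCH(e) \
	((int)(sizeof(struct { int : (-!!(e)); })))

/*
 * Only check when (l) > (h) is a compile-time constant; otherwise
 * __builtin_choose_expr selects the harmless constant 0.
 */
#define GENMASK_INPUT_CHECK_SKETCH(h, l) \
	(BUILD_BUG_ON_ZERO_SKETCH(__builtin_choose_expr( \
		__builtin_constant_p((l) > (h)), (l) > (h), 0)))

/* Bits l..h set, both inclusive, high bit first. */
#define RAW_GENMASK_SKETCH(h, l) \
	(((~0UL) - (1UL << (l)) + 1) & \
	 (~0UL >> (BITS_PER_LONG_SKETCH - 1 - (h))))

#define GENMASK_SKETCH(h, l) \
	(GENMASK_INPUT_CHECK_SKETCH(h, l) + RAW_GENMASK_SKETCH(h, l))
```

With these definitions `GENMASK_SKETCH(7, 4)` yields `0xF0`, while swapping the constant arguments to `GENMASK_SKETCH(4, 7)` is expected to stop the build rather than silently produce an empty mask.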
    • percpu_counter: fix a data race at vm_committed_as · 7e234520
      Committed by Qian Cai
      "vm_committed_as.count" could be accessed concurrently as reported by
      KCSAN,
      
       BUG: KCSAN: data-race in __vm_enough_memory / percpu_counter_add_batch
      
       write to 0xffffffff9451c538 of 8 bytes by task 65879 on cpu 35:
        percpu_counter_add_batch+0x83/0xd0
        percpu_counter_add_batch at lib/percpu_counter.c:91
        __vm_enough_memory+0xb9/0x260
        dup_mm+0x3a4/0x8f0
        copy_process+0x2458/0x3240
        _do_fork+0xaa/0x9f0
        __do_sys_clone+0x125/0x160
        __x64_sys_clone+0x70/0x90
        do_syscall_64+0x91/0xb05
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
       read to 0xffffffff9451c538 of 8 bytes by task 66773 on cpu 19:
        __vm_enough_memory+0x199/0x260
        percpu_counter_read_positive at include/linux/percpu_counter.h:81
        (inlined by) __vm_enough_memory at mm/util.c:839
        mmap_region+0x1b2/0xa10
        do_mmap+0x45c/0x700
        vm_mmap_pgoff+0xc0/0x130
        ksys_mmap_pgoff+0x6e/0x300
        __x64_sys_mmap+0x33/0x40
        do_syscall_64+0x91/0xb05
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The read is outside percpu_counter::lock critical section which results in
      a data race.  Fix it by adding a READ_ONCE() in
      percpu_counter_read_positive() which could also serve as the existing
      compiler memory barrier.
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Marco Elver <elver@google.com>
      Link: http://lkml.kernel.org/r/1582302724-2804-1-git-send-email-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
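The shape of this fix can be sketched as a single marked load followed by a clamp. The code below is a simplified user-space model: `READ_ONCE_SKETCH` is a reduced volatile-cast version of the idea behind READ_ONCE() and relies on the GNU `__typeof__` extension, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified READ_ONCE: force exactly one load through a volatile access. */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical, reduced stand-in for struct percpu_counter. */
struct percpu_counter_sketch {
	int64_t count;
};

/*
 * Lockless read of the approximate count, clamped at zero.
 * Loading into a local once means the value tested against 0 is
 * the same value that is returned, even if writers race with us.
 */
static inline int64_t
percpu_counter_read_positive_sketch(struct percpu_counter_sketch *fbc)
{
	int64_t ret = READ_ONCE_SKETCH(fbc->count);

	if (ret >= 0)
		return ret;
	return 0;
}
```

Without the single marked load, a compiler would be free to read the shared field twice, so the sign test and the returned value could come from different moments.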
    • kasan: stackdepot: move filter_irq_stacks() to stackdepot.c · 505a0ef1
      Committed by Alexander Potapenko
      filter_irq_stacks() can be used by other tools (e.g.  KMSAN), so it needs
      to be moved to a common location.  lib/stackdepot.c seems a good place, as
      filter_irq_stacks() is usually applied to the output of
      stack_trace_save().
      
      This patch has been previously mailed as part of KMSAN RFC patch series.
      
      [glider@google.com: nds32: linker script: add SOFTIRQENTRY_TEXT]
        Link: http://lkml.kernel.org/r/20200311121002.241430-1-glider@google.com
      [glider@google.com: add IRQENTRY_TEXT and SOFTIRQENTRY_TEXT to linker script]
        Link: http://lkml.kernel.org/r/20200311121124.243352-1-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Link: http://lkml.kernel.org/r/20200220141916.55455-3-glider@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      505a0ef1
    • J
      bitops: always inline sign extension helpers · f80ac98a
      Josh Poimboeuf committed
      With CONFIG_CC_OPTIMIZE_FOR_SIZE, objtool reports:
      
        drivers/gpu/drm/i915/gem/i915_gem_execbuffer.o: warning: objtool: i915_gem_execbuffer2_ioctl()+0x5b7: call to gen8_canonical_addr() with UACCESS enabled
      
      This means i915_gem_execbuffer2_ioctl() is calling gen8_canonical_addr()
      from the user_access_begin/end critical region (i.e., with SMAP disabled).
      
      While it's probably harmless in this case, in general we like to avoid
      extra function calls in SMAP-disabled regions because it can open up
      inadvertent security holes.
      
      Fix the warning by changing the sign extension helpers to __always_inline.
      This convinces GCC to inline gen8_canonical_addr().
      
      The sign extension functions are trivial anyway, so it makes sense to
      always inline them.  With my test optimize-for-size-based config, this
      actually shrinks the text size of i915_gem_execbuffer.o by 45 bytes -- and
      no change for vmlinux.
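      A minimal sketch of the technique, assuming GCC or Clang: a sign extension
      helper marked always-inline, plus a caller that canonicalizes an address
      by sign-extending from bit 47.  The names, the attribute spelling and the
      choice of bit 47 are assumptions for illustration, not the kernel's or
      i915's actual definitions.

```c
#include <stdint.h>

/* Assumed spelling of an always-inline marker (GCC/Clang attribute). */
#define ALWAYS_INLINE_SKETCH inline __attribute__((__always_inline__))

/* Sign-extend a 64-bit value, treating bit `index` as the sign bit.
 * Relies on arithmetic right shift of negative values, which GCC and
 * Clang provide but which ISO C leaves implementation-defined. */
static ALWAYS_INLINE_SKETCH int64_t sign_extend64_sketch(uint64_t value, int index)
{
	uint8_t shift = 63 - index;

	return (int64_t)(value << shift) >> shift;
}

/* Hypothetical caller: because the helper is always inlined, no real
 * function call is emitted here, even when optimizing for size. */
static ALWAYS_INLINE_SKETCH uint64_t canonical_addr_sketch(uint64_t address)
{
	return (uint64_t)sign_extend64_sketch(address, 47);
}
```

      With the attribute in place, the compiler has no choice to emit an
      out-of-line copy at the call site, which is the property the warning is
      about.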
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Link: http://lkml.kernel.org/r/740179324b2b18b750b16295c48357f00b5fa9ed.1582982020.git.jpoimboe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f80ac98a
    • V
      compiler.h: fix error in BUILD_BUG_ON() reporting · af9c5d2e
      Vegard Nossum committed
      compiletime_assert() uses __LINE__ to create a unique function name.  This
      means that if you have more than one BUILD_BUG_ON() in the same source
      line (which can happen if they appear e.g.  in a macro), then the error
      message from the compiler might output the wrong condition.
      
      For this source file:
      
      	#include <linux/build_bug.h>
      
      	#define macro() \
      		BUILD_BUG_ON(1); \
      		BUILD_BUG_ON(0);
      
      	void foo()
      	{
      		macro();
      	}
      
      gcc would output:
      
      ./include/linux/compiler.h:350:38: error: call to `__compiletime_assert_9' declared with attribute error: BUILD_BUG_ON failed: 0
        _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
      
      However, it was not the BUILD_BUG_ON(0) that failed, so it should say 1
      instead of 0. With this patch, we use __COUNTER__ instead of __LINE__, so
      each BUILD_BUG_ON() gets a different function name and the correct
      condition is printed:
      
      ./include/linux/compiler.h:350:38: error: call to `__compiletime_assert_0' declared with attribute error: BUILD_BUG_ON failed: 1
        _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
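      The difference between the two predefined macros can be seen in a small
      standalone sketch (the macro and array names here are made up for
      illustration).  When both uses come from one macro invocation, __LINE__
      yields the same number twice, while __COUNTER__, a GCC/Clang extension,
      yields a new number on every expansion:

```c
/* Each expansion of __COUNTER__ produces the next integer. */
#define ID_PAIR_FROM_COUNTER() { __COUNTER__, __COUNTER__ }

/* __LINE__ is the line of the macro invocation, so both uses match. */
#define ID_PAIR_FROM_LINE()    { __LINE__, __LINE__ }

/* Two "unique" suffixes generated by a single macro invocation each. */
static const int counter_ids[2] = ID_PAIR_FROM_COUNTER();
static const int line_ids[2]    = ID_PAIR_FROM_LINE();
```

      Pasting such a suffix onto a function name is what makes each assertion
      distinguishable in the compiler's error message.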
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Reviewed-by: Daniel Santos <daniel.santos@pobox.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Ian Abbott <abbotti@mev.co.uk>
      Cc: Joe Perches <joe@perches.com>
      Link: http://lkml.kernel.org/r/20200331112637.25047-1-vegard.nossum@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af9c5d2e
    • M
      compiler: remove CONFIG_OPTIMIZE_INLINING entirely · 889b3c12
      Masahiro Yamada committed
      Commit ac7c3e4f ("compiler: enable CONFIG_OPTIMIZE_INLINING
      forcibly") made this an always-on option.  v5.4 and v5.5 were released
      with that commit included.
      
      Remove the CONFIG option and clean up the code now.
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20200220110807.32534-2-masahiroy@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      889b3c12
    • M
      seq_file: remove m->version · b829a0f0
      Matthew Wilcox (Oracle) committed
      The process maps file was the only user of m->version (introduced back in
      2005).  Now that it uses ppos instead, we can remove it.
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200317193201.9924-4-adobriyan@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b829a0f0
    • A
      proc: faster open/read/close with "permanent" files · d919b33d
      Alexey Dobriyan committed
      Now that "struct proc_ops" exists, we can start putting stuff there that
      could not fly with VFS "struct file_operations"...
      
      Most of fs/proc/inode.c is dedicated to making open/read/.../close
      reliable when /proc entries disappear, which usually happens when a
      module is removed.  Files like /proc/cpuinfo, which never disappear,
      simply do not need such protection.
      
      Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
      "permanent" files.
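      
      To make the idea concrete, here is a user-space sketch, under the
      assumption that the protection amounts to an in-use reference taken and
      dropped around each operation.  Every name below is hypothetical and the
      bookkeeping is far simpler than what fs/proc/inode.c actually does.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical entry: a flag plus the in-use count that protects
 * removable entries while an operation is in flight. */
struct entry_sketch {
	bool permanent;		/* entry can never be removed */
	atomic_int in_use;	/* readers currently inside the entry */
	int atomic_ops;		/* bookkeeping for this demo only */
};

/* Perform one "read"; permanent entries skip the reference dance. */
static int read_entry_sketch(struct entry_sketch *e)
{
	int value;

	if (e->permanent)
		return 42;	/* fast path: no atomics needed */

	atomic_fetch_add(&e->in_use, 1);	/* pin the entry */
	e->atomic_ops++;
	value = 42;				/* the actual operation */
	atomic_fetch_sub(&e->in_use, 1);	/* unpin the entry */
	e->atomic_ops++;
	return value;
}
```

      Both paths return the same data; only the removable entry pays for the
      two atomic operations.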
      
      Enable "permanent" flag for
      
      	/proc/cpuinfo
      	/proc/kmsg
      	/proc/modules
      	/proc/slabinfo
      	/proc/stat
      	/proc/sysvipc/*
      	/proc/swaps
      
      More will come once I figure out a foolproof way to prevent out-of-tree
      module authors from marking their stuff "permanent" for performance
      reasons when it is not.
      
      This should help with scalability: benchmark is "read /proc/cpuinfo R times
      by N threads scattered over the system".
      
      	N	R	t, s (before)	t, s (after)
      	-----------------------------------------------------
      	64	4096	1.582458	1.530502	-3.2%
      	256	4096	6.371926	6.125168	-3.9%
      	1024	4096	25.64888	24.47528	-4.6%
      
      Benchmark source:
      
      #include <chrono>
      #include <iostream>
      #include <thread>
      #include <vector>
      
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <sched.h>	// sched_setaffinity(), cpu_set_t
      #include <stdlib.h>	// atoi()
      #include <unistd.h>
      
      const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
      int N;
      const char *filename;
      int R;
      
      int xxx = 0;
      
      int glue(int n)
      {
      	cpu_set_t m;
      	CPU_ZERO(&m);
      	CPU_SET(n, &m);
      	return sched_setaffinity(0, sizeof(cpu_set_t), &m);
      }
      
      void f(int n)
      {
      	glue(n % NR_CPUS);
      
      	while (*(volatile int *)&xxx == 0) {
      	}
      
      	for (int i = 0; i < R; i++) {
      		int fd = open(filename, O_RDONLY);
      		char buf[4096];
      		ssize_t rv = read(fd, buf, sizeof(buf));
      		asm volatile ("" :: "g" (rv));
      		close(fd);
      	}
      }
      
      int main(int argc, char *argv[])
      {
      	if (argc < 4) {
      		std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R\n";
      		return 1;
      	}
      
      	N = atoi(argv[1]);
      	filename = argv[2];
      	R = atoi(argv[3]);
      
      	for (int i = 0; i < NR_CPUS; i++) {
      		if (glue(i) == 0)
      			break;
      	}
      
      	std::vector<std::thread> T;
      	T.reserve(N);
      	for (int i = 0; i < N; i++) {
      		T.emplace_back(f, i);
      	}
      
      	auto t0 = std::chrono::system_clock::now();
      	{
      		*(volatile int *)&xxx = 1;
      		for (auto& t: T) {
      			t.join();
      		}
      	}
      	auto t1 = std::chrono::system_clock::now();
      	std::chrono::duration<double> dt = t1 - t0;
      	std::cout << dt.count() << '\n';
      
      	return 0;
      }
      
      P.S.:
      Explicit randomization marker is added because adding non-function pointer
      will silently disable structure layout randomization.
      
      [akpm@linux-foundation.org: coding style fixes]
      Reported-by: kbuild test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Joe Perches <joe@perches.com>
      Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d919b33d
    • W
      mm: remove dummy struct bootmem_data/bootmem_data_t · 6218d740
      Waiman Long committed
      Both bootmem_data and bootmem_data_t structures are no longer defined.
      Remove the dummy forward declarations.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Link: http://lkml.kernel.org/r/20200326022617.26208-1-longman@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6218d740
    • I
      include/linux/memremap.h: remove stale comments · 1d90b649
      Ira Weiny committed
      Fixes: 80a72d0a ("memremap: remove the data field in struct dev_pagemap")
      Fixes: fdc029b1 ("memremap: remove the dev field in struct dev_pagemap")
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Link: http://lkml.kernel.org/r/20200316213205.145333-1-ira.weiny@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d90b649
    • S
      include/linux/swapops.h: correct guards for non_swap_entry() · 3f3673d7
      Steven Price committed
      If CONFIG_DEVICE_PRIVATE is defined, but neither CONFIG_MEMORY_FAILURE nor
      CONFIG_MIGRATION, then non_swap_entry() will return 0, meaning that the
      condition (non_swap_entry(entry) && is_device_private_entry(entry)) in
      zap_pte_range() will never be true even if the entry is a device private
      one.
      
      Equally any other code depending on non_swap_entry() will not function as
      expected.
      
      I originally spotted this just by looking at the code; I haven't actually
      observed any problems.
      
      Looking a bit more closely it appears that actually this situation
      (currently at least) cannot occur:
      
      DEVICE_PRIVATE depends on ZONE_DEVICE
      ZONE_DEVICE depends on MEMORY_HOTREMOVE
      MEMORY_HOTREMOVE depends on MIGRATION
      
      Fixes: 5042db43 ("mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory")
      Signed-off-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Link: http://lkml.kernel.org/r/20200305130550.22693-1-steven.price@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f3673d7
    • C
      mm: fix ambiguous comments for better code readability · 552657b7
      chenqiwu committed
      The @pfn parameter that callers pass to remap_pfn_range() is actually a
      page-frame number converted from the corresponding physical address of
      kernel memory.  The original comment is ambiguous and may mislead users.
      
      Meanwhile, there is an ambiguous typo, "VMM", in the comment of
      vm_area_struct.  Fixing both makes the code more readable.
      Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1583026921-15279-1-git-send-email-qiwuchen55@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      552657b7
    • D
      mm/memory_hotplug: allow to specify a default online_type · 5f47adf7
      David Hildenbrand committed
      For now, distributions implement advanced udev rules to essentially
      - Don't online any hotplugged memory (s390x)
      - Online all memory to ZONE_NORMAL (e.g., most virt environments like
        hyperv)
      - Online all memory to ZONE_MOVABLE in case the zone imbalance is taken
        care of (e.g., bare metal, special virt environments)
      
      In summary: All memory is usually onlined the same way, however, the
      kernel always has to ask user space to come up with the same answer.
      E.g., Hyper-V always waits for a memory block to get onlined before
      continuing, otherwise it might end up adding memory faster than
      onlining it, which can result in strange OOM situations.  This waiting
      slows down adding of a bigger amount of memory.
      
      Let's allow to specify a default online_type, not just "online" and
      "offline".  This allows distributions to configure the default online_type
      when booting up and be done with it.
      
      We can now specify "offline", "online", "online_movable" and
      "online_kernel" via
      - "memhp_default_state=" on the kernel cmdline
      - /sys/devices/system/memory/auto_online_blocks
      just like we are able to specify for a single memory block via
      /sys/devices/system/memory/memoryX/state
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-9-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f47adf7
    • D
      mm/memory_hotplug: convert memhp_auto_online to store an online_type · 862919e5
      David Hildenbrand committed
      ...  and rename it to memhp_default_online_type.  This is a preparation
      for more detailed default online behavior.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-8-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      862919e5
    • D
      drivers/base/memory: map MMOP_OFFLINE to 0 · efc978ad
      David Hildenbrand committed
      Historically, we used the value -1.  Just treat 0 as the special case now.
      Clarify a comment (which was wrong, when we come via device_online() the
      first time, the online_type would have been 0 / MEM_ONLINE).  The default
      is now always MMOP_OFFLINE.  This removes the last user of the manual
      "-1", which didn't use the enum value.
      
      This is a preparation to use the online_type as an array index.
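      
      A small sketch of why a zero-based enum helps, assuming the four
      user-visible names "offline", "online", "online_kernel" and
      "online_movable": the value can index a name table directly.  The
      identifiers and the ordering of the non-zero values are illustrative
      assumptions, not the kernel's definitions.

```c
#include <string.h>

/* Hypothetical online types; offline is deliberately 0. */
enum online_type_sketch {
	OT_OFFLINE = 0,
	OT_ONLINE,
	OT_ONLINE_KERNEL,
	OT_ONLINE_MOVABLE,
	OT_COUNT
};

/* Name table indexed by the enum value itself. */
static const char *const online_type_names_sketch[OT_COUNT] = {
	[OT_OFFLINE]        = "offline",
	[OT_ONLINE]         = "online",
	[OT_ONLINE_KERNEL]  = "online_kernel",
	[OT_ONLINE_MOVABLE] = "online_movable",
};

/* Map a user-supplied string back to its type, or -1 if unknown. */
static int online_type_from_str_sketch(const char *str)
{
	for (int i = 0; i < OT_COUNT; i++) {
		if (strcmp(str, online_type_names_sketch[i]) == 0)
			return i;
	}
	return -1;
}
```

      With -1 as the offline value, such a table would need an offset or a
      special case; starting at 0 avoids both.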
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      efc978ad
    • D
      drivers/base/memory: rename MMOP_ONLINE_KEEP to MMOP_ONLINE · 956f8b44
      David Hildenbrand committed
      Patch series "mm/memory_hotplug: allow to specify a default online_type", v3.
      
      Distributions nowadays use udev rules ([1] [2]) to specify if and how to
      online hotplugged memory.  The rules seem to get more complex with many
      special cases.  Due to the various special cases,
      CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used.  All memory hotplug
      is handled via udev rules.
      
      Every time we hotplug memory, the udev rule will come to the same
      conclusion.  Especially Hyper-V (but also soon virtio-mem) add a lot of
      memory in separate memory blocks and wait for memory to get onlined by
      user space before continuing to add more memory blocks (to not add memory
      faster than it is getting onlined).  This of course slows down the whole
      memory hotplug process.
      
      To make the job of distributions easier and to avoid udev rules that get
      more and more complicated, let's extend the mechanism provided by
      - /sys/devices/system/memory/auto_online_blocks
      - "memhp_default_state=" on the kernel cmdline
      to be able to specify also "online_movable" as well as "online_kernel"
      
      === Example /usr/libexec/config-memhotplug ===
      
      #!/bin/bash
      
      VIRT=`systemd-detect-virt --vm`
      ARCH=`uname -p`
      
      sense_virtio_mem() {
        if [ -d "/sys/bus/virtio/drivers/virtio_mem/" ]; then
          DEVICES=`find /sys/bus/virtio/drivers/virtio_mem/ -maxdepth 1 -type l | wc -l`
          if [ $DEVICES != "0" ]; then
              return 0
          fi
        fi
        return 1
      }
      
      if [ ! -e "/sys/devices/system/memory/auto_online_blocks" ]; then
        echo "Memory hotplug configuration support missing in the kernel"
        exit 1
      fi
      
      if grep "memhp_default_state=" /proc/cmdline > /dev/null; then
        echo "Memory hotplug configuration overridden in kernel cmdline (memhp_default_state=)"
        exit 1
      fi
      
      if [ $VIRT == "microsoft" ]; then
        echo "Detected Hyper-V on $ARCH"
        # Hyper-V wants all memory in ZONE_NORMAL
        ONLINE_TYPE="online_kernel"
      elif sense_virtio_mem; then
        echo "Detected virtio-mem on $ARCH"
        # virtio-mem wants all memory in ZONE_NORMAL
        ONLINE_TYPE="online_kernel"
      elif [ $ARCH == "s390x" ] || [ $ARCH == "s390" ]; then
        echo "Detected $ARCH"
        # standby memory should not be onlined automatically
        ONLINE_TYPE="offline"
      elif [ $ARCH == "ppc64" ] || [ $ARCH == "ppc64le" ]; then
        echo "Detected" $ARCH
        # PPC64 onlines all hotplugged memory right from the kernel
        ONLINE_TYPE="offline"
      elif [ $VIRT == "none" ]; then
        echo "Detected bare-metal on $ARCH"
        # Bare metal users expect hotplugged memory to be unpluggable. We assume
        # that ZONE imbalances on such enterprise servers cannot happen and is
        # properly documented
        ONLINE_TYPE="online_movable"
      else
        # TODO: Hypervisors that want to unplug DIMMs and can guarantee that ZONE
        # imbalances won't happen
        echo "Detected $VIRT on $ARCH"
        # Usually, ballooning is used in virtual environments, so memory should go to
        # ZONE_NORMAL. However, sometimes "movable_node" is relevant.
        ONLINE_TYPE="online"
      fi
      
      echo "Selected online_type:" $ONLINE_TYPE
      
      # Configure what to do with memory that will be hotplugged in the future
      echo $ONLINE_TYPE 2>/dev/null > /sys/devices/system/memory/auto_online_blocks
      if [ $? != "0" ]; then
        echo "Memory hotplug cannot be configured (e.g., old kernel or missing permissions)"
        # A backup udev rule should handle old kernels if necessary
        exit 1
      fi
      
      # Process all already plugged blocks (e.g., DIMMs, but also Hyper-V or virtio-mem)
      if [ $ONLINE_TYPE != "offline" ]; then
        for MEMORY in /sys/devices/system/memory/memory*; do
          STATE=`cat $MEMORY/state`
          if [ $STATE == "offline" ]; then
              echo $ONLINE_TYPE > $MEMORY/state
          fi
        done
      fi
      
      === Example /usr/lib/systemd/system/config-memhotplug.service ===
      
      [Unit]
      Description=Configure memory hotplug behavior
      DefaultDependencies=no
      Conflicts=shutdown.target
      Before=sysinit.target shutdown.target
      After=systemd-modules-load.service
      ConditionPathExists=|/sys/devices/system/memory/auto_online_blocks
      
      [Service]
      ExecStart=/usr/libexec/config-memhotplug
      Type=oneshot
      TimeoutSec=0
      RemainAfterExit=yes
      
      [Install]
      WantedBy=sysinit.target
      
      === Example modification to the 40-redhat.rules [2] ===
      
      : diff --git a/40-redhat.rules b/40-redhat.rules-new
      : index 2c690e5..168fd03 100644
      : --- a/40-redhat.rules
      : +++ b/40-redhat.rules-new
      : @@ -6,6 +6,9 @@ SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}
      :  # Memory hotadd request
      :  SUBSYSTEM!="memory", GOTO="memory_hotplug_end"
      :  ACTION!="add", GOTO="memory_hotplug_end"
      : +# memory hotplug behavior configured
      : +PROGRAM=="grep online /sys/devices/system/memory/auto_online_blocks", GOTO="memory_hotplug_end"
      : +
      :  PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
      :
      :  ENV{.state}="online"
      
      ===
      
      [1] https://github.com/lnykryn/systemd-rhel/pull/281
      [2] https://github.com/lnykryn/systemd-rhel/blob/staging/rules/40-redhat.rules
      
      This patch (of 8):
      
      The name is misleading and it's not really clear what is "kept".  Let's
      just name it like the online_type name we expose to user space ("online").
      
      Add some documentation to the types.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: http://lkml.kernel.org/r/20200319131221.14044-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20200317104942.11178-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      956f8b44
    • B
      mm/sparse.c: only use subsection map in VMEMMAP case · 0a9f9f62
      Baoquan He committed
      Currently, to support subsection aligned memory region adding for pmem,
      subsection map is added to track which subsection is present.
      
      However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP.  This means the
      subsection map only makes sense when SPARSEMEM_VMEMMAP is enabled.  For
      the classic sparse, it's meaningless.  Even worse, it may confuse people
      when checking code related to the classic sparse.
      
      About the classic sparse which doesn't support subsection hotplug, Dan
      said it's more because the effort and maintenance burden outweighs the
      benefit.  Besides, the current 64 bit ARCHes all enable
      SPARSEMEM_VMEMMAP_ENABLE by default.
      
      Combining the above reasons, no need to provide subsection map and the
      relevant handling for the classic sparse.  Let's remove them.
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Link: http://lkml.kernel.org/r/20200312124414.439-4-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a9f9f62
    • D
      drivers/base/memory.c: drop section_count · 68c3a6ac
      David Hildenbrand committed
      Patch series "mm: drop superfluous section checks when onlining/offlining".
      
      Let's drop some superfluous section checks on the onlining/offlining path.
      
      This patch (of 3):
      
      Since commit c5e79ef5 ("mm/memory_hotplug.c: don't allow to
      online/offline memory blocks with holes") we have a generic check in
      offline_pages() that disallows offlining memory blocks with holes.
      
      Memory blocks with missing sections are just another variant of this type
      of block.  We can stop checking (and especially storing) present
      sections.  A proper error message is now printed explaining why offlining
      failed.
      
      section_count was initially introduced in commit 07681215 ("Driver
      core: Add section count to memory_block struct") in order to detect when
      it is okay to remove a memory block.  It was used in commit 26bbe7ef
      ("drivers/base/memory.c: prohibit offlining of memory blocks with missing
      sections") to disallow offlining memory blocks with missing sections.  As
      we refactored creation/removal of memory devices and have a proper check
      for holes in place, we can drop the section_count.
      
      This also removes a leftover comment regarding the mem_sysfs_mutex, which
      was removed in commit 848e19ad ("drivers/base/memory.c: drop the
      mem_sysfs_mutex").
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Link: http://lkml.kernel.org/r/20200127110424.5757-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68c3a6ac
    • S
      userfaultfd: wp: support write protection for userfault vma range · ffd05793
      Shaohua Li committed
      Add an API to enable/disable write protection for a vma range.  Unlike
      mprotect, this doesn't split/merge vmas.
      
      [peterx@redhat.com:
       - use the helper to find VMA;
       - return -ENOENT if not found to match mcopy case;
       - use the new MM_CP_UFFD_WP* flags for change_protection
       - check against mmap_changing for failures
       - replace find_dst_vma with vma_find_uffd]
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220163112.11409-13-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ffd05793
    • P
      userfaultfd: wp: support swap and page migration · f45ec5ff
      Peter Xu committed
      For both swap and page migration, we use bit 2 of the entry to
      identify whether this entry is uffd write-protected.  It plays a similar
      role to the existing soft dirty bit in swap entries but only for keeping
      the uffd-wp tracking for a specific PTE/PMD.
      
      Something special here is that when we want to recover the uffd-wp bit
      from a swap/migration entry to the PTE bit we'll also need to take care of
      the _PAGE_RW bit and make sure it's cleared, otherwise even with the
      _PAGE_UFFD_WP bit we can't trap it at all.
      
      In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
      That can lead to data mismatch if the page that we are going to write
      protect is swapped out when sending the UFFDIO_WRITEPROTECT.  This patch
      also applies/removes the uffd-wp bit even for the swap entries.
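
The round trip described above can be sketched as a toy model in C. Apart from "bit 2" of the swap entry, which comes from the text, every bit position and every `toy_` name here is an assumption made for illustration; the real layout is architecture-specific and defined in the kernel's page table headers.

```c
#include <stdint.h>

/* Hypothetical bit layout, for this sketch only. */
#define TOY_PTE_RW       (1u << 1)   /* stand-in for _PAGE_RW */
#define TOY_PTE_UFFD_WP  (1u << 10)  /* stand-in for _PAGE_UFFD_WP */
#define TOY_SWP_UFFD_WP  (1u << 2)   /* "bit 2 of the entry" */

/* Swap-out/migration: carry the uffd-wp marker into the swap entry. */
static uint32_t toy_pte_to_swp(uint32_t pte, uint32_t swp)
{
    if (pte & TOY_PTE_UFFD_WP)
        swp |= TOY_SWP_UFFD_WP;
    return swp;
}

/*
 * Swap-in: restore the marker on the PTE and clear the write bit,
 * otherwise the next write would not trap.
 */
static uint32_t toy_swp_to_pte(uint32_t swp, uint32_t pte)
{
    if (swp & TOY_SWP_UFFD_WP) {
        pte |= TOY_PTE_UFFD_WP;
        pte &= ~TOY_PTE_RW;
    }
    return pte;
}
```

The point of the second helper is the pairing: setting the uffd-wp marker without also clearing the write bit would leave the page silently writable.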
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f45ec5ff
    • P
      userfaultfd: wp: apply _PAGE_UFFD_WP bit · 292924b2
      Peter Xu committed
      First, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
      change_protection() when used with uffd-wp, and make sure the two new
      flags are never used together.  Then,
      
        - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
          when a range of memory is write protected by uffd
      
        - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
          _PAGE_RW when write protection is resolved from userspace
      
      And use this new interface in mwriteprotect_range() to replace the old
      MM_CP_DIRTY_ACCT.
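
A minimal sketch of the two flag behaviours listed above, written as a user-space toy in C. The `TOY_` constants and `toy_change_pte()` are hypothetical stand-ins, not the kernel's definitions, and the sketch restores the write bit unconditionally on resolve, which is a simplification.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; the kernel's real values differ. */
#define TOY_PTE_RW              (1u << 1)
#define TOY_PTE_UFFD_WP         (1u << 10)
#define TOY_CP_UFFD_WP          (1u << 2)
#define TOY_CP_UFFD_WP_RESOLVE  (1u << 3)

static uint32_t toy_change_pte(uint32_t pte, unsigned int cp_flags)
{
    int wp      = (cp_flags & TOY_CP_UFFD_WP) != 0;
    int resolve = (cp_flags & TOY_CP_UFFD_WP_RESOLVE) != 0;

    /* The two requests must never be combined. */
    assert(!(wp && resolve));

    if (wp) {
        /* Write protect: set the marker, drop write permission. */
        pte |= TOY_PTE_UFFD_WP;
        pte &= ~TOY_PTE_RW;
    } else if (resolve) {
        /* Resolve: drop the marker, restore write permission. */
        pte &= ~TOY_PTE_UFFD_WP;
        pte |= TOY_PTE_RW;
    }
    return pte;
}
```

Passing neither flag leaves the entry untouched, which is how callers unrelated to uffd-wp would behave in this model.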
      
      Do this change for both PTEs and huge PMDs.  Then we can start to identify
      which PTE/PMD is write protected by general mechanisms (e.g., COW or soft
      dirty tracking), and which is for userfaultfd-wp.
      
      Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
      into _PAGE_CHG_MASK as well.  Meanwhile, since we have this new bit, we
      can be even more strict when detecting uffd-wp page faults in either
      do_wp_page() or wp_huge_pmd().
      
      With _PAGE_UFFD_WP in place, a special case is when a page is protected
      both by the general COW logic and by userfault-wp.  Here userfault-wp
      has higher priority and is handled first.  Only after the uffd-wp bit is
      cleared on the PTE/PMD do we continue to handle the general COW.  These
      are the steps that happen with such a page:
      
        1. CPU accesses write protected shared page (so both protected by
           general COW and uffd-wp), blocked by uffd-wp first because in
           do_wp_page we'll handle uffd-wp first, so it has higher priority
           than general COW.
      
        2. Uffd service thread receives the request and does UFFDIO_WRITEPROTECT
           to remove the uffd-wp bit from the PTE/PMD.  However, here we
           still keep the write bit cleared.  Notify the blocked CPU.
      
        3. The blocked CPU resumes the page fault process with a fault
           retry.  During the retry it notices the uffd-wp bit is no longer
           set but the page is still write protected by general COW, so
           it goes through the COW path in the fault handler, copies the
           page, applies the write bit where necessary, and retries again.
      
        4. The CPU will be able to access this page with write bit set.
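
The four steps above can be traced with a small state model in C. This is only an illustration of the ordering (uffd-wp first, then COW); the bit values and all `toy_` names are assumptions, and the page copy itself is not modelled.

```c
#include <stdint.h>

/* Hypothetical bit layout, for this sketch only. */
#define TOY_PTE_RW       (1u << 1)
#define TOY_PTE_UFFD_WP  (1u << 10)

enum toy_fault { TOY_NO_FAULT, TOY_UFFD_WP_FAULT, TOY_COW_FAULT };

/* Classify a write access: uffd-wp is checked before general COW. */
static enum toy_fault toy_write_access(uint32_t pte)
{
    if (pte & TOY_PTE_RW)
        return TOY_NO_FAULT;
    if (pte & TOY_PTE_UFFD_WP)
        return TOY_UFFD_WP_FAULT;
    return TOY_COW_FAULT;
}

/* Step 2: userspace resolves uffd-wp; the write bit stays clear. */
static uint32_t toy_uffd_resolve(uint32_t pte)
{
    return pte & ~TOY_PTE_UFFD_WP;
}

/* Step 3: the COW path (copy not modelled) applies the write bit. */
static uint32_t toy_cow(uint32_t pte)
{
    return pte | TOY_PTE_RW;
}
```

Starting from an entry with only the uffd-wp marker set, successive write attempts classify as a uffd-wp fault, then a COW fault, then no fault, matching steps 1, 3 and 4.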
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      292924b2
    • P
      mm: merge parameters for change_protection() · 58705444
      Peter Xu committed
      change_protection() was used by either the NUMA or mprotect() code, and
      there's one parameter for each of the callers (dirty_accountable and
      prot_numa).  Further, these parameters are passed along the calls:
      
        - change_protection_range()
        - change_p4d_range()
        - change_pud_range()
        - change_pmd_range()
        - ...
      
      Now we introduce a flag for change_protection() and all these helpers to
      replace these parameters.  Then we can avoid passing multiple parameters
      multiple times along the way.
      
      More importantly, it'll greatly simplify the work if we want to introduce
      any new parameters to change_protection().  In the follow up patches, a
      new parameter for userfaultfd write protection will be introduced.
      
      No functional change at all.
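
To illustrate the refactoring pattern, here is a sketch in C that forwards two booleans through a call chain and then does the same with a single flags word. The `TOY_` flag values, the `toy_` functions and the `toy_action` result are invented for this example and only mimic the shape of the change.

```c
#include <stdbool.h>

/* Hypothetical flag bits standing in for the merged parameters. */
#define TOY_CP_DIRTY_ACCT  (1UL << 0)
#define TOY_CP_PROT_NUMA   (1UL << 1)

enum toy_action { TOY_PLAIN, TOY_DIRTY_ACCT, TOY_NUMA_HINT };

/* Before: every level forwards one bool per caller. */
static enum toy_action toy_pte_range_old(bool dirty_accountable, bool prot_numa)
{
    if (prot_numa)
        return TOY_NUMA_HINT;
    return dirty_accountable ? TOY_DIRTY_ACCT : TOY_PLAIN;
}

static enum toy_action toy_change_protection_old(bool dirty_accountable,
                                                 bool prot_numa)
{
    return toy_pte_range_old(dirty_accountable, prot_numa);
}

/* After: one flags word travels down; only the leaf decodes it. */
static enum toy_action toy_pte_range(unsigned long cp_flags)
{
    bool dirty_accountable = cp_flags & TOY_CP_DIRTY_ACCT;
    bool prot_numa = cp_flags & TOY_CP_PROT_NUMA;

    if (prot_numa)
        return TOY_NUMA_HINT;
    return dirty_accountable ? TOY_DIRTY_ACCT : TOY_PLAIN;
}

static enum toy_action toy_change_protection(unsigned long cp_flags)
{
    return toy_pte_range(cp_flags);
}
```

With the flags form, adding a new option means defining another bit and decoding it at the leaf, without touching the signatures of the intermediate helpers.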
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58705444
    • A
      userfaultfd: wp: add UFFDIO_COPY_MODE_WP · 72981e0e
      Andrea Arcangeli committed
      This allows UFFDIO_COPY to map pages write-protected.
      
      [peterx@redhat.com: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets
       around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and
       commit messages]
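
From user space, the new mode would be requested roughly as in the following C sketch. It assumes kernel headers recent enough to define `UFFDIO_COPY_MODE_WP` in `<linux/userfaultfd.h>`; the helper names `copy_wp_request()` and `uffd_copy_wp()` are hypothetical, and `uffd` is assumed to be a userfaultfd descriptor whose destination range was registered for write-protect tracking.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Build a UFFDIO_COPY request that maps the page write-protected. */
static struct uffdio_copy copy_wp_request(void *dst, const void *src,
                                          size_t len)
{
    struct uffdio_copy c;

    memset(&c, 0, sizeof(c));
    c.dst = (uintptr_t)dst;
    c.src = (uintptr_t)src;
    c.len = len;
    c.mode = UFFDIO_COPY_MODE_WP;
    return c;
}

/* Issue the request; returns 0 on success, -1 with errno set on failure. */
static int uffd_copy_wp(int uffd, void *dst, const void *src, size_t len)
{
    struct uffdio_copy c = copy_wp_request(dst, src, len);

    return ioctl(uffd, UFFDIO_COPY, &c);
}
```

A page installed this way is expected to raise a write-protect fault on the first write, so the fault handler sees it before the data is modified.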
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-6-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72981e0e
    • A
      userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers · 55adf4de
      Andrea Arcangeli committed
      Implement helper methods to invoke userfaultfd wp faults more
      selectively: not whenever a wp fault triggers on a vma with VM_UFFD_WP
      set in vma->vm_flags, but only if the _PAGE_UFFD_WP bit is set in the
      pagetable too.
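
The combined check can be pictured with the following C sketch. It is a user-space toy, not the kernel helper: the bit values and the name `toy_userfaultfd_pte_wp()` are assumptions chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, for this sketch only. */
#define TOY_VM_UFFD_WP   (1UL << 12)  /* stand-in for VM_UFFD_WP */
#define TOY_PTE_UFFD_WP  (1u << 10)   /* stand-in for _PAGE_UFFD_WP */

/*
 * A wp fault goes to userfaultfd only when the vma is registered
 * for wp tracking AND this particular entry carries the marker.
 */
static bool toy_userfaultfd_pte_wp(unsigned long vm_flags, uint32_t pte)
{
    return (vm_flags & TOY_VM_UFFD_WP) && (pte & TOY_PTE_UFFD_WP);
}
```

Requiring both conditions lets a write fault on an unmarked entry inside a registered vma fall through to the ordinary handling, such as COW.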
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-5-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55adf4de
    • S
      userfaultfd: wp: add helper for writeprotect check · 1df319e0
      Shaohua Li committed
      Patch series "userfaultfd: write protection support", v6.
      
      Overview
      ========
      
      The uffd-wp work was initiated by Shaohua Li [1], and later continued by
      Andrea [2].  This series is based upon Andrea's latest userfaultfd tree,
      and it continues the work of both Shaohua and Andrea.  Many of the
      follow up ideas come from Andrea too.
      
      Besides the old MISSING register mode of userfaultfd, the new uffd-wp
      support provides an alternative register mode called
      UFFDIO_REGISTER_MODE_WP that can be used to listen for write protection
      page faults in addition to missing page faults; the two modes can even
      be registered together.  At the same time, the new feature also provides
      a new userfaultfd ioctl called UFFDIO_WRITEPROTECT which allows
      userspace to write protect a range of memory or fix up the write
      permission of faulted pages.
      
      Please refer to the document patch "userfaultfd: wp:
      UFFDIO_REGISTER_MODE_WP documentation update" for more information on the
      new interface and what it can do.
      
      The major workflow of an uffd-wp program should be:
      
        1. Register a memory region with WP mode using UFFDIO_REGISTER_MODE_WP
      
        2. Write protect part of the whole registered region using
           UFFDIO_WRITEPROTECT, passing in UFFDIO_WRITEPROTECT_MODE_WP to
           show that we want to write protect the range.
      
        3. Start a working thread that modifies the protected pages,
           meanwhile listening to UFFD messages.
      
        4. When a write hits the protected range, a page fault happens and
           a UFFD message is generated and reported to the page fault
           handling thread.
      
        5. The page fault handler thread resolves the page fault using the
           new UFFDIO_WRITEPROTECT ioctl, but this time passing in
           !UFFDIO_WRITEPROTECT_MODE_WP instead, showing that we want to
           recover the write permission.  Before this operation, the fault
           handler thread can do anything it wants, e.g., dump the page to
           persistent storage.
      
        6. The worker thread will continue running with the correctly
           applied write permission from step 5.
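
Steps 1, 2 and 5 of the workflow map onto two ioctls, sketched below in C. The sketch assumes kernel headers that already define `UFFDIO_REGISTER_MODE_WP` and `UFFDIO_WRITEPROTECT` in `<linux/userfaultfd.h>`; the helper names `uffd_register_wp()`, `wp_request()` and `uffd_writeprotect()` are hypothetical, and `uffd` is assumed to be an already-opened userfaultfd descriptor that has completed the `UFFDIO_API` handshake.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Step 1: register [addr, addr + len) for write-protect faults. */
static int uffd_register_wp(int uffd, void *addr, size_t len)
{
    struct uffdio_register reg;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (uintptr_t)addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_WP;
    return ioctl(uffd, UFFDIO_REGISTER, &reg);
}

/* Build the request used by steps 2 (protect) and 5 (resolve). */
static struct uffdio_writeprotect wp_request(void *addr, size_t len,
                                             int protect)
{
    struct uffdio_writeprotect wp;

    memset(&wp, 0, sizeof(wp));
    wp.range.start = (uintptr_t)addr;
    wp.range.len = len;
    /* Mode flag set: write protect.  Mode flag clear: restore write. */
    wp.mode = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
    return wp;
}

/* Steps 2 and 5: returns 0 on success, -1 with errno set on failure. */
static int uffd_writeprotect(int uffd, void *addr, size_t len, int protect)
{
    struct uffdio_writeprotect wp = wp_request(addr, len, protect);

    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}
```

Opening the descriptor and reading fault messages from it (steps 3 and 4) are left out to keep the sketch short. A handler would typically save the page contents first and only then call `uffd_writeprotect(uffd, page, len, 0)` to let the blocked writer continue.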
      
      Currently there are already two projects that are based on this new
      userfaultfd feature.
      
      QEMU Live Snapshot: The project provides a way to allow the QEMU
                          hypervisor to take snapshots of VMs without
                          stopping the VM [3].
      
      LLNL umap library:  The project provides a mmap-like interface and
                          "allow to have an application specific buffer of
                          pages cached from a large file, i.e. out-of-core
                          execution using memory map" [4][5].
      
      Before posting the patchset, this series was smoke tested against QEMU
      live snapshot and the LLNL umap library (by doing parallel quicksort using
      128 sorting threads + 80 uffd servicing threads).  My sincere thanks to
      Marty McFadden and Denis Plotnikov for the help along the way.
      
      TODO
      ====
      
      - hugetlbfs/shmem support
      - performance
      - more architectures
      - cooperate with mprotect()-allowed processes (???)
      - ...
      
      References
      ==========
      
      [1] https://lwn.net/Articles/666187/
      [2] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/?h=userfault
      [3] https://github.com/denis-plotnikov/qemu/commits/background-snapshot-kvm
      [4] https://github.com/LLNL/umap
      [5] https://llnl-umap.readthedocs.io/en/develop/
      [6] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b245ecf6cf59156966f3da6e6b674f6695a5ffa5
      [7] https://lkml.org/lkml/2018/11/21/370
      [8] https://lkml.org/lkml/2018/12/30/64
      
      This patch (of 19):
      
      Add helper for writeprotect check. Will use it later.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220163112.11409-2-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1df319e0