1. 08 4月, 2020 40 次提交
    • D
      drivers/base/memory: rename MMOP_ONLINE_KEEP to MMOP_ONLINE · 956f8b44
      David Hildenbrand 提交于
      Patch series "mm/memory_hotplug: allow to specify a default online_type", v3.
      
      Distributions nowadays use udev rules ([1] [2]) to specify if and how to
      online hotplugged memory.  The rules seem to get more complex with many
      special cases.  Due to the various special cases,
      CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used.  All memory hotplug
      is handled via udev rules.
      
      Every time we hotplug memory, the udev rule will come to the same
      conclusion.  Especially Hyper-V (but also soon virtio-mem) add a lot of
      memory in separate memory blocks and wait for memory to get onlined by
      user space before continuing to add more memory blocks (to not add memory
      faster than it is getting onlined).  This of course slows down the whole
      memory hotplug process.
      
      To make the job of distributions easier and to avoid udev rules that get
      more and more complicated, let's extend the mechanism provided by
      - /sys/devices/system/memory/auto_online_blocks
      - "memhp_default_state=" on the kernel cmdline
      to be able to specify also "online_movable" as well as "online_kernel"
      
      === Example /usr/libexec/config-memhotplug ===
      
      #!/bin/bash
      
      VIRT=`systemd-detect-virt --vm`
      ARCH=`uname -p`
      
      sense_virtio_mem() {
        if [ -d "/sys/bus/virtio/drivers/virtio_mem/" ]; then
          DEVICES=`find /sys/bus/virtio/drivers/virtio_mem/ -maxdepth 1 -type l | wc -l`
          if [ $DEVICES != "0" ]; then
              return 0
          fi
        fi
        return 1
      }
      
      if [ ! -e "/sys/devices/system/memory/auto_online_blocks" ]; then
        echo "Memory hotplug configuration support missing in the kernel"
        exit 1
      fi
      
      if grep "memhp_default_state=" /proc/cmdline > /dev/null; then
        echo "Memory hotplug configuration overridden in kernel cmdline (memhp_default_state=)"
        exit 1
      fi
      
      if [ $VIRT == "microsoft" ]; then
        echo "Detected Hyper-V on $ARCH"
        # Hyper-V wants all memory in ZONE_NORMAL
        ONLINE_TYPE="online_kernel"
      elif sense_virtio_mem; then
        echo "Detected virtio-mem on $ARCH"
        # virtio-mem wants all memory in ZONE_NORMAL
        ONLINE_TYPE="online_kernel"
      elif [ $ARCH == "s390x" ] || [ $ARCH == "s390" ]; then
        echo "Detected $ARCH"
        # standby memory should not be onlined automatically
        ONLINE_TYPE="offline"
      elif [ $ARCH == "ppc64" ] || [ $ARCH == "ppc64le" ]; then
        echo "Detected" $ARCH
        # PPC64 onlines all hotplugged memory right from the kernel
        ONLINE_TYPE="offline"
      elif [ $VIRT == "none" ]; then
        echo "Detected bare-metal on $ARCH"
        # Bare metal users expect hotplugged memory to be unpluggable. We assume
        # that ZONE imbalances on such enterpise servers cannot happen and is
        # properly documented
        ONLINE_TYPE="online_movable"
      else
        # TODO: Hypervisors that want to unplug DIMMs and can guarantee that ZONE
        # imbalances won't happen
        echo "Detected $VIRT on $ARCH"
        # Usually, ballooning is used in virtual environments, so memory should go to
        # ZONE_NORMAL. However, sometimes "movable_node" is relevant.
        ONLINE_TYPE="online"
      fi
      
      echo "Selected online_type:" $ONLINE_TYPE
      
      # Configure what to do with memory that will be hotplugged in the future
      echo $ONLINE_TYPE 2>/dev/null > /sys/devices/system/memory/auto_online_blocks
      if [ $? != "0" ]; then
        echo "Memory hotplug cannot be configured (e.g., old kernel or missing permissions)"
        # A backup udev rule should handle old kernels if necessary
        exit 1
      fi
      
      # Process all already pluggedd blocks (e.g., DIMMs, but also Hyper-V or virtio-mem)
      if [ $ONLINE_TYPE != "offline" ]; then
        for MEMORY in /sys/devices/system/memory/memory*; do
          STATE=`cat $MEMORY/state`
          if [ $STATE == "offline" ]; then
              echo $ONLINE_TYPE > $MEMORY/state
          fi
        done
      fi
      
      === Example /usr/lib/systemd/system/config-memhotplug.service ===
      
      [Unit]
      Description=Configure memory hotplug behavior
      DefaultDependencies=no
      Conflicts=shutdown.target
      Before=sysinit.target shutdown.target
      After=systemd-modules-load.service
      ConditionPathExists=|/sys/devices/system/memory/auto_online_blocks
      
      [Service]
      ExecStart=/usr/libexec/config-memhotplug
      Type=oneshot
      TimeoutSec=0
      RemainAfterExit=yes
      
      [Install]
      WantedBy=sysinit.target
      
      === Example modification to the 40-redhat.rules [2] ===
      
      : diff --git a/40-redhat.rules b/40-redhat.rules-new
      : index 2c690e5..168fd03 100644
      : --- a/40-redhat.rules
      : +++ b/40-redhat.rules-new
      : @@ -6,6 +6,9 @@ SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}
      :  # Memory hotadd request
      :  SUBSYSTEM!="memory", GOTO="memory_hotplug_end"
      :  ACTION!="add", GOTO="memory_hotplug_end"
      : +# memory hotplug behavior configured
      : +PROGRAM=="grep online /sys/devices/system/memory/auto_online_blocks", GOTO="memory_hotplug_end"
      : +
      :  PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
      :
      :  ENV{.state}="online"
      
      ===
      
      [1] https://github.com/lnykryn/systemd-rhel/pull/281
      [2] https://github.com/lnykryn/systemd-rhel/blob/staging/rules/40-redhat.rules
      
      This patch (of 8):
      
      The name is misleading and it's not really clear what is "kept".  Let's
      just name it like the online_type name we expose to user space ("online").
      
      Add some documentation to the types.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: http://lkml.kernel.org/r/20200319131221.14044-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20200317104942.11178-2-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      956f8b44
    • B
      mm/sparse.c: move subsection_map related functions together · 6ecb0fc6
      Baoquan He 提交于
      No functional change.
      
      [bhe@redhat.com: move functions into CONFIG_MEMORY_HOTPLUG ifdeffery scope]
        Link: http://lkml.kernel.org/r/20200316045804.GC3486@MiWiFi-R3L-srvSigned-off-by: NBaoquan He <bhe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200312124414.439-6-bhe@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6ecb0fc6
    • B
      mm/sparse.c: add note about only VMEMMAP supporting sub-section hotplug · 95a5a34d
      Baoquan He 提交于
      And tell check_pfn_span() gating the porper alignment and size of hot
      added memory region.
      
      And also move the code comments from inside section_deactivate() to being
      above it.  The code comments are reasonable for the whole function, and
      the moving makes code cleaner.
      Signed-off-by: NBaoquan He <bhe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Link: http://lkml.kernel.org/r/20200312124414.439-5-bhe@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      95a5a34d
    • B
      mm/sparse.c: only use subsection map in VMEMMAP case · 0a9f9f62
      Baoquan He 提交于
      Currently, to support subsection aligned memory region adding for pmem,
      subsection map is added to track which subsection is present.
      
      However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP.  It means
      subsection map only makes sense when SPARSEMEM_VMEMMAP enabled.  For the
      classic sparse, it's meaningless.  Even worse, it may confuse people when
      checking code related to the classic sparse.
      
      About the classic sparse which doesn't support subsection hotplug, Dan
      said it's more because the effort and maintenance burden outweighs the
      benefit.  Besides, the current 64 bit ARCHes all enable
      SPARSEMEM_VMEMMAP_ENABLE by default.
      
      Combining the above reasons, no need to provide subsection map and the
      relevant handling for the classic sparse.  Let's remove them.
      Signed-off-by: NBaoquan He <bhe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Link: http://lkml.kernel.org/r/20200312124414.439-4-bhe@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a9f9f62
    • B
      mm/sparse.c: introduce a new function clear_subsection_map() · 37bc1502
      Baoquan He 提交于
      Factor out the code which clear subsection map of one memory region from
      section_deactivate() into clear_subsection_map().
      
      And also add helper function is_subsection_map_empty() to check if the
      current subsection map is empty or not.
      Signed-off-by: NBaoquan He <bhe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Link: http://lkml.kernel.org/r/20200312124414.439-3-bhe@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      37bc1502
    • B
      mm/sparse.c: introduce new function fill_subsection_map() · 5d87255c
      Baoquan He 提交于
      Patch series "mm/hotplug: Only use subsection map for VMEMMAP", v4.
      
      Memory sub-section hotplug was added to fix the issue that nvdimm could be
      mapped at non-section aligned starting address.  A subsection map is added
      into struct mem_section_usage to implement it.
      
      However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP.  It means
      subsection map only makes sense when SPARSEMEM_VMEMMAP enabled.  For the
      classic sparse, subsection map is meaningless and confusing.
      
      About the classic sparse which doesn't support subsection hotplug, Dan
      said it's more because the effort and maintenance burden outweighs the
      benefit.  Besides, the current 64 bit ARCHes all enable
      SPARSEMEM_VMEMMAP_ENABLE by default.
      
      This patch (of 5):
      
      Factor out the code that fills the subsection map from section_activate()
      into fill_subsection_map(), this makes section_activate() cleaner and
      easier to follow.
      Signed-off-by: NBaoquan He <bhe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200312124414.439-2-bhe@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d87255c
    • D
      mm/memory_hotplug.c: cleanup __add_pages() · 6cdd0b30
      David Hildenbrand 提交于
      Let's drop the basically unused section stuff and simplify.  The logic now
      matches the logic in __remove_pages().
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Reviewed-by: NWei Yang <richard.weiyang@gmail.com>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Link: http://lkml.kernel.org/r/20200228095819.10750-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6cdd0b30
    • D
      mm/memory_hotplug.c: simplify calculation of number of pages in __remove_pages() · a11b9419
      David Hildenbrand 提交于
      In commit 52fb87c8 ("mm/memory_hotplug: cleanup __remove_pages()"), we
      cleaned up __remove_pages(), and introduced a shorter variant to calculate
      the number of pages to the next section boundary.
      
      Turns out we can make this calculation easier to read.  We always want to
      have the number of pages (> 0) to the next section boundary, starting from
      the current pfn.
      
      We'll clean up __remove_pages() in a follow-up patch and directly make use
      of this computation.
      Suggested-by: NSegher Boessenkool <segher@kernel.crashing.org>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Reviewed-by: NWei Yang <richard.weiyang@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Link: http://lkml.kernel.org/r/20200228095819.10750-2-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a11b9419
    • B
      mm/memory_hotplug.c: only respect mem= parameter during boot stage · f3cd4c86
      Baoquan He 提交于
      In commit 357b4da5 ("x86: respect memory size limiting via mem=
      parameter") a global varialbe max_mem_size is added to store the value
      parsed from 'mem= ', then checked when memory region is added.  This truly
      stops those DIMMs from being added into system memory during boot-time.
      
      However, it also limits the later memory hotplug functionality.  Any DIMM
      can't be hotplugged any more if its region is beyond the max_mem_size.  We
      will get errors like:
      
      [  216.387164] acpi PNP0C80:02: add_memory failed
      [  216.389301] acpi PNP0C80:02: acpi_memory_enable_device() error
      [  216.392187] acpi PNP0C80:02: Enumeration failure
      
      This will cause issue in a known use case where 'mem=' is added to the
      hypervisor.  The memory that lies after 'mem=' boundary will be assigned
      to KVM guests.  After commit 357b4da5 merged, memory can't be extended
      dynamically if system memory on hypervisor is not sufficient.
      
      So fix it by also checking if it's during boot-time restricting to add
      memory.  Otherwise, skip the restriction.
      
      And also add this use case to document of 'mem=' kernel parameter.
      
      Fixes: 357b4da5 ("x86: respect memory size limiting via mem= parameter")
      Signed-off-by: NBaoquan He <bhe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200204050643.20925-1-bhe@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f3cd4c86
    • D
      mm/page_ext.c: drop pfn_present() check when onlining · dccacf8d
      David Hildenbrand 提交于
      Since commit c5e79ef5 ("mm/memory_hotplug.c: don't allow to
      online/offline memory blocks with holes") we disallow to offline any
      memory with holes.  As all boot memory is online and hotplugged memory
      cannot contain holes, we never online memory with holes.
      
      This present check can be dropped.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Link: http://lkml.kernel.org/r/20200127110424.5757-4-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dccacf8d
    • D
      drivers/base/memory.c: drop pages_correctly_probed() · fada9ae3
      David Hildenbrand 提交于
      pages_correctly_probed() is a leftover from ancient times.  It dates back
      to commit 3947be19 ("[PATCH] memory hotplug: sysfs and add/remove
      functions"), where Pg_reserved checks were added as a sfety net:
      
      	/*
      	 * The probe routines leave the pages reserved, just
      	 * as the bootmem code does.  Make sure they're still
      	 * that way.
      	 */
      
      The checks were refactored quite a bit over the years, especially in
      commit b77eab70 ("mm/memory_hotplug: optimize probe routine"), where
      checks for present, valid, and online sections were added.
      
      Hotplugged memory is added via add_memory(), which will create the full
      memmap for the hotplugged memory, and mark all sections valid and present.
      
      Only full memory blocks are onlined/offlined, so we also cannot have an
      inconsistency in that regard (especially, memory blocks with some sections
      being online and some being offline).
      
      1. Boot memory always starts online.  Since commit c5e79ef5
         ("mm/memory_hotplug.c: don't allow to online/offline memory blocks with
         holes") we disallow to offline any memory with holes.  Therefore, we
         never online memory with holes.  Present and validity checks are
         superfluous.
      
      2. Only complete memory blocks are onlined/offlined (and especially,
         the state - online or offline - is stored for whole memory blocks).
         Besides the core, only arch/powerpc/platforms/powernv/memtrace.c
         manually calls offline_pages() and fiddels with memory block states.
         But it also only offlines complete memory blocks.
      
      3. To make any of these conditions trigger, something would have to be
         terribly messed up in the core.  (e.g., online/offline only some
         sections of a memory block).
      
      4. Memory unplug properly makes sure that all sysfs attributes were
         removed (and therefore, that all threads left the sysfs handlers).  We
         don't have to worry about zombie devices at this point.
      
      5. The valid_section_nr(section_nr) check is actually dead code, as it
         would never have been reached due to the WARN_ON_ONCE(!pfn_valid(pfn)).
      
      No wonder we haven't seen any of these errors in a long time (or even
         ever, according to my search).  Let's just get rid of them.  Now, all
         checks that could hinder onlining and offlining are completely
         contained in online_pages()/offline_pages().
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Link: http://lkml.kernel.org/r/20200127110424.5757-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fada9ae3
    • D
      drivers/base/memory.c: drop section_count · 68c3a6ac
      David Hildenbrand 提交于
      Patch series "mm: drop superfluous section checks when onlining/offlining".
      
      Let's drop some superfluous section checks on the onlining/offlining path.
      
      This patch (of 3):
      
      Since commit c5e79ef5 ("mm/memory_hotplug.c: don't allow to
      online/offline memory blocks with holes") we have a generic check in
      offline_pages() that disallows offlining memory blocks with holes.
      
      Memory blocks with missing sections are just another variant of these type
      of blocks.  We can stop checking (and especially storing) present
      sections.  A proper error message is now printed why offlining failed.
      
      section_count was initially introduced in commit 07681215 ("Driver
      core: Add section count to memory_block struct") in order to detect when
      it is okay to remove a memory block.  It was used in commit 26bbe7ef
      ("drivers/base/memory.c: prohibit offlining of memory blocks with missing
      sections") to disallow offlining memory blocks with missing sections.  As
      we refactored creation/removal of memory devices and have a proper check
      for holes in place, we can drop the section_count.
      
      This also removes a leftover comment regarding the mem_sysfs_mutex, which
      was removed in commit 848e19ad ("drivers/base/memory.c: drop the
      mem_sysfs_mutex").
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Link: http://lkml.kernel.org/r/20200127110424.5757-2-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68c3a6ac
    • P
      userfaultfd: selftests: add write-protect test · 9b12488a
      Peter Xu 提交于
      Add uffd tests for write protection.
      
      Instead of introducing new tests for it, let's simply squashing uffd-wp
      tests into existing uffd-missing test cases.  Changes are:
      
      (1) Bouncing tests
      
        We do the write-protection in two ways during the bouncing test:
      
        - By using UFFDIO_COPY_MODE_WP when resolving MISSING pages: then
          we'll make sure for each bounce process every single page will be
          at least fault twice: once for MISSING, once for WP.
      
        - By direct call UFFDIO_WRITEPROTECT on existing faulted memories:
          To further torture the explicit page protection procedures of
          uffd-wp, we split each bounce procedure into two halves (in the
          background thread): the first half will be MISSING+WP for each
          page as explained above.  After the first half, we write protect
          the faulted region in the background thread to make sure at least
          half of the pages will be write protected again which is the first
          half to test the new UFFDIO_WRITEPROTECT call.  Then we continue
          with the 2nd half, which will contain both MISSING and WP faulting
          tests for the 2nd half and WP-only faults from the 1st half.
      
      (2) Event/Signal test
      
        Mostly previous tests but will do MISSING+WP for each page.  For
        sigbus-mode test we'll need to provide standalone path to handle the
        write protection faults.
      
      For all tests, do statistics as well for uffd-wp pages.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-20-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9b12488a
    • P
      userfaultfd: selftests: refactor statistics · 5c8aed6c
      Peter Xu 提交于
      Introduce uffd_stats structure for statistics of the self test, at the
      same time refactor the code to always pass in the uffd_stats for either
      read() or poll() typed fault handling threads instead of using two
      different ways to return the statistic results.  No functional change.
      
      With the new structure, it's very easy to introduce new statistics.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-19-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c8aed6c
    • P
      userfaultfd: wp: declare _UFFDIO_WRITEPROTECT conditionally · 14819305
      Peter Xu 提交于
      Only declare _UFFDIO_WRITEPROTECT if the user specified
      UFFDIO_REGISTER_MODE_WP and if all the checks passed.  Then when the user
      registers regions with shmem/hugetlbfs we won't expose the new ioctl to
      them.  Even with complete anonymous memory range, we'll only expose the
      new WP ioctl bit if the register mode has MODE_WP.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-18-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      14819305
    • M
      userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update · 57e5d4f2
      Martin Cracauer 提交于
      Add documentation about the write protection support.
      
      [peterx@redhat.com: rewrite in rst format; fixups here and there]
      Signed-off-by: NMartin Cracauer <cracauer@cons.org>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-17-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      57e5d4f2
    • P
      userfaultfd: wp: don't wake up when doing write protect · 23080e27
      Peter Xu 提交于
      It does not make sense to try to wake up any waiting thread when we're
      write-protecting a memory region.  Only wake up when resolving a write
      protected page fault.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-16-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23080e27
    • S
      userfaultfd: wp: enabled write protection in userfaultfd API · e06f1e1d
      Shaohua Li 提交于
      Now it's safe to enable write protection in userfaultfd API
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220163112.11409-15-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e06f1e1d
    • A
      userfaultfd: wp: add the writeprotect API to userfaultfd ioctl · 63b2d417
      Andrea Arcangeli 提交于
      Introduce the new uffd-wp APIs for userspace.
      
      Firstly, we'll allow to do UFFDIO_REGISTER with write protection tracking
      using the new UFFDIO_REGISTER_MODE_WP flag.  Note that this flag can
      co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in which case the
      userspace program can not only resolve missing page faults, and at the
      same time tracking page data changes along the way.
      
      Secondly, we introduced the new UFFDIO_WRITEPROTECT API to do page level
      write protection tracking.  Note that we will need to register the memory
      region with UFFDIO_REGISTER_MODE_WP before that.
      
      [peterx@redhat.com: write up the commit message]
      [peterx@redhat.com: remove useless block, write commit message, check against
       VM_MAYWRITE rather than VM_WRITE when register]
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-14-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      63b2d417
    • S
      userfaultfd: wp: support write protection for userfault vma range · ffd05793
      Shaohua Li 提交于
      Add API to enable/disable writeprotect a vma range.  Unlike mprotect, this
      doesn't split/merge vmas.
      
      [peterx@redhat.com:
       - use the helper to find VMA;
       - return -ENOENT if not found to match mcopy case;
       - use the new MM_CP_UFFD_WP* flags for change_protection
       - check against mmap_changing for failures
       - replace find_dst_vma with vma_find_uffd]
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220163112.11409-13-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ffd05793
    • P
      khugepaged: skip collapse if uffd-wp detected · e1e267c7
      Peter Xu 提交于
      Don't collapse the huge PMD if there is any userfault write protected
      small PTEs.  The problem is that the write protection is in small page
      granularity and there's no way to keep all these write protection
      information if the small pages are going to be merged into a huge PMD.
      
      The same thing needs to be considered for swap entries and migration
      entries.  So do the check as well disregarding khugepaged_max_ptes_swap.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-12-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1e267c7
    • P
      userfaultfd: wp: support swap and page migration · f45ec5ff
      Peter Xu 提交于
      For either swap and page migration, we all use the bit 2 of the entry to
      identify whether this entry is uffd write-protected.  It plays a similar
      role as the existing soft dirty bit in swap entries but only for keeping
      the uffd-wp tracking for a specific PTE/PMD.
      
      Something special here is that when we want to recover the uffd-wp bit
      from a swap/migration entry to the PTE bit we'll also need to take care of
      the _PAGE_RW bit and make sure it's cleared, otherwise even with the
      _PAGE_UFFD_WP bit we can't trap it at all.
      
      In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
      That can lead to data mismatch if the page that we are going to write
      protect is swapped out when sending the UFFDIO_WRITEPROTECT.  This patch
      also applies/removes the uffd-wp bit even for the swap entries.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f45ec5ff
    • P
      userfaultfd: wp: add pmd_swp_*uffd_wp() helpers · 2e3d5dc5
      Peter Xu 提交于
      Adding these missing helpers for uffd-wp operations with pmd
      swap/migration entries.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-10-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e3d5dc5
    • P
      userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork · b569a176
      Peter Xu 提交于
      UFFD_EVENT_FORK support for uffd-wp should be already there, except that
      we should clean the uffd-wp bit if uffd fork event is not enabled.  Detect
      that to avoid _PAGE_UFFD_WP being set even if the VMA is not being tracked
      by VM_UFFD_WP.  Do this for both small PTEs and huge PMDs.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-9-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b569a176
    • P
      userfaultfd: wp: apply _PAGE_UFFD_WP bit · 292924b2
      Peter Xu 提交于
      Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
      change_protection() when used with uffd-wp and make sure the two new flags
      are exclusively used.  Then,
      
        - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
          when a range of memory is write protected by uffd
      
        - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
          _PAGE_RW when write protection is resolved from userspace
      
      And use this new interface in mwriteprotect_range() to replace the old
      MM_CP_DIRTY_ACCT.
      
      Do this change for both PTEs and huge PMDs.  Then we can start to identify
      which PTE/PMD is write protected by general (e.g., COW or soft dirty
      tracking), and which is for userfaultfd-wp.
      
      Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
      into _PAGE_CHG_MASK as well.  Meanwhile, since we have this new bit, we
      can be even more strict when detecting uffd-wp page faults in either
      do_wp_page() or wp_huge_pmd().
      
      After we're with _PAGE_UFFD_WP, a special case is when a page is both
      protected by the general COW logic and also userfault-wp.  Here the
      userfault-wp will have higher priority and will be handled first.  Only
      after the uffd-wp bit is cleared on the PTE/PMD will we continue to handle
      the general COW.  These are the steps on what will happen with such a
      page:
      
        1. CPU accesses write protected shared page (so both protected by
           general COW and uffd-wp), blocked by uffd-wp first because in
           do_wp_page we'll handle uffd-wp first, so it has higher priority
           than general COW.
      
        2. Uffd service thread receives the request, do UFFDIO_WRITEPROTECT
           to remove the uffd-wp bit upon the PTE/PMD.  However here we
           still keep the write bit cleared.  Notify the blocked CPU.
      
        3. The blocked CPU resumes the page fault process with a fault
           retry, during retry it'll notice it was not with the uffd-wp bit
           this time but it is still write protected by general COW, then
           it'll go though the COW path in the fault handler, copy the page,
           apply write bit where necessary, and retry again.
      
        4. The CPU will be able to access this page with write bit set.
      Suggested-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      292924b2
    • P
      mm: merge parameters for change_protection() · 58705444
      Peter Xu 提交于
      change_protection() was used by either the NUMA or mprotect() code,
      there's one parameter for each of the callers (dirty_accountable and
      prot_numa).  Further, these parameters are passed along the calls:
      
        - change_protection_range()
        - change_p4d_range()
        - change_pud_range()
        - change_pmd_range()
        - ...
      
      Now we introduce a flag for change_protect() and all these helpers to
      replace these parameters.  Then we can avoid passing multiple parameters
      multiple times along the way.
      
      More importantly, it'll greatly simplify the work if we want to introduce
      any new parameters to change_protection().  In the follow up patches, a
      new parameter for userfaultfd write protection will be introduced.
      
      No functional change at all.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58705444
    • A
      userfaultfd: wp: add UFFDIO_COPY_MODE_WP · 72981e0e
      Andrea Arcangeli 提交于
      This allows UFFDIO_COPY to map pages write-protected.
      
      [peterx@redhat.com: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets
       around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and
       commit messages]
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-6-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72981e0e
    • A
      userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers · 55adf4de
      Andrea Arcangeli 提交于
      Implement helpers methods to invoke userfaultfd wp faults more
      selectively: not only when a wp fault triggers on a vma with vma->vm_flags
      VM_UFFD_WP set, but only if the _PAGE_UFFD_WP bit is set in the pagetable
      too.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-5-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      55adf4de
    • A
      userfaultfd: wp: add WP pagetable tracking to x86 · 5a281062
      Andrea Arcangeli 提交于
      Accurate userfaultfd WP tracking is possible by tracking exactly which
      virtual memory ranges were writeprotected by userland.  We can't relay
      only on the RW bit of the mapped pagetable because that information is
      destroyed by fork() or KSM or swap.  If we were to relay on that, we'd
      need to stay on the safe side and generate false positive wp faults for
      every swapped out page.
      
      [peterx@redhat.com: append _PAGE_UFD_WP to _PAGE_CHG_MASK]
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-4-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a281062
    • A
      userfaultfd: wp: hook userfault handler to write protection fault · 529b930b
      Andrea Arcangeli 提交于
      There are several cases write protection fault happens.  It could be a
      write to zero page, swaped page or userfault write protected page.  When
      the fault happens, there is no way to know if userfault write protect the
      page before.  Here we just blindly issue a userfault notification for vma
      with VM_UFFD_WP regardless if app write protects it yet.  Application
      should be ready to handle such wp fault.
      
      In the swapin case, always swapin as readonly.  This will cause false
      positive userfaults.  We need to decide later if to eliminate them with a
      flag like soft-dirty in the swap entry (see _PAGE_SWP_SOFT_DIRTY).
      
      hugetlbfs wouldn't need to worry about swapouts but and tmpfs would be
      handled by a swap entry bit like anonymous memory.
      
      The main problem with no easy solution to eliminate the false positives,
      will be if/when userfaultfd is extended to real filesystem pagecache.
      When the pagecache is freed by reclaim we can't leave the radix tree
      pinned if the inode and in turn the radix tree is reclaimed as well.
      
      The estimation is that full accuracy and lack of false positives could be
      easily provided only to anonymous memory (as long as there's no fork or as
      long as MADV_DONTFORK is used on the userfaultfd anonymous range) tmpfs
      and hugetlbfs, it's most certainly worth to achieve it but in a later
      incremental patch.
      
      [peterx@redhat.com: don't conditionally drop FAULT_FLAG_WRITE in do_swap_page]
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-3-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      529b930b
    • S
      userfaultfd: wp: add helper for writeprotect check · 1df319e0
      Shaohua Li 提交于
      Patch series "userfaultfd: write protection support", v6.
      
      Overview
      ========
      
      The uffd-wp work was initialized by Shaohua Li [1], and later continued by
      Andrea [2].  This series is based upon Andrea's latest userfaultfd tree,
      and it is a continuous works from both Shaohua and Andrea.  Many of the
      follow up ideas come from Andrea too.
      
      Besides the old MISSING register mode of userfaultfd, the new uffd-wp
      support provides another alternative register mode called
      UFFDIO_REGISTER_MODE_WP that can be used to listen to not only missing
      page faults but also write protection page faults, or even they can be
      registered together.  At the same time, the new feature also provides a
      new userfaultfd ioctl called UFFDIO_WRITEPROTECT which allows the
      userspace to write protect a range or memory or fixup write permission of
      faulted pages.
      
      Please refer to the document patch "userfaultfd: wp:
      UFFDIO_REGISTER_MODE_WP documentation update" for more information on the
      new interface and what it can do.
      
      The major workflow of an uffd-wp program should be:
      
        1. Register a memory region with WP mode using UFFDIO_REGISTER_MODE_WP
      
        2. Write protect part of the whole registered region using
           UFFDIO_WRITEPROTECT, passing in UFFDIO_WRITEPROTECT_MODE_WP to
           show that we want to write protect the range.
      
        3. Start a working thread that modifies the protected pages,
           meanwhile listening to UFFD messages.
      
        4. When a write is detected upon the protected range, page fault
           happens, a UFFD message will be generated and reported to the
           page fault handling thread
      
        5. The page fault handler thread resolves the page fault using the
           new UFFDIO_WRITEPROTECT ioctl, but this time passing in
           !UFFDIO_WRITEPROTECT_MODE_WP instead showing that we want to
           recover the write permission.  Before this operation, the fault
           handler thread can do anything it wants, e.g., dumps the page to
           a persistent storage.
      
        6. The worker thread will continue running with the correctly
           applied write permission from step 5.
      
      Currently there are already two projects that are based on this new
      userfaultfd feature.
      
      QEMU Live Snapshot: The project provides a way to allow the QEMU
                          hypervisor to take snapshot of VMs without
                          stopping the VM [3].
      
      LLNL umap library:  The project provides a mmap-like interface and
                          "allow to have an application specific buffer of
                          pages cached from a large file, i.e. out-of-core
                          execution using memory map" [4][5].
      
      Before posting the patchset, this series was smoke tested against QEMU
      live snapshot and the LLNL umap library (by doing parallel quicksort using
      128 sorting threads + 80 uffd servicing threads).  My sincere thanks to
      Marty Mcfadden and Denis Plotnikov for the help along the way.
      
      TODO
      ====
      
      - hugetlbfs/shmem support
      - performance
      - more architectures
      - cooperate with mprotect()-allowed processes (???)
      - ...
      
      References
      ==========
      
      [1] https://lwn.net/Articles/666187/
      [2] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/?h=userfault
      [3] https://github.com/denis-plotnikov/qemu/commits/background-snapshot-kvm
      [4] https://github.com/LLNL/umap
      [5] https://llnl-umap.readthedocs.io/en/develop/
      [6] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b245ecf6cf59156966f3da6e6b674f6695a5ffa5
      [7] https://lkml.org/lkml/2018/11/21/370
      [8] https://lkml.org/lkml/2018/12/30/64
      
      This patch (of 19):
      
      Add helper for writeprotect check. Will use it later.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220163112.11409-2-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1df319e0
    • D
      virtio-balloon: switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM · da10329c
      David Hildenbrand 提交于
      Commit 71994620 ("virtio_balloon: replace oom notifier with shrinker")
      changed the behavior when deflation happens automatically.  Instead of
      deflating when called by the OOM handler, the shrinker is used.
      
      However, the balloon is not simply some other slab cache that should be
      shrunk when under memory pressure.  The shrinker does not have a concept
      of priorities yet, so this behavior cannot be configured.  Eventually once
      that is in place, we might want to switch back after doing proper testing.
      
      There was a report that this results in undesired side effects when
      inflating the balloon to shrink the page cache. [1]
      	"When inflating the balloon against page cache (i.e. no free memory
      	 remains) vmscan.c will both shrink page cache, but also invoke the
      	 shrinkers -- including the balloon's shrinker. So the balloon
      	 driver allocates memory which requires reclaim, vmscan gets this
      	 memory by shrinking the balloon, and then the driver adds the
      	 memory back to the balloon. Basically a busy no-op."
      
      The name "deflate on OOM" makes it pretty clear when deflation should
      happen - after other approaches to reclaim memory failed, not while
      reclaiming. This allows to minimize the footprint of a guest - memory
      will only be taken out of the balloon when really needed.
      
      Keep using the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT, because
      this has no such side effects. Always register the shrinker with
      VIRTIO_BALLOON_F_FREE_PAGE_HINT now. We are always allowed to reuse free
      pages that are still to be processed by the guest. The hypervisor takes
      care of identifying and resolving possible races between processing a
      hinting request and the guest reusing a page.
      
      In contrast to pre commit 71994620 ("virtio_balloon: replace oom
      notifier with shrinker"), don't add a module parameter to configure the
      number of pages to deflate on OOM. Can be re-added if really needed.
      Also, pay attention that leak_balloon() returns the number of 4k pages -
      convert it properly in virtio_balloon_oom_notify().
      
      Testing done by Tyler for future reference:
        Test setup: VM with 16 CPU, 64GB RAM. Running Debian 10. We have a 42
        GB file full of random bytes that we continually cat to /dev/null.
        This fills the page cache as the file is read. Meanwhile, we trigger
        the balloon to inflate, with a target size of 53 GB. This setup causes
        the balloon inflation to pressure the page cache as the page cache is
        also trying to grow. Afterwards we shrink the balloon back to zero (so
        total deflate == total inflate).
      
        Without this patch (kernel 4.19.0-5):
        Inflation never reaches the target until we stop the "cat file >
        /dev/null" process. Total inflation time was 542 seconds. The longest
        period that made no net forward progress was 315 seconds.
          Result of "grep balloon /proc/vmstat" after the test:
          balloon_inflate 154828377
          balloon_deflate 154828377
      
        With this patch (kernel 5.6.0-rc4+):
        Total inflation duration was 63 seconds. No deflate-queue activity
        occurs when pressuring the page-cache.
          Result of "grep balloon /proc/vmstat" after the test:
          balloon_inflate 12968539
          balloon_deflate 12968539
      
        Conclusion: This patch fixes the issue.  In the test it reduced
        inflate/deflate activity by 12x, and reduced inflation time by 8.6x.
        But more importantly, if we hadn't killed the "cat file > /dev/null"
        process then, without the patch, the inflation process would never reach
        the target.
      
      [1] https://www.spinics.net/lists/linux-virtualization/msg40863.html
      
      Link: http://lkml.kernel.org/r/20200311135523.18512-2-david@redhat.com
      Fixes: 71994620 ("virtio_balloon: replace oom notifier with shrinker")
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reported-by: NTyler Sanderson <tysand@google.com>
      Tested-by: NTyler Sanderson <tysand@google.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da10329c
    • A
      mm/page_reporting: add free page reporting documentation · 1edca85e
      Alexander Duyck 提交于
      Add documentation for free page reporting.  Currently the only consumer is
      virtio-balloon, however it is possible that other drivers might make use
      of this so it is best to add a bit of documetation explaining at a high
      level how to use the API.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224730.29318.43815.stgit@localhost.localdomainSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1edca85e
    • A
      mm/page_reporting: add budget limit on how many pages can be reported per pass · 43b76f29
      Alexander Duyck 提交于
      In order to keep ourselves from reporting pages that are just going to be
      reused again in the case of heavy churn we can put a limit on how many
      total pages we will process per pass.  Doing this will allow the worker
      thread to go into idle much more quickly so that we avoid competing with
      other threads that might be allocating or freeing pages.
      
      The logic added here will limit the worker thread to no more than one
      sixteenth of the total free pages in a given area per list.  Once that
      limit is reached it will update the state so that at the end of the pass
      we will reschedule the worker to try again in 2 seconds when the memory
      churn has hopefully settled down.
      
      Again this optimization doesn't show much of a benefit in the standard
      case as the memory churn is minmal.  However with page allocator shuffling
      enabled the gain is quite noticeable.  Below are the results with a THP
      enabled version of the will-it-scale page_fault1 test showing the
      improvement in iterations for 16 processes or threads.
      
      Without:
      tasks   processes       processes_idle  threads         threads_idle
      16      8283274.75      0.17            5594261.00      38.15
      
      With:
      tasks   processes       processes_idle  threads         threads_idle
      16      8767010.50      0.21            5791312.75      36.98
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224719.29318.72113.stgit@localhost.localdomainSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      43b76f29
    • A
      mm/page_reporting: rotate reported pages to the tail of the list · 02cf8719
      Alexander Duyck 提交于
      Rather than walking over the same pages again and again to get to the
      pages that have yet to be reported we can save ourselves a significant
      amount of time by simply rotating the list so that when we have a full
      list of reported pages the head of the list is pointing to the next
      non-reported page.  Doing this should save us some significant time when
      processing each free list.
      
      This doesn't gain us much in the standard case as all of the non-reported
      pages should be near the top of the list already.  However in the case of
      page shuffling this results in a noticeable improvement.  Below are the
      will-it-scale page_fault1 w/ THP numbers for 16 tasks with and without
      this patch.
      
      Without:
      tasks   processes       processes_idle  threads         threads_idle
      16      8093776.25      0.17            5393242.00      38.20
      
      With:
      tasks   processes       processes_idle  threads         threads_idle
      16      8283274.75      0.17            5594261.00      38.15
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224708.29318.16862.stgit@localhost.localdomainSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02cf8719
    • A
      virtio-balloon: add support for providing free page reports to host · b0c504f1
      Alexander Duyck 提交于
      Add support for the page reporting feature provided by virtio-balloon.
      Reporting differs from the regular balloon functionality in that is is
      much less durable than a standard memory balloon.  Instead of creating a
      list of pages that cannot be accessed the pages are only inaccessible
      while they are being indicated to the virtio interface.  Once the
      interface has acknowledged them they are placed back into their respective
      free lists and are once again accessible by the guest system.
      
      Unlike a standard balloon we don't inflate and deflate the pages.  Instead
      we perform the reporting, and once the reporting is completed it is
      assumed that the page has been dropped from the guest and will be faulted
      back in the next time the page is accessed.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224657.29318.68624.stgit@localhost.localdomainSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b0c504f1
    • A
      virtio-balloon: pull page poisoning config out of free page hinting · d74b78fa
      Alexander Duyck 提交于
      Currently the page poisoning setting wasn't being enabled unless free page
      hinting was enabled.  However we will need the page poisoning tracking
      logic as well for free page reporting.  As such pull it out and make it a
      separate bit of config in the probe function.
      
      In addition we need to add support for the more recent init_on_free
      feature which expects a behavior similar to page poisoning in that we
      expect the page to be pre-zeroed.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224646.29318.695.stgit@localhost.localdomainSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d74b78fa
    • A
      mm: introduce Reported pages · 36e66c55
      Alexander Duyck 提交于
      In order to pave the way for free page reporting in virtualized
      environments we will need a way to get pages out of the free lists and
      identify those pages after they have been returned.  To accomplish this,
      this patch adds the concept of a Reported Buddy, which is essentially
      meant to just be the Uptodate flag used in conjunction with the Buddy page
      type.
      
      To prevent the reported pages from leaking outside of the buddy lists I
      added a check to clear the PageReported bit in the del_page_from_free_list
      function.  As a result any reported page that is split, merged, or
      allocated will have the flag cleared prior to the PageBuddy value being
      cleared.
      
      The process for reporting pages is fairly simple.  Once we free a page
      that meets the minimum order for page reporting we will schedule a worker
      thread to start 2s or more in the future.  That worker thread will begin
      working from the lowest supported page reporting order up to MAX_ORDER - 1
      pulling unreported pages from the free list and storing them in the
      scatterlist.
      
      When processing each individual free list it is necessary for the worker
      thread to release the zone lock when it needs to stop and report the full
      scatterlist of pages.  To reduce the work of the next iteration the worker
      thread will rotate the free list so that the first unreported page in the
      free list becomes the first entry in the list.
      
      It will then call a reporting function providing information on how many
      entries are in the scatterlist.  Once the function completes it will
      return the pages to the free area from which they were allocated and start
      over pulling more pages from the free areas until there are no longer
      enough pages to report on to keep the worker busy, or we have processed as
      many pages as were contained in the free area when we started processing
      the list.
      
      The worker thread will work in a round-robin fashion making its way though
      each zone requesting reporting, and through each reportable free list
      within that zone.  Once all free areas within the zone have been processed
      it will check to see if there have been any requests for reporting while
      it was processing.  If so it will reschedule the worker thread to start up
      again in roughly 2s and exit.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomainSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      36e66c55
    • A
      mm: add function __putback_isolated_page · 624f58d8
      Alexander Duyck 提交于
      There are cases where we would benefit from avoiding having to go through
      the allocation and free cycle to return an isolated page.
      
      Examples for this might include page poisoning in which we isolate a page
      and then put it back in the free list without ever having actually
      allocated it.
      
      This will enable us to also avoid notifiers for the future free page
      reporting which will need to avoid retriggering page reporting when
      returning pages that have been reported on.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomainSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      624f58d8
    • A
      mm: use zone and order instead of free area in free_list manipulators · 6ab01363
      Alexander Duyck 提交于
      In order to enable the use of the zone from the list manipulator functions
      I will need access to the zone pointer.  As it turns out most of the
      accessors were always just being directly passed &zone->free_area[order]
      anyway so it would make sense to just fold that into the function itself
      and pass the zone and order as arguments instead of the free area.
      
      In order to be able to reference the zone we need to move the declaration
      of the functions down so that we have the zone defined before we define
      the list manipulation functions.  Since the functions are only used in the
      file mm/page_alloc.c we can just move them there to reduce noise in the
      header.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NPankaj Gupta <pagupta@redhat.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224613.29318.43080.stgit@localhost.localdomainSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6ab01363