1. 19 Oct, 2021 5 commits
    • mm/migrate: fix CPUHP state to update node demotion order · a6a0251c
      Committed by Huang Ying
      The node demotion order needs to be updated during CPU hotplug, because
      whether a NUMA node has CPUs may influence the demotion order.  The
      update function should be called during CPU online/offline after the
      node_states[N_CPU] has been updated.  That is done in
      CPUHP_AP_ONLINE_DYN during CPU online and in CPUHP_MM_VMSTAT_DEAD during
      CPU offline.  But in commit 884a6e5d ("mm/migrate: update node
      demotion order on hotplug events"), the function to update node demotion
      order is called in CPUHP_AP_ONLINE_DYN during CPU online/offline.  This
      doesn't satisfy the order requirement.
      
      For example, with 4 CPUs (P0, P1, P2, P3) in 2 sockets (P0, P1 in S0
      and P2, P3 in S1), the demotion order is
      
       - S0 -> NUMA_NO_NODE
       - S1 -> NUMA_NO_NODE
      
      After P2 and P3 are offlined, because S1 has no CPU now, the demotion
      order should have been changed to
      
       - S0 -> S1
       - S1 -> NUMA_NO_NODE
      
      but it isn't changed, because the order updating callback for CPU
      hotplug doesn't see the new nodemask.  After that, if P1 is offlined,
      the demotion order is changed to the expected order as above.
      
      So this patch adds CPUHP_AP_MM_DEMOTION_ONLINE and
      CPUHP_MM_DEMOTION_DEAD, to be called after CPUHP_AP_ONLINE_DYN and
      CPUHP_MM_VMSTAT_DEAD during CPU online and offline, and registers the
      update function on them.
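
The ordering requirement described above can be illustrated with a small user-space model. This is only a sketch, not kernel code: every `order_*` name is hypothetical, a single boolean stands in for one node's bit in node_states[N_CPU], and the two steps stand in for the callback that clears that bit and the callback that recomputes the demotion order.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy state: one node's "has CPUs" bit plus what the demotion update saw. */
struct order_state {
    bool node_has_cpu;       /* stands in for the node's bit in node_states[N_CPU] */
    bool seen_by_demotion;   /* value observed by the demotion order update */
};

typedef void (*order_step_fn)(struct order_state *);

/* Stand-in for the offline step that clears the node's N_CPU bit. */
static void order_clear_node_cpu(struct order_state *s)
{
    s->node_has_cpu = false;
}

/* Stand-in for the demotion order update: it only records what it observed. */
static void order_update_demotion(struct order_state *s)
{
    s->seen_by_demotion = s->node_has_cpu;
}

/* Run the registered steps strictly in the given order. */
static void order_run_steps(struct order_state *s,
                            const order_step_fn *steps, size_t n)
{
    for (size_t i = 0; i < n; i++)
        steps[i](s);
}

/*
 * Offline the last CPU of a node and report whether the demotion update
 * still believed the node had CPUs.
 */
static bool order_offline_last_cpu(bool demotion_runs_first)
{
    struct order_state s = { .node_has_cpu = true, .seen_by_demotion = true };
    const order_step_fn stale[] = { order_update_demotion, order_clear_node_cpu };
    const order_step_fn fresh[] = { order_clear_node_cpu, order_update_demotion };

    order_run_steps(&s, demotion_runs_first ? stale : fresh, 2);
    return s.seen_by_demotion;
}
```

In this model, running the update first leaves it with a stale view of the node, while running it after the mask update lets it observe that the node no longer has CPUs.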
      
      Link: https://lkml.kernel.org/r/20210929060351.7293-1-ying.huang@intel.com
      Fixes: 884a6e5d ("mm/migrate: update node demotion order on hotplug events")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: add CPU hotplug to demotion #ifdef · 76af6a05
      Committed by Dave Hansen
      Once upon a time, the node demotion updates were driven solely by memory
      hotplug events.  But now, there are handlers for both CPU and memory
      hotplug.
      
      However, the #ifdef around the code checks only memory hotplug.  A
      system that has HOTPLUG_CPU=y but MEMORY_HOTPLUG=n would miss CPU
      hotplug events.
      
      Update the #ifdef around the common code.  Add memory and CPU-specific
      #ifdefs for their handlers.  These memory/CPU #ifdefs avoid unused
      function warnings when their Kconfig option is off.
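
The guard layout described above might look roughly like the following user-space sketch. The `GUARD_*` macros are hypothetical stand-ins for the Kconfig options, the `guard_*` functions are invented for illustration, and the exact condition on the outer guard is an assumption rather than the condition used in the patch.

```c
/* Hypothetical stand-ins for Kconfig options: CPU hotplug on, memory hotplug off. */
#define GUARD_HOTPLUG_CPU 1
/* #define GUARD_MEMORY_HOTPLUG 1 */

#if defined(GUARD_HOTPLUG_CPU) || defined(GUARD_MEMORY_HOTPLUG)
/* Common code: built when at least one kind of hotplug is enabled. */
static int guard_update_calls;

static void guard_update_order(void)
{
    guard_update_calls++;
}

#ifdef GUARD_MEMORY_HOTPLUG
/* Memory-specific handler: compiled out when memory hotplug is off. */
static void guard_memory_handler(void)
{
    guard_update_order();
}
#endif

#ifdef GUARD_HOTPLUG_CPU
/* CPU-specific handler: compiled out when CPU hotplug is off. */
static void guard_cpu_handler(void)
{
    guard_update_order();
}
#endif
#endif /* GUARD_HOTPLUG_CPU || GUARD_MEMORY_HOTPLUG */

/* Fire one CPU hotplug event; returns how many order updates have run. */
static int guard_fire_cpu_event(void)
{
#ifdef GUARD_HOTPLUG_CPU
    guard_cpu_handler();
    return guard_update_calls;
#else
    return 0;
#endif
}
```

With only the CPU option defined, the CPU handler still reaches the common update code, and no unused memory handler is compiled.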
      
      [arnd@arndb.de: rework hotplug_memory_notifier() stub]
        Link: https://lkml.kernel.org/r/20211013144029.2154629-1-arnd@kernel.org
      
      Link: https://lkml.kernel.org/r/20210924161255.E5FE8F7E@davehans-spike.ostc.intel.com
      Fixes: 884a6e5d ("mm/migrate: update node demotion order on hotplug events")
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: optimize hotplug-time demotion order updates · 295be91f
      Committed by Dave Hansen
      Patch series "mm/migrate: 5.15 fixes for automatic demotion", v2.
      
      This contains two fixes for the "automatic demotion" code which was
      merged into 5.15:
      
       * Fix memory hotplug performance regression by suppressing
         any real action on irrelevant hotplug events.
      
       * Ensure CPU hotplug handler is registered when memory hotplug
         is disabled.
      
      This patch (of 2):
      
      == tl;dr ==
      
      Automatic demotion opted for a simple, lazy approach to handling hotplug
      events.  This noticeably slows down memory hotplug[1].  Optimize away
      updates to the demotion order when memory hotplug events should have no
      effect.
      
      This has no effect on CPU hotplug.  There is no known problem on the CPU
      side and any work there will be in a separate series.
      
      == Background ==
      
      Automatic demotion is a memory migration strategy to ensure that new
      allocations have room in faster memory tiers on tiered memory systems.
      The kernel maintains an array (node_demotion[]) to drive these
      migrations.
      
      The node_demotion[] path is calculated by starting at nodes with CPUs
      and then "walking" to nodes with memory.  Only hotplug events which
      online or offline a node with memory (N_ONLINE) or CPUs (N_CPU) will
      actually affect the migration order.
      
      == Problem ==
      
      However, the current code is lazy.  It completely regenerates the
      migration order on *any* CPU or memory hotplug event.  The logic was
      that these events are extremely rare and that the overhead from
      indiscriminate order regeneration is minimal.
      
      Part of the update logic involves a synchronize_rcu(), which is a pretty
      big hammer.  Its overhead was large enough to be detected by some 0day
      tests that watch memory hotplug performance[1].
      
      == Solution ==
      
      Add a new helper (node_demotion_topo_changed()) which can differentiate
      between superfluous and impactful hotplug events.  Skip the expensive
      update operation for superfluous events.
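
The idea of skipping superfluous events can be sketched in user space as follows. This is a simplified model and not the kernel helper: the `topo_*` names are hypothetical, plain 64-bit masks stand in for the node state masks, and a counter stands in for the expensive rebuild that includes synchronize_rcu().

```c
#include <stdbool.h>
#include <stdint.h>

/* Snapshot of the two node sets that influence the demotion order. */
struct topo_snapshot {
    uint64_t nodes_with_memory;  /* one bit per node that has memory */
    uint64_t nodes_with_cpu;     /* one bit per node that has CPUs */
};

static struct topo_snapshot topo_prev;  /* state seen at the previous event */
static unsigned int topo_rebuilds;      /* counts expensive order rebuilds */

/* Return true only if the event changed a set that the order depends on. */
static bool topo_changed(const struct topo_snapshot *now)
{
    bool changed = now->nodes_with_memory != topo_prev.nodes_with_memory ||
                   now->nodes_with_cpu != topo_prev.nodes_with_cpu;

    topo_prev = *now;
    return changed;
}

/* Hotplug event handler: skip the rebuild for superfluous events. */
static void topo_hotplug_event(const struct topo_snapshot *now)
{
    if (!topo_changed(now))
        return;
    topo_rebuilds++;  /* stand-in for regenerating the demotion order */
}

/* Replay four events, two of which change nothing; return the rebuild count. */
static unsigned int topo_demo(void)
{
    struct topo_snapshot t = { .nodes_with_memory = 0x3, .nodes_with_cpu = 0x1 };

    topo_prev = (struct topo_snapshot){ 0 };
    topo_rebuilds = 0;

    topo_hotplug_event(&t);      /* differs from the empty state: rebuild */
    topo_hotplug_event(&t);      /* identical: skipped */
    t.nodes_with_memory = 0x7;   /* node 2 gains memory */
    topo_hotplug_event(&t);      /* rebuild */
    topo_hotplug_event(&t);      /* identical: skipped */
    return topo_rebuilds;
}
```

The model relies on events being serialized, since it keeps the previous snapshot in unprotected static state.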
      
      == Aside: Locking ==
      
      It took me a few moments to declare the locking to be safe enough for
      node_demotion_topo_changed() to work.  It all hinges on the memory
      hotplug lock:
      
      During memory hotplug events, 'mem_hotplug_lock' is held for write.
      This ensures that two memory hotplug events can not be called
      simultaneously.
      
      CPU hotplug has a similar lock (cpuhp_state_mutex) which also provides
      mutual exclusion between CPU hotplug events.  In addition, the demotion
      code acquires and holds the mem_hotplug_lock for read during its CPU
      hotplug handlers.  This provides mutual exclusion between the demotion
      memory hotplug callbacks and the CPU hotplug callbacks.
      
      This effectively allows the migration target generation code to be
      treated as if it is single-threaded.
      
      [1] https://lore.kernel.org/all/20210905135932.GE15026@xsang-OptiPlex-9020/
      
      Link: https://lkml.kernel.org/r/20210924161251.093CCD06@davehans-spike.ostc.intel.com
      Link: https://lkml.kernel.org/r/20210924161253.D7673E31@davehans-spike.ostc.intel.com
      Fixes: 884a6e5d ("mm/migrate: update node demotion order on hotplug events")
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: fix a race between writeprotect and exit_mmap() · cb185d5f
      Committed by Nadav Amit
      A race is possible when a process exits, its VMAs are removed by
      exit_mmap() and at the same time userfaultfd_writeprotect() is called.
      
      The race was detected by KASAN on a development kernel, but it appears
      to be possible on vanilla kernels as well.
      
      Use mmget_not_zero() to prevent the race as done in other userfaultfd
      operations.
      
      Link: https://lkml.kernel.org/r/20210921200247.25749-1-namit@vmware.com
      Fixes: 63b2d417 ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Tested-by: Li Wang <liwang@redhat.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/userfaultfd: selftests: fix memory corruption with thp enabled · 8913970c
      Committed by Peter Xu
      In RHEL's gating selftests we've encountered memory corruption in the
      uffd event test even with the upstream kernel:
      
              # ./userfaultfd anon 128 4
              nr_pages: 32768, nr_pages_per_cpu: 32768
              bounces: 3, mode: rnd racing read, userfaults: 6240 missing (6240) 14729 wp (14729)
              bounces: 2, mode: racing read, userfaults: 1444 missing (1444) 28877 wp (28877)
              bounces: 1, mode: rnd read, userfaults: 6055 missing (6055) 14699 wp (14699)
              bounces: 0, mode: read, userfaults: 82 missing (82) 25196 wp (25196)
              testing uffd-wp with pagemap (pgsize=4096): done
              testing uffd-wp with pagemap (pgsize=2097152): done
              testing events (fork, remap, remove): ERROR: nr 32427 memory corruption 0 1 (errno=0, line=963)
              ERROR: faulting process failed (errno=0, line=1117)
      
      It can be easily reproduced when THP is enabled globally, which is the
      default for RHEL.
      
      It's also known as a side effect of commit 0db282ba ("selftest: use
      mmap instead of posix_memalign to allocate memory", 2021-07-23), which
      is IMHO right in itself to use mmap(), making sure the addresses will be
      untagged even on arm.
      
      The problem is, for each test we allocate buffers using two
      allocate_area() calls.  We assumed these two buffers won't affect each
      other, however they could, because mmap() could have placed the two
      buffers near each other with the same VMA flags, so they got
      merged into one VMA.
      
      It won't be a big problem if THP is not enabled, but when THP is
      aggressively enabled, initializing the src buffer could
      accidentally set up part of the dest buffer too when there's a shared THP
      that overlaps the two regions.  Then some of the dest buffer won't be
      able to be trapped by userfaultfd missing mode, which causes the memory
      corruption described above.
      
      To fix it, do release_pages() after initializing the src buffer.
      
      Since the previous two release_pages() calls are after
      uffd_test_ctx_clear(), which will unmap all the buffers anyway (which is
      stronger than releasing pages, as unmap() also tears down pgtables), drop
      them as they don't really do anything useful.
      
      We can mark the Fixes tag upon 0db282ba as it's reported to only
      happen there, however the real "Fixes" IMHO should be 8ba6e864, as
      before that commit we'll always do explicit release_pages() before
      registration of uffd, and 8ba6e864 changed that logic by adding
      extra unmap/map and we didn't release the pages at the right place.
      Meanwhile I don't have a solid clue anyway on whether posix_memalign()
      could always avoid triggering this bug, hence it's safer to attach this
      fix to commit 8ba6e864.
      
      Link: https://lkml.kernel.org/r/20210923232512.210092-1-peterx@redhat.com
      Fixes: 8ba6e864 ("userfaultfd/selftests: reinitialize test context in each test")
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1994931
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reported-by: Li Wang <liwan@redhat.com>
      Tested-by: Li Wang <liwang@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 18 Oct, 2021 17 commits
  3. 17 Oct, 2021 13 commits
  4. 16 Oct, 2021 5 commits