  1. 20 Jun 2019, 2 commits
  2. 14 Jun 2019, 6 commits
    • mm/devm_memremap_pages: fix final page put race · 50f44ee7
      Committed by Dan Williams
      Logan noticed that devm_memremap_pages_release() kills the percpu_ref,
      drops all the page references that were acquired at init, and then
      immediately proceeds to unplug, via arch_remove_memory(), the backing
      pages for the pagemap.  If for some reason device shutdown actually
      collides with a busy / elevated-ref-count page then arch_remove_memory()
      should be deferred until after that reference is dropped.
      
      As it stands the "wait for last page ref drop" happens *after*
      devm_memremap_pages_release() returns, which is obviously too late and
      can lead to crashes.
      
      Fix this situation by assigning the responsibility to wait for the
      percpu_ref to go idle to devm_memremap_pages() with a new ->cleanup()
      callback.  Implement the new cleanup callback for all
      devm_memremap_pages() users: pmem, devdax, hmm, and p2pdma.
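
      A minimal sketch of how a devm_memremap_pages() user can pair its
      percpu_ref ->release() with the new ->cleanup() callback; the example_*
      names and the completion field are illustrative, not taken verbatim
      from the pmem/devdax/hmm/p2pdma patches:

        /* needs <linux/percpu-refcount.h>, <linux/completion.h>, <linux/memremap.h> */
        struct example_dev {
                struct completion ref_done;
                struct percpu_ref ref;
                struct dev_pagemap pgmap;
        };

        /* percpu_ref ->release(): runs once the last page reference drops */
        static void example_ref_release(struct percpu_ref *ref)
        {
                struct example_dev *edev =
                        container_of(ref, struct example_dev, ref);

                complete(&edev->ref_done);
        }

        /* pgmap ->kill(): start teardown by dropping the init reference */
        static void example_pgmap_kill(struct percpu_ref *ref)
        {
                percpu_ref_kill(ref);
        }

        /* pgmap ->cleanup(): wait for ->release(), then tear down the ref */
        static void example_pgmap_cleanup(struct percpu_ref *ref)
        {
                struct example_dev *edev =
                        container_of(ref, struct example_dev, ref);

                wait_for_completion(&edev->ref_done);
                percpu_ref_exit(ref);
        }

      With a ->cleanup() along these lines, devm_memremap_pages_release() can
      wait for the last page reference before it calls arch_remove_memory().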
      
      Link: http://lkml.kernel.org/r/155727339156.292046.5432007428235387859.stgit@dwillia2-desk3.amr.corp.intel.com
      Fixes: 41e94a85 ("add devm_memremap_pages")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/genalloc: introduce chunk owners · 795ee306
      Committed by Dan Williams
      The p2pdma facility enables a provider to publish a pool of dma
      addresses for a consumer to allocate.  A genpool is used internally by
      p2pdma to collect dma resources, 'chunks', to be handed out to
      consumers.  Whenever a consumer allocates a resource it needs to pin the
      'struct dev_pagemap' instance that backs the chunk selected by
      pci_alloc_p2pmem().
      
      Currently that reference is taken globally on the entire provider
      device.  That sets up a lifetime mismatch whereby the p2pdma core needs
      to maintain hacks to make sure the percpu_ref is not released twice.
      
      This lifetime mismatch also stands in the way of a fix to
      devm_memremap_pages() whereby devm_memremap_pages_release() must wait for
      the percpu_ref ->release() callback to complete before it can proceed to
      teardown pages.
      
      So, towards fixing this situation, introduce the ability to store a 'chunk
      owner' at gen_pool_add() time, and a facility to retrieve the owner at
      gen_pool_{alloc,free}() time.  For p2pdma this will be used to store and
      recall individual dev_pagemap reference counter instances per-chunk.
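
      A rough sketch of how the owner-aware calls can be used; the helper
      names gen_pool_add_owner(), gen_pool_alloc_owner() and
      gen_pool_free_owner() reflect my reading of this series, and the owner
      is assumed here to be the percpu_ref of the backing pagemap:

        /* provider: record which percpu_ref backs this chunk */
        rc = gen_pool_add_owner(pool, (unsigned long)vaddr, phys, size,
                                dev_to_node(dev), pgmap->ref);

        /* consumer: the allocation hands the owner back ... */
        struct percpu_ref *ref;
        void *addr = (void *)gen_pool_alloc_owner(pool, size, (void **)&ref);
        if (addr && !percpu_ref_tryget_live(ref)) {
                /* pagemap already dying, give the chunk back */
                gen_pool_free(pool, (unsigned long)addr, size);
                addr = NULL;
        }

        /* ... and the free path returns it so the matching put is trivial */
        gen_pool_free_owner(pool, (unsigned long)addr, size, (void **)&ref);
        percpu_ref_put(ref);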
      
      Link: http://lkml.kernel.org/r/155727338118.292046.13407378933221579644.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/devm_memremap_pages: introduce devm_memunmap_pages · 2e3f139e
      Committed by Dan Williams
      Use the new devm_release_action() facility to allow
      devm_memremap_pages_release() to be manually triggered.
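
      Roughly, devm_memunmap_pages() then becomes a thin wrapper that fires
      the already-registered release action on demand (a sketch of the idea,
      not necessarily the exact final code):

        void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap)
        {
                /*
                 * Run devm_memremap_pages_release() now instead of waiting
                 * for devres teardown, and drop it from the devres list so
                 * it cannot run twice.
                 */
                devm_release_action(dev, devm_memremap_pages_release, pgmap);
        }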
      
      Link: http://lkml.kernel.org/r/155727337088.292046.5774214552136776763.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/devres: introduce devm_release_action() · 2374b682
      Committed by Dan Williams
      Patch series "mm/devm_memremap_pages: Fix page release race", v2.
      
      Logan audited the devm_memremap_pages() shutdown path and noticed that
      it was possible to proceed to arch_remove_memory() before all potential
      page references have been reaped.
      
      Introduce a new ->cleanup() callback to do the work of waiting for any
      straggling page references and then perform the percpu_ref_exit() in
      devm_memremap_pages_release() context.
      
      For p2pdma this involves some deeper reworks to reference count
      resources on a per-instance basis rather than a per pci-device basis.  A
      modified genalloc api is introduced to convey a driver-private pointer
      through gen_pool_{alloc,free}() interfaces.  Also, a
      devm_memunmap_pages() api is introduced since p2pdma does not
      auto-release resources on a setup failure.
      
      The dax and pmem changes pass the nvdimm unit tests, and the p2pdma
      changes should now pass testing with the pci_p2pdma_release() fix.
      Jérôme, how does this look for HMM?
      
      This patch (of 6):
      
      The devm_add_action() facility allows a resource allocation routine to
      add custom devm semantics.  One such user is devm_memremap_pages().
      
      There is now a need to manually trigger
      devm_memremap_pages_release().  Introduce devm_release_action() so the
      release action can be triggered via a new devm_memunmap_pages() api in a
      follow-on change.
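
      A rough usage sketch contrasting registration with the new early
      trigger; my_teardown and my_data are placeholder names, not from this
      patch:

        /* registration, typically right after the allocation it guards */
        rc = devm_add_action_or_reset(dev, my_teardown, my_data);
        if (rc)
                return rc;

        /*
         * A caller that must tear down early (e.g. devm_memunmap_pages())
         * can run the action immediately; it is also removed from the
         * devres list so it will not run a second time at device release.
         */
        devm_release_action(dev, my_teardown, my_data);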
      
      Link: http://lkml.kernel.org/r/155727336530.292046.2926860263201336366.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • coredump: fix race condition between collapse_huge_page() and core dumping · 59ea6d06
      Committed by Andrea Arcangeli
      When fixing the race conditions between the coredump and the mmap_sem
      holders outside the context of the process, we focused on
      mmget_not_zero()/get_task_mm() callers in 04f5866e ("coredump: fix
      race condition between mmget_not_zero()/get_task_mm() and core
      dumping"), but those aren't the only cases where the mmap_sem can be
      taken outside of the context of the process as Michal Hocko noticed
      while backporting that commit to older -stable kernels.
      
      If mmgrab() is called in the context of the process, but then the
      mm_count reference is transferred outside the context of the process,
      that can also be a problem if the mmap_sem has to be taken for writing
      through that mm_count reference.
      
      khugepaged registration calls mmgrab() in the context of the process,
      but the mmap_sem for writing is taken later in the context of the
      khugepaged kernel thread.
      
      collapse_huge_page(), after taking the mmap_sem for writing, doesn't
      modify any vma, so it's not obvious that it could cause a problem for
      the coredump, but it happens to modify the pmd in a way that breaks an
      invariant that pmd_trans_huge_lock() relies upon.  collapse_huge_page()
      needs the mmap_sem for writing just to block concurrent page faults
      that call pmd_trans_huge_lock().
      
      Specifically the invariant that "!pmd_trans_huge()" cannot become a
      "pmd_trans_huge()" doesn't hold while collapse_huge_page() runs.
      
      The coredump will call __get_user_pages() without mmap_sem for reading,
      which eventually can invoke a lockless page fault which will need a
      functional pmd_trans_huge_lock().
      
      So collapse_huge_page() needs to use mmget_still_valid() to check it's
      not running concurrently with the coredump...  as long as the coredump
      can invoke page faults without holding the mmap_sem for reading.
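
      The shape of the fix is roughly the following check early in
      collapse_huge_page(); placement and the error label are illustrative:

        down_write(&mm->mmap_sem);

        result = SCAN_ANY_PROCESS;
        if (!mmget_still_valid(mm))
                /* a coredump is in flight: don't touch the pmd */
                goto out;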
      
      This has "Fixes: khugepaged" to facilitate backporting, but in my view
      it's more a bug in the coredump code that will eventually have to be
      rewritten to stop invoking page faults without the mmap_sem for reading.
      So the long term plan is still to drop all mmget_still_valid().
      
      Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
      Fixes: ba76149f ("thp: khugepaged")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: don't batch updates of local VM stats and events · 815744d7
      Committed by Johannes Weiner
      The kernel test robot noticed a 26% will-it-scale pagefault regression
      from commit 42a30035 ("mm: memcontrol: fix recursive statistics
      correctness & scalabilty").  This appears to be caused by bouncing the
      additional cachelines from the new hierarchical statistics counters.
      
      We can fix this by getting rid of the batched local counters instead.
      
      Originally, there were *only* group-local counters, and they were fully
      maintained per cpu.  A reader of a stats file high up in the cgroup tree
      would have to walk the entire subtree and collect each level's per-cpu
      counters to get the recursive view.  This was prohibitively expensive,
      and so we switched to per-cpu batched updates of the local counters
      during a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting"), reducing the complexity from nr_subgroups *
      nr_cpus to nr_subgroups.
      
      With growing machines and cgroup trees, the tree walk itself became too
      expensive for monitoring top-level groups, and this is when the culprit
      patch added hierarchy counters on each cgroup level.  When the per-cpu
      batch size would be reached, both the local and the hierarchy counters
      would get batch-updated from the per-cpu delta simultaneously.
      
      This makes local and hierarchical counter reads blazingly fast, but it
      unfortunately makes the write side too cacheline-intensive.
      
      Since local counter reads were never a problem - we only centralized
      them to accelerate the hierarchy walk - and use of the local counters
      is becoming rarer due to replacement with hierarchical views (ongoing
      rework in the page reclaim and workingset code), we can make those local
      counters unbatched per-cpu counters again.
      
      The scheme will then be as follows (a rough code sketch appears after
      this list):
      
         when a memcg statistic changes, the writer will:
         - update the local counter (per-cpu)
         - update the batch counter (per-cpu). If the batch is full:
         - spill the batch into the group's atomic_t
         - spill the batch into all ancestors' atomic_ts
         - empty out the batch counter (per-cpu)
      
         when a local memcg counter is read, the reader will:
         - collect the local counter from all cpus
      
         when a hierarchy memcg counter is read, the reader will:
         - read the atomic_t
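
      A rough sketch of the resulting write path; field and constant names
      such as vmstats_local, vmstats_percpu and MEMCG_CHARGE_BATCH follow the
      memcg code of that period, but treat this as illustrative rather than
      the literal patch:

        void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
        {
                long x;

                if (mem_cgroup_disabled())
                        return;

                /* unbatched local counter: cheap write, summed on read */
                __this_cpu_add(memcg->vmstats_local->stat[idx], val);

                /* batched hierarchy counters: spill to atomics when full */
                x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
                if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
                        struct mem_cgroup *mi;

                        for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
                                atomic_long_add(x, &mi->vmstats[idx]);
                        x = 0;
                }
                __this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
        }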
      
      We might be able to simplify this further and make the recursive
      counters unbatched per-cpu counters as well (batch upward propagation,
      but leave per-cpu collection to the readers), but that will require a
      more in-depth analysis and testing of all the callsites.  Deal with the
      immediate regression for now.
      
      Link: http://lkml.kernel.org/r/20190521151647.GB2870@cmpxchg.org
      Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Tested-by: kernel test robot <rong.a.chen@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 10 Jun 2019, 1 commit
  4. 08 Jun 2019, 1 commit
  5. 07 Jun 2019, 1 commit
  6. 06 Jun 2019, 1 commit
    • ipv6: fix the check before getting the cookie in rt6_get_cookie · b7999b07
      Committed by Xin Long
      In Jianlin's testing, netperf was broken with 'Connection reset by peer',
      as the cookie check failed in rt6_check() and ip6_dst_check() always
      returned NULL.
      
      It's caused by commit 93531c67 ("net/ipv6: separate handling of FIB
      entries from dst based routes"), where the cookie can only be obtained
      when 'c1' (see below) holds at the time dst_cookie is set, whereas
      rt6_check() is called to check dst_cookie when !'c1' holds, as can be
      seen in ip6_dst_check().

      Since in ip6_dst_check() both rt6_dst_from_check() (c1) and rt6_check()
      (!c1) check the 'from' cookie, this patch removes the c1 check in
      rt6_get_cookie(), so that dst_cookie can always be set properly.
      
      c1:
        (rt->rt6i_flags & RTF_PCPU || unlikely(!list_empty(&rt->rt6i_uncached)))
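
      After the change, rt6_get_cookie() unconditionally fetches the cookie
      from the 'from' fib6_info, roughly along these lines (a sketch of the
      intended result, not a verbatim copy of the patch):

        static inline u32 rt6_get_cookie(const struct rt6_info *rt)
        {
                struct fib6_info *from;
                u32 cookie = 0;

                /* no more 'c1' gate: always read the cookie from rt->from */
                rcu_read_lock();
                from = rcu_dereference(rt->from);
                if (from)
                        fib6_get_cookie_safe(from, &cookie);
                rcu_read_unlock();

                return cookie;
        }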
      
      Fixes: 93531c67 ("net/ipv6: separate handling of FIB entries from dst based routes")
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 05 Jun 2019, 28 commits