1. 11 8月, 2017 1 次提交
    • M
      mm: fix KSM data corruption · b3a81d08
      Minchan Kim 提交于
      Nadav reported KSM can corrupt the user data by the TLB batching
      race[1].  That means data user written can be lost.
      
      Quote from Nadav Amit:
       "For this race we need 4 CPUs:
      
        CPU0: Caches a writable and dirty PTE entry, and uses the stale value
        for write later.
      
        CPU1: Runs madvise_free on the range that includes the PTE. It would
        clear the dirty-bit. It batches TLB flushes.
      
        CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty.
        We care about the fact that it clears the PTE write-bit, and of
        course, batches TLB flushes.
      
        CPU3: Runs KSM. Our purpose is to pass the following test in
        write_protect_page():
      
      	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
      	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))
      
        Since it will avoid TLB flush. And we want to do it while the PTE is
        stale. Later, and before replacing the page, we would be able to
        change the page.
      
        Note that all the operations the CPU1-3 perform canhappen in parallel
        since they only acquire mmap_sem for read.
      
        We start with two identical pages. Everything below regards the same
        page/PTE.
      
        CPU0        CPU1        CPU2        CPU3
        ----        ----        ----        ----
        Write the same
        value on page
      
        [cache PTE as
         dirty in TLB]
      
                    MADV_FREE
                    pte_mkclean()
      
                                4 > clear_refs
                                pte_wrprotect()
      
                                            write_protect_page()
                                            [ success, no flush ]
      
                                            pages_indentical()
                                            [ ok ]
      
        Write to page
        different value
      
        [Ok, using stale
         PTE]
      
                                            replace_page()
      
        Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late.
        CPU0 already wrote on the page, but KSM ignored this write, and it got
        lost"
      
      In above scenario, MADV_FREE is fixed by changing TLB batching API
      including [set|clear]_tlb_flush_pending.  Remained thing is soft-dirty
      part.
      
      This patch changes soft-dirty uses TLB batching API instead of
      flush_tlb_mm and KSM checks pending TLB flush by using
      mm_tlb_flush_pending so that it will flush TLB to avoid data lost if
      there are other parallel threads pending TLB flush.
      
      [1] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-8-namit@vmware.comSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NNadav Amit <namit@vmware.com>
      Reported-by: NNadav Amit <namit@vmware.com>
      Tested-by: NNadav Amit <namit@vmware.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3a81d08
  2. 07 7月, 2017 5 次提交
    • A
      ksm: optimize refile of stable_node_dup at the head of the chain · 80b18dfa
      Andrea Arcangeli 提交于
      If a candidate stable_node_dup has been found and it can accept further
      merges it can be refiled to the head of the list to speedup next
      searches without altering which dup is found and how the dups accumulate
      in the chain.
      
      We already refiled it back to the head in the prune_stale_stable_nodes
      case, but we didn't refile it if not pruning (which is more common).
      And we also refiled it when it was already at the head which is
      unnecessary (in the prune_stale_stable_nodes case, nr > 1 means there's
      more than one dup in the chain, it doesn't mean it's not already at the
      head of the chain).
      
      The stable_node_chain list is single threaded and there's no SMP locking
      contention so it should be faster to refile it to the head of the list
      also if prune_stale_stable_nodes is false.
      
      Profiling shows the refile happens 1.9% of the time when a dup is found
      with a max_page_sharing limit setting of 3 (with max_page_sharing of 2
      the refile never happens of course as there's never space for one more
      merge) which is reasonably low.  At higher max_page_sharing values it
      should be much less frequent.
      
      This is just an optimization.
      
      Link: http://lkml.kernel.org/r/20170518173721.22316-4-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Evgheni Dereveanchin <ederevea@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      80b18dfa
    • A
      ksm: swap the two output parameters of chain/chain_prune · 8dc5ffcd
      Andrea Arcangeli 提交于
      Some static checker complains if chain/chain_prune returns a potentially
      stale pointer.
      
      There are two output parameters to chain/chain_prune, one is tree_page
      the other is stable_node_dup.  Like in get_ksm_page the caller has to
      check tree_page is NULL before touching the stable_node.  Similarly in
      chain/chain_prune the caller has to check tree_page before touching the
      stable_node_dup returned or the original stable_node passed as
      parameter.
      
      Because the tree_page is never returned as a stale pointer, it may be
      more intuitive to return tree_page and to pass stable_node_dup for
      reference instead of the reverse.
      
      This patch purely swaps the two output parameters of chain/chain_prune
      as a cleanup for the static checker and to mimic the get_ksm_page
      behavior more closely.  There's no change to the caller at all except
      the swap, it's purely a cleanup and it is a noop from the caller point
      of view.
      
      Link: http://lkml.kernel.org/r/20170518173721.22316-3-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Tested-by: NDan Carpenter <dan.carpenter@oracle.com>
      Cc: Evgheni Dereveanchin <ederevea@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8dc5ffcd
    • A
      ksm: cleanup stable_node chain collapse case · 0ba1d0f7
      Andrea Arcangeli 提交于
      Patch series "KSMscale cleanup/optimizations".
      
      There are no fixes here it's just minor cleanups and optimizations.
      
      1/3 removes makes the "fix" for the stale stable_node fall in the
          standard case without introducing new cases.  Setting stable_node to
          NULL was marginally safer, but stale pointer is still wiped from the
          caller, this looks cleaner.
      
      2/3 should fix the false positive from Dan's static checker.
      
      3/3 is a microoptimization to apply the the refile of future merge
          candidate dups at the head of the chain in all cases and to skip it in
          one case where we did it and but it was a noop (to avoid checking if
          it was already at the head but now we've to check it anyway so it got
          optimized away).
      
      This patch (of 3):
      
      When the stable_node chain is collapsed we can as well set the caller
      stable_node to match the returned stable_node_dup in chain_prune().
      
      This way the collapse case becomes indistinguishable from the regular
      stable_node case and we can remove two branches from the KSM page
      migration handling slow paths.
      
      While it was all correct this looks cleaner (and faster) as the caller has
      to deal with fewer special cases.
      
      Link: http://lkml.kernel.org/r/20170518173721.22316-2-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Evgheni Dereveanchin <ederevea@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ba1d0f7
    • A
      ksm: fix use after free with merge_across_nodes = 0 · b4fecc67
      Andrea Arcangeli 提交于
      If merge_across_nodes was manually set to 0 (not the default value) by
      the admin or a tuned profile on NUMA systems triggering cross-NODE page
      migrations, a stable_node use after free could materialize.
      
      If the chain is collapsed stable_node would point to the old chain that
      was already freed.  stable_node_dup would be the stable_node dup now
      converted to a regular stable_node and indexed in the rbtree in
      replacement of the freed stable_node chain (not anymore a dup).
      
      This special case where the chain is collapsed in the NUMA replacement
      path, is now detected by setting stable_node to NULL by the chain_prune
      callee if it decides to collapse the chain.  This tells the NUMA
      replacement code that even if stable_node and stable_node_dup are
      different, this is not a chain if stable_node is NULL, as the
      stable_node_dup was converted to a regular stable_node and the chain was
      collapsed.
      
      It is generally safer for the callee to force the caller stable_node to
      NULL the moment it become stale so any other mistake like this would
      result in an instant Oops easier to debug than an use after free.
      
      Otherwise the replace logic would act like if stable_node was a valid
      chain, when in fact it was freed.  Notably
      stable_node_chain_add_dup(page_node, stable_node) would run on a stable
      stable_node.
      
      Andrey Ryabinin found the source of the use after free in chain_prune().
      
      Link: http://lkml.kernel.org/r/20170512193805.8807-2-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: NEvgheni Dereveanchin <ederevea@redhat.com>
      Tested-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4fecc67
    • A
      ksm: introduce ksm_max_page_sharing per page deduplication limit · 2c653d0e
      Andrea Arcangeli 提交于
      Without a max deduplication limit for each KSM page, the list of the
      rmap_items associated to each stable_node can grow infinitely large.
      
      During the rmap walk each entry can take up to ~10usec to process
      because of IPIs for the TLB flushing (both for the primary MMU and the
      secondary MMUs with the MMU notifier).  With only 16GB of address space
      shared in the same KSM page, that would amount to dozens of seconds of
      kernel runtime.
      
      A ~256 max deduplication factor will reduce the latencies of the rmap
      walks on KSM pages to order of a few msec.  Just doing the
      cond_resched() during the rmap walks is not enough, the list size must
      have a limit too, otherwise the caller could get blocked in (schedule
      friendly) kernel computations for seconds, unexpectedly.
      
      There's room for optimization to significantly reduce the IPI delivery
      cost during the page_referenced(), but at least for page_migration in
      the KSM case (used by hard NUMA bindings, compaction and NUMA balancing)
      it may be inevitable to send lots of IPIs if each rmap_item->mm is
      active on a different CPU and there are lots of CPUs.  Even if we ignore
      the IPI delivery cost, we've still to walk the whole KSM rmap list, so
      we can't allow millions or billions (ulimited) number of entries in the
      KSM stable_node rmap_item lists.
      
      The limit is enforced efficiently by adding a second dimension to the
      stable rbtree.  So there are three types of stable_nodes: the regular
      ones (identical as before, living in the first flat dimension of the
      stable rbtree), the "chains" and the "dups".
      
      Every "chain" and all "dups" linked into a "chain" enforce the invariant
      that they represent the same write protected memory content, even if
      each "dup" will be pointed by a different KSM page copy of that content.
      This way the stable rbtree lookup computational complexity is unaffected
      if compared to an unlimited max_sharing_limit.  It is still enforced
      that there cannot be KSM page content duplicates in the stable rbtree
      itself.
      
      Adding the second dimension to the stable rbtree only after the
      max_page_sharing limit hits, provides for a zero memory footprint
      increase on 64bit archs.  The memory overhead of the per-KSM page
      stable_tree and per virtual mapping rmap_item is unchanged.  Only after
      the max_page_sharing limit hits, we need to allocate a stable_tree
      "chain" and rb_replace() the "regular" stable_node with the newly
      allocated stable_node "chain".  After that we simply add the "regular"
      stable_node to the chain as a stable_node "dup" by linking hlist_dup in
      the stable_node_chain->hlist.  This way the "regular" (flat) stable_node
      is converted to a stable_node "dup" living in the second dimension of
      the stable rbtree.
      
      During stable rbtree lookups the stable_node "chain" is identified as
      stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
      is_stable_node_chain()).
      
      When dropping stable_nodes, the stable_node "dup" is identified as
      stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).
      
      The STABLE_NODE_DUP_HEAD must be an unique valid pointer never used
      elsewhere in any stable_node->head/node to avoid a clashes with the
      stable_node->node.rb_parent_color pointer, and different from
      &migrate_nodes.  So the second field of &migrate_nodes is picked and
      verified as always safe with a BUILD_BUG_ON in case the list_head
      implementation changes in the future.
      
      The STABLE_NODE_DUP is picked as a random negative value in
      stable_node->rmap_hlist_len.  rmap_hlist_len cannot become negative when
      it's a "regular" stable_node or a stable_node "dup".
      
      The stable_node_chain->nid is irrelevant.  The stable_node_chain->kpfn
      is aliased in a union with a time field used to rate limit the
      stable_node_chain->hlist prunes.
      
      The garbage collection of the stable_node_chain happens lazily during
      stable rbtree lookups (as for all other kind of stable_nodes), or while
      disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the
      entire stable rbtree.
      
      While the "regular" stable_nodes and the stable_node "dups" must wait
      for their underlying tree_page to be freed before they can be freed
      themselves, the stable_node "chains" can be freed immediately if the
      stable_node->hlist turns empty.  This is because the "chains" are never
      pointed by any page->mapping and they're effectively stable rbtree KSM
      self contained metadata.
      
      [akpm@linux-foundation.org: fix non-NUMA build]
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Tested-by: NPetr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Evgheni Dereveanchin <ederevea@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2c653d0e
  3. 03 6月, 2017 1 次提交
    • A
      ksm: prevent crash after write_protect_page fails · a7306c34
      Andrea Arcangeli 提交于
      "err" needs to be left set to -EFAULT if split_huge_page succeeds.
      Otherwise if "err" gets clobbered with zero and write_protect_page
      fails, try_to_merge_one_page() will succeed instead of returning -EFAULT
      and then try_to_merge_with_ksm_page() will continue thinking kpage is a
      PageKsm when in fact it's still an anonymous page.  Eventually it'll
      crash in page_add_anon_rmap.
      
      This has been reproduced on Fedora25 kernel but I can reproduce with
      upstream too.
      
      The bug was introduced in commit f765f540 ("ksm: prepare to new THP
      semantics") introduced in v4.5.
      
          page:fffff67546ce1cc0 count:4 mapcount:2 mapping:ffffa094551e36e1 index:0x7f0f46673
          flags: 0x2ffffc0004007c(referenced|uptodate|dirty|lru|active|swapbacked)
          page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
          page->mem_cgroup:ffffa09674bf0000
          ------------[ cut here ]------------
          kernel BUG at mm/rmap.c:1222!
          CPU: 1 PID: 76 Comm: ksmd Not tainted 4.9.3-200.fc25.x86_64 #1
          RIP: do_page_add_anon_rmap+0x1c4/0x240
          Call Trace:
            page_add_anon_rmap+0x18/0x20
            try_to_merge_with_ksm_page+0x50b/0x780
            ksm_scan_thread+0x1211/0x1410
            ? prepare_to_wait_event+0x100/0x100
            ? try_to_merge_with_ksm_page+0x780/0x780
            kthread+0xd9/0xf0
            ? kthread_park+0x60/0x60
            ret_from_fork+0x25/0x30
      
      Fixes: f765f540 ("ksm: prepare to new THP semantics")
      Link: http://lkml.kernel.org/r/20170513131040.21732-1-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NFederico Simoncelli <fsimonce@redhat.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a7306c34
  4. 04 5月, 2017 2 次提交
  5. 02 3月, 2017 2 次提交
  6. 28 2月, 2017 1 次提交
  7. 25 2月, 2017 3 次提交
  8. 08 10月, 2016 1 次提交
  9. 29 9月, 2016 1 次提交
    • Z
      mm,ksm: fix endless looping in allocating memory when ksm enable · 5b398e41
      zhong jiang 提交于
      I hit the following hung task when runing a OOM LTP test case with 4.1
      kernel.
      
      Call trace:
      [<ffffffc000086a88>] __switch_to+0x74/0x8c
      [<ffffffc000a1bae0>] __schedule+0x23c/0x7bc
      [<ffffffc000a1c09c>] schedule+0x3c/0x94
      [<ffffffc000a1eb84>] rwsem_down_write_failed+0x214/0x350
      [<ffffffc000a1e32c>] down_write+0x64/0x80
      [<ffffffc00021f794>] __ksm_exit+0x90/0x19c
      [<ffffffc0000be650>] mmput+0x118/0x11c
      [<ffffffc0000c3ec4>] do_exit+0x2dc/0xa74
      [<ffffffc0000c46f8>] do_group_exit+0x4c/0xe4
      [<ffffffc0000d0f34>] get_signal+0x444/0x5e0
      [<ffffffc000089fcc>] do_signal+0x1d8/0x450
      [<ffffffc00008a35c>] do_notify_resume+0x70/0x78
      
      The oom victim cannot terminate because it needs to take mmap_sem for
      write while the lock is held by ksmd for read which loops in the page
      allocator
      
      ksm_do_scan
      	scan_get_next_rmap_item
      		down_read
      		get_next_rmap_item
      			alloc_rmap_item   #ksmd will loop permanently.
      
      There is no way forward because the oom victim cannot release any memory
      in 4.1 based kernel.  Since 4.6 we have the oom reaper which would solve
      this problem because it would release the memory asynchronously.
      Nevertheless we can relax alloc_rmap_item requirements and use
      __GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan
      would just retry later after the lock got dropped.
      
      Such a patch would be also easy to backport to older stable kernels which
      do not have oom_reaper.
      
      While we are at it add GFP_NOWARN so the admin doesn't have to be alarmed
      by the allocation failure.
      
      Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.comSigned-off-by: Nzhong jiang <zhongjiang@huawei.com>
      Suggested-by: NHugh Dickins <hughd@google.com>
      Suggested-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b398e41
  10. 27 7月, 2016 2 次提交
    • K
    • M
      mm: migrate: support non-lru movable page migration · bda807d4
      Minchan Kim 提交于
      We have allowed migration for only LRU pages until now and it was enough
      to make high-order pages.  But recently, embedded system(e.g., webOS,
      android) uses lots of non-movable pages(e.g., zram, GPU memory) so we
      have seen several reports about troubles of small high-order allocation.
      For fixing the problem, there were several efforts (e,g,.  enhance
      compaction algorithm, SLUB fallback to 0-order page, reserved memory,
      vmalloc and so on) but if there are lots of non-movable pages in system,
      their solutions are void in the long run.
      
      So, this patch is to support facility to change non-movable pages with
      movable.  For the feature, this patch introduces functions related to
      migration to address_space_operations as well as some page flags.
      
      If a driver want to make own pages movable, it should define three
      functions which are function pointers of struct
      address_space_operations.
      
      1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
      
      What VM expects on isolate_page function of driver is to return *true*
      if driver isolates page successfully.  On returing true, VM marks the
      page as PG_isolated so concurrent isolation in several CPUs skip the
      page for isolation.  If a driver cannot isolate the page, it should
      return *false*.
      
      Once page is successfully isolated, VM uses page.lru fields so driver
      shouldn't expect to preserve values in that fields.
      
      2. int (*migratepage) (struct address_space *mapping,
      		struct page *newpage, struct page *oldpage, enum migrate_mode);
      
      After isolation, VM calls migratepage of driver with isolated page.  The
      function of migratepage is to move content of the old page to new page
      and set up fields of struct page newpage.  Keep in mind that you should
      indicate to the VM the oldpage is no longer movable via
      __ClearPageMovable() under page_lock if you migrated the oldpage
      successfully and returns 0.  If driver cannot migrate the page at the
      moment, driver can return -EAGAIN.  On -EAGAIN, VM will retry page
      migration in a short time because VM interprets -EAGAIN as "temporal
      migration failure".  On returning any error except -EAGAIN, VM will give
      up the page migration without retrying in this time.
      
      Driver shouldn't touch page.lru field VM using in the functions.
      
      3. void (*putback_page)(struct page *);
      
      If migration fails on isolated page, VM should return the isolated page
      to the driver so VM calls driver's putback_page with migration failed
      page.  In this function, driver should put the isolated page back to the
      own data structure.
      
      4. non-lru movable page flags
      
      There are two page flags for supporting non-lru movable page.
      
      * PG_movable
      
      Driver should use the below function to make page movable under
      page_lock.
      
      	void __SetPageMovable(struct page *page, struct address_space *mapping)
      
      It needs argument of address_space for registering migration family
      functions which will be called by VM.  Exactly speaking, PG_movable is
      not a real flag of struct page.  Rather than, VM reuses page->mapping's
      lower bits to represent it.
      
      	#define PAGE_MAPPING_MOVABLE 0x2
      	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;
      
      so driver shouldn't access page->mapping directly.  Instead, driver
      should use page_mapping which mask off the low two bits of page->mapping
      so it can get right struct address_space.
      
      For testing of non-lru movable page, VM supports __PageMovable function.
      However, it doesn't guarantee to identify non-lru movable page because
      page->mapping field is unified with other variables in struct page.  As
      well, if driver releases the page after isolation by VM, page->mapping
      doesn't have stable value although it has PAGE_MAPPING_MOVABLE (Look at
      __ClearPageMovable).  But __PageMovable is cheap to catch whether page
      is LRU or non-lru movable once the page has been isolated.  Because LRU
      pages never can have PAGE_MAPPING_MOVABLE in page->mapping.  It is also
      good for just peeking to test non-lru movable pages before more
      expensive checking with lock_page in pfn scanning to select victim.
      
      For guaranteeing non-lru movable page, VM provides PageMovable function.
      Unlike __PageMovable, PageMovable functions validates page->mapping and
      mapping->a_ops->isolate_page under lock_page.  The lock_page prevents
      sudden destroying of page->mapping.
      
      Driver using __SetPageMovable should clear the flag via
      __ClearMovablePage under page_lock before the releasing the page.
      
      * PG_isolated
      
      To prevent concurrent isolation among several CPUs, VM marks isolated
      page as PG_isolated under lock_page.  So if a CPU encounters PG_isolated
      non-lru movable page, it can skip it.  Driver doesn't need to manipulate
      the flag because VM will set/clear it automatically.  Keep in mind that
      if driver sees PG_isolated page, it means the page have been isolated by
      VM so it shouldn't touch page.lru field.  PG_isolated is alias with
      PG_reclaim flag so driver shouldn't use the flag for own purpose.
      
      [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
        Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
      Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.orgSigned-off-by: NGioh Kim <gi-oh.kim@profitbricks.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: John Einar Reitan <john.reitan@foss.arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bda807d4
  11. 13 5月, 2016 1 次提交
    • Z
      ksm: fix conflict between mmput and scan_get_next_rmap_item · 7496fea9
      Zhou Chengming 提交于
      A concurrency issue about KSM in the function scan_get_next_rmap_item.
      
      task A (ksmd):				|task B (the mm's task):
      					|
      mm = slot->mm;				|
      down_read(&mm->mmap_sem);		|
      					|
      ...					|
      					|
      spin_lock(&ksm_mmlist_lock);		|
      					|
      ksm_scan.mm_slot go to the next slot;	|
      					|
      spin_unlock(&ksm_mmlist_lock);		|
      					|mmput() ->
      					|	ksm_exit():
      					|
      					|spin_lock(&ksm_mmlist_lock);
      					|if (mm_slot && ksm_scan.mm_slot != mm_slot) {
      					|	if (!mm_slot->rmap_list) {
      					|		easy_to_free = 1;
      					|		...
      					|
      					|if (easy_to_free) {
      					|	mmdrop(mm);
      					|	...
      					|
      					|So this mm_struct may be freed in the mmput().
      					|
      up_read(&mm->mmap_sem);			|
      
      As we can see above, the ksmd thread may access a mm_struct that already
      been freed to the kmem_cache.  Suppose a fork will get this mm_struct from
      the kmem_cache, the ksmd thread then call up_read(&mm->mmap_sem), will
      cause mmap_sem.count to become -1.
      
      As suggested by Andrea Arcangeli, unmerge_and_remove_all_rmap_items has
      the same SMP race condition, so fix it too.  My prev fix in function
      scan_get_next_rmap_item will introduce a different SMP race condition, so
      just invert the up_read/spin_unlock order as Andrea Arcangeli said.
      
      Link: http://lkml.kernel.org/r/1462708815-31301-1-git-send-email-zhouchengming1@huawei.comSigned-off-by: NZhou Chengming <zhouchengming1@huawei.com>
      Suggested-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Geliang Tang <geliangtang@163.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Ding Tianhong <dingtianhong@huawei.com>
      Cc: Li Bin <huawei.libin@huawei.com>
      Cc: Zhen Lei <thunder.leizhen@huawei.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7496fea9
  12. 19 2月, 2016 1 次提交
    • D
      mm/core: Do not enforce PKEY permissions on remote mm access · 1b2ee126
      Dave Hansen 提交于
      We try to enforce protection keys in software the same way that we
      do in hardware.  (See long example below).
      
      But, we only want to do this when accessing our *own* process's
      memory.  If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
      tried to PTRACE_POKE a target process which just happened to have
      some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
      debugger access to that memory.  PKRU is fundamentally a
      thread-local structure and we do not want to enforce it on access
      to _another_ thread's data.
      
      This gets especially tricky when we have workqueues or other
      delayed-work mechanisms that might run in a random process's context.
      We can check that we only enforce pkeys when operating on our *own* mm,
      but delayed work gets performed when a random user context is active.
      We might end up with a situation where a delayed-work gup fails when
      running randomly under its "own" task but succeeds when running under
      another process.  We want to avoid that.
      
      To avoid that, we use the new GUP flag: FOLL_REMOTE and add a
      fault flag: FAULT_FLAG_REMOTE.  They indicate that we are
      walking an mm which is not guranteed to be the same as
      current->mm and should not be subject to protection key
      enforcement.
      
      Thanks to Jerome Glisse for pointing out this scenario.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Geliang Tang <geliangtang@163.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: iommu@lists.linux-foundation.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-s390@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      1b2ee126
  13. 16 2月, 2016 1 次提交
    • D
      mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm · d4edcf0d
      Dave Hansen 提交于
      We will soon modify the vanilla get_user_pages() so it can no
      longer be used on mm/tasks other than 'current/current->mm',
      which is by far the most common way it is called.  For now,
      we allow the old-style calls, but warn when they are used.
      (implemented in previous patch)
      
      This patch switches all callers of:
      
      	get_user_pages()
      	get_user_pages_unlocked()
      	get_user_pages_locked()
      
      to stop passing tsk/mm so they will no longer see the warnings.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: jack@suse.cz
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d4edcf0d
  14. 16 1月, 2016 4 次提交
    • M
      mm/ksm.c: mark stable page dirty · 337ed7eb
      Minchan Kim 提交于
      The MADV_FREE patchset changes page reclaim to simply free a clean
      anonymous page with no dirty ptes, instead of swapping it out; but KSM
      uses clean write-protected ptes to reference the stable ksm page.  So be
      sure to mark that page dirty, so it's never mistakenly discarded.
      
      [hughd@google.com: adjusted comments]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      337ed7eb
    • K
      ksm: prepare to new THP semantics · f765f540
      Kirill A. Shutemov 提交于
      We don't need special code to stabilize THP.  If you've got reference to
      any subpage of THP it will not be split under you.
      
      New split_huge_page() also accepts tail pages: no need in special code
      to get reference to head page.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f765f540
    • K
      rmap: add argument to charge compound page · d281ee61
      Kirill A. Shutemov 提交于
      We're going to allow mapping of individual 4k pages of THP compound
      page.  It means we cannot rely on PageTransHuge() check to decide if
      map/unmap small page or THP.
      
      The patch adds new argument to rmap functions to indicate whether we
      want to operate on whole compound page or only the small page.
      
      [n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d281ee61
    • K
      page-flags: define PG_locked behavior on compound pages · 48c935ad
      Kirill A. Shutemov 提交于
      lock_page() must operate on the whole compound page.  It doesn't make
      much sense to lock part of compound page.  Change code to use head
      page's PG_locked, if tail page is passed.
      
      This patch also gets rid of custom helper functions --
      __set_page_locked() and __clear_page_locked().  They are replaced with
      helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG.  Tail pages to these
      helper would trigger VM_BUG_ON().
      
      SLUB uses PG_locked as a bit spin locked.  IIUC, tail pages should never
      appear there.  VM_BUG_ON() is added to make sure that this assumption is
      correct.
      
      [akpm@linux-foundation.org: fix fs/cifs/file.c]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      48c935ad
  15. 15 1月, 2016 1 次提交
  16. 06 11月, 2015 5 次提交
  17. 16 4月, 2015 1 次提交
  18. 11 2月, 2015 1 次提交
  19. 30 1月, 2015 1 次提交
    • L
      vm: add VM_FAULT_SIGSEGV handling support · 33692f27
      Linus Torvalds 提交于
      The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
      "you should SIGSEGV" error, because the SIGSEGV case was generally
      handled by the caller - usually the architecture fault handler.
      
      That results in lots of duplication - all the architecture fault
      handlers end up doing very similar "look up vma, check permissions, do
      retries etc" - but it generally works.  However, there are cases where
      the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
      
      In particular, when accessing the stack guard page, libsigsegv expects a
      SIGSEGV.  And it usually got one, because the stack growth is handled by
      that duplicated architecture fault handler.
      
      However, when the generic VM layer started propagating the error return
      from the stack expansion in commit fee7e49d ("mm: propagate error
      from stack expansion even for guard page"), that now exposed the
      existing VM_FAULT_SIGBUS result to user space.  And user space really
      expected SIGSEGV, not SIGBUS.
      
      To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
      duplicate architecture fault handlers about it.  They all already have
      the code to handle SIGSEGV, so it's about just tying that new return
      value to the existing code, but it's all a bit annoying.
      
      This is the mindless minimal patch to do this.  A more extensive patch
      would be to try to gather up the mostly shared fault handling logic into
      one generic helper routine, and long-term we really should do that
      cleanup.
      
      Just from this patch, you can generally see that most architectures just
      copied (directly or indirectly) the old x86 way of doing things, but in
      the meantime that original x86 model has been improved to hold the VM
      semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
      "newer" things, so it would be a good idea to bring all those
      improvements to the generic case and teach other architectures about
      them too.
      Reported-and-tested-by: NTakashi Iwai <tiwai@suse.de>
      Tested-by: NJan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33692f27
  20. 10 10月, 2014 1 次提交
  21. 16 7月, 2014 1 次提交
    • N
      sched: Remove proliferation of wait_on_bit() action functions · 74316201
      NeilBrown 提交于
      The current "wait_on_bit" interface requires an 'action'
      function to be provided which does the actual waiting.
      There are over 20 such functions, many of them identical.
      Most cases can be satisfied by one of just two functions, one
      which uses io_schedule() and one which just uses schedule().
      
      So:
       Rename wait_on_bit and        wait_on_bit_lock to
              wait_on_bit_action and wait_on_bit_lock_action
       to make it explicit that they need an action function.
      
       Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
       which are *not* given an action function but implicitly use
       a standard one.
       The decision to error-out if a signal is pending is now made
       based on the 'mode' argument rather than being encoded in the action
       function.
      
       All instances of the old wait_on_bit and wait_on_bit_lock which
       can use the new version have been changed accordingly and their
       action functions have been discarded.
       wait_on_bit{_lock} does not return any specific error code in the
       event of a signal so the caller must check for non-zero and
       interpolate their own error code as appropriate.
      
      The wait_on_bit() call in __fscache_wait_on_invalidate() was
      ambiguous as it specified TASK_UNINTERRUPTIBLE but used
      fscache_wait_bit_interruptible as an action function.
      David Howells confirms this should be uniformly
      "uninterruptible"
      
      The main remaining user of wait_on_bit{,_lock}_action is NFS
      which needs to use a freezer-aware schedule() call.
      
      A comment in fs/gfs2/glock.c notes that having multiple 'action'
      functions is useful as they display differently in the 'wchan'
      field of 'ps'. (and /proc/$PID/wchan).
      As the new bit_wait{,_io} functions are tagged "__sched", they
      will not show up at all, but something higher in the stack.  So
      the distinction will still be visible, only with different
      function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
      gfs2/glock.c case).
      
      Since first version of this patch (against 3.15) two new action
      functions appeared, on in NFS and one in CIFS.  CIFS also now
      uses an action function that makes the same freezer aware
      schedule call as NFS.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
      Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steve French <sfrench@samba.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brownSigned-off-by: NIngo Molnar <mingo@kernel.org>
      74316201
  22. 24 6月, 2014 1 次提交
    • H
      mm: let mm_find_pmd fix buggy race with THP fault · f72e7dcd
      Hugh Dickins 提交于
      Trinity has reported:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
          IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
          CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G        W
                                  3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
          lock_acquire (arch/x86/include/asm/current.h:14
                        kernel/locking/lockdep.c:3602)
          _raw_spin_lock (include/linux/spinlock_api_smp.h:143
                          kernel/locking/spinlock.c:151)
          remove_migration_pte (mm/migrate.c:137)
          rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
          remove_migration_ptes (mm/migrate.c:224)
          migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
          migrate_misplaced_page (mm/migrate.c:1733)
          __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
          handle_mm_fault (mm/memory.c:3948)
          __get_user_pages (mm/memory.c:1851)
          __mlock_vma_pages_range (mm/mlock.c:255)
          __mm_populate (mm/mlock.c:711)
          SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
      
      I believe this comes about because, whereas collapsing and splitting THP
      functions take anon_vma lock in write mode (which excludes concurrent
      rmap walks), faulting THP functions (write protection and misplaced
      NUMA) do not - and mostly they do not need to.
      
      But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
      an instant (indeed, for a long instant, given the inter-CPU TLB flush in
      there), leaves *pmd neither present not trans_huge.
      
      Which can confuse a concurrent rmap walk, as when removing migration
      ptes, seen in the dumped trace.  Although that rmap walk has a 4k page
      to insert, anon_vmas containing THPs are in no way segregated from
      4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
      that instant when a trans_huge pmd is temporarily absent.
      
      I don't think we need strengthen the locking at the THP end: it's easily
      handled with an ACCESS_ONCE() before testing both conditions.
      
      And since mm_find_pmd() had only one caller who wanted a THP rather than
      a pmd, let's slightly repurpose it to fail when it hits a THP or
      non-present pmd, and open code split_huge_page_address() again.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Lameter <cl@gentwo.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f72e7dcd
  23. 04 3月, 2014 1 次提交
    • D
      mm: close PageTail race · 668f9abb
      David Rientjes 提交于
      Commit bf6bddf1 ("mm: introduce compaction and migration for
      ballooned pages") introduces page_count(page) into memory compaction
      which dereferences page->first_page if PageTail(page).
      
      This results in a very rare NULL pointer dereference on the
      aforementioned page_count(page).  Indeed, anything that does
      compound_head(), including page_count() is susceptible to racing with
      prep_compound_page() and seeing a NULL or dangling page->first_page
      pointer.
      
      This patch uses Andrea's implementation of compound_trans_head() that
      deals with such a race and makes it the default compound_head()
      implementation.  This includes a read memory barrier that ensures that
      if PageTail(head) is true that we return a head page that is neither
      NULL nor dangling.  The patch then adds a store memory barrier to
      prep_compound_page() to ensure page->first_page is set.
      
      This is the safest way to ensure we see the head page that we are
      expecting, PageTail(page) is already in the unlikely() path and the
      memory barriers are unfortunately required.
      
      Hugetlbfs is the exception, we don't enforce a store memory barrier
      during init since no race is possible.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      668f9abb
  24. 24 1月, 2014 1 次提交
    • P
      mm: audit/fix non-modular users of module_init in core code · a64fb3cd
      Paul Gortmaker 提交于
      Code that is obj-y (always built-in) or dependent on a bool Kconfig
      (built-in or absent) can never be modular.  So using module_init as an
      alias for __initcall can be somewhat misleading.
      
      Fix these up now, so that we can relocate module_init from init.h into
      module.h in the future.  If we don't do this, we'd have to add module.h
      to obviously non-modular code, and that would be a worse thing.
      
      The audit targets the following module_init users for change:
       mm/ksm.c                       bool KSM
       mm/mmap.c                      bool MMU
       mm/huge_memory.c               bool TRANSPARENT_HUGEPAGE
       mm/mmu_notifier.c              bool MMU_NOTIFIER
      
      Note that direct use of __initcall is discouraged, vs.  one of the
      priority categorized subgroups.  As __initcall gets mapped onto
      device_initcall, our use of subsys_initcall (which makes sense for these
      files) will thus change this registration from level 6-device to level
      4-subsys (i.e.  slightly earlier).
      
      However no observable impact of that difference has been observed during
      testing.
      
      One might think that core_initcall (l2) or postcore_initcall (l3) would
      be more appropriate for anything in mm/ but if we look at some actual
      init functions themselves, we see things like:
      
      mm/huge_memory.c --> hugepage_init     --> hugepage_init_sysfs
      mm/mmap.c        --> init_user_reserve --> sysctl_user_reserve_kbytes
      mm/ksm.c         --> ksm_init          --> sysfs_create_group
      
      and hence the choice of subsys_initcall (l4) seems reasonable, and at
      the same time minimizes the risk of changing the priority too
      drastically all at once.  We can adjust further in the future.
      
      Also, several instances of missing ";" at EOL are fixed.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a64fb3cd