1. 16 1月, 2016 3 次提交
    • K
      mm: rework mapcount accounting to enable 4k mapping of THPs · 53f9263b
      Kirill A. Shutemov 提交于
      We're going to allow mapping of individual 4k pages of THP compound.  It
      means we need to track mapcount on per small page basis.
      
      Straight-forward approach is to use ->_mapcount in all subpages to track
      how many time this subpage is mapped with PMDs or PTEs combined.  But
      this is rather expensive: mapping or unmapping of a THP page with PMD
      would require HPAGE_PMD_NR atomic operations instead of single we have
      now.
      
      The idea is to store separately how many times the page was mapped as
      whole -- compound_mapcount.  This frees up ->_mapcount in subpages to
      track PTE mapcount.
      
      We use the same approach as with compound page destructor and compound
      order to store compound_mapcount: use space in first tail page,
      ->mapping this time.
      
      Any time we map/unmap whole compound page (THP or hugetlb) -- we
      increment/decrement compound_mapcount.  When we map part of compound
      page with PTE we operate on ->_mapcount of the subpage.
      
      page_mapcount() counts both: PTE and PMD mappings of the page.
      
      Basically, we have mapcount for a subpage spread over two counters.  It
      makes tricky to detect when last mapcount for a page goes away.
      
      We introduced PageDoubleMap() for this.  When we split THP PMD for the
      first time and there's other PMD mapping left we offset up ->_mapcount
      in all subpages by one and set PG_double_map on the compound page.
      These additional references go away with last compound_mapcount.
      
      This approach provides a way to detect when last mapcount goes away on
      per small page basis without introducing new overhead for most common
      cases.
      
      [akpm@linux-foundation.org: fix typo in comment]
      [mhocko@suse.com: ignore partial THP when moving task]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53f9263b
    • K
      rmap: add argument to charge compound page · d281ee61
      Kirill A. Shutemov 提交于
      We're going to allow mapping of individual 4k pages of THP compound
      page.  It means we cannot rely on PageTransHuge() check to decide if
      map/unmap small page or THP.
      
      The patch adds new argument to rmap functions to indicate whether we
      want to operate on whole compound page or only the small page.
      
      [n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d281ee61
    • K
      page-flags: define PG_locked behavior on compound pages · 48c935ad
      Kirill A. Shutemov 提交于
      lock_page() must operate on the whole compound page.  It doesn't make
      much sense to lock part of compound page.  Change code to use head
      page's PG_locked, if tail page is passed.
      
      This patch also gets rid of custom helper functions --
      __set_page_locked() and __clear_page_locked().  They are replaced with
      helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG.  Tail pages to these
      helper would trigger VM_BUG_ON().
      
      SLUB uses PG_locked as a bit spin locked.  IIUC, tail pages should never
      appear there.  VM_BUG_ON() is added to make sure that this assumption is
      correct.
      
      [akpm@linux-foundation.org: fix fs/cifs/file.c]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      48c935ad
  2. 07 11月, 2015 2 次提交
    • M
      mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM · 71baba4b
      Mel Gorman 提交于
      __GFP_WAIT was used to signal that the caller was in atomic context and
      could not sleep.  Now it is possible to distinguish between true atomic
      context and callers that are not willing to sleep.  The latter should
      clear __GFP_DIRECT_RECLAIM so kswapd will still wake.  As clearing
      __GFP_WAIT behaves differently, there is a risk that people will clear the
      wrong flags.  This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
      indicate what it does -- setting it allows all reclaim activity, clearing
      them prevents it.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71baba4b
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  3. 06 11月, 2015 10 次提交
    • H
      mm: migrate dirty page without clear_page_dirty_for_io etc · 42cb14b1
      Hugh Dickins 提交于
      clear_page_dirty_for_io() has accumulated writeback and memcg subtleties
      since v2.6.16 first introduced page migration; and the set_page_dirty()
      which completed its migration of PageDirty, later had to be moderated to
      __set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too.
      
      No actual problems seen with this procedure recently, but if you look into
      what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually
      achieving, it turns out to be nothing more than moving the PageDirty flag,
      and its NR_FILE_DIRTY stat from one zone to another.
      
      It would be good to avoid a pile of irrelevant decrementations and
      incrementations, and improper event counting, and unnecessary descent of
      the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which
      radix_tree_replace_slot() left in place anyway).
      
      Do the NR_FILE_DIRTY movement, like the other stats movements, while
      interrupts still disabled in migrate_page_move_mapping(); and don't even
      bother if the zone is the same.  Do the PageDirty movement there under
      tree_lock too, where old page is frozen and newpage not yet visible:
      bearing in mind that as soon as newpage becomes visible in radix_tree, an
      un-page-locked set_page_dirty() might interfere (or perhaps that's just
      not possible: anything doing so should already hold an additional
      reference to the old page, preventing its migration; but play safe).
      
      But we do still need to transfer PageDirty in migrate_page_copy(), for
      those who don't go the mapping route through migrate_page_move_mapping().
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42cb14b1
    • H
      mm: page migration avoid touching newpage until no going back · cf4b769a
      Hugh Dickins 提交于
      We have had trouble in the past from the way in which page migration's
      newpage is initialized in dribs and drabs - see commit 8bdd6380 ("mm:
      fix direct reclaim writeback regression") which proposed a cleanup.
      
      We have no actual problem now, but I think the procedure would be clearer
      (and alternative get_new_page pools safer to implement) if we assert that
      newpage is not touched until we are sure that it's going to be used -
      except for taking the trylock on it in __unmap_and_move().
      
      So shift the early initializations from move_to_new_page() into
      migrate_page_move_mapping(), mapping and NULL-mapping paths.  Similarly
      migrate_huge_page_move_mapping(), but its NULL-mapping path can just be
      deleted: you cannot reach hugetlbfs_migrate_page() with a NULL mapping.
      
      Adjust stages 3 to 8 in the Documentation file accordingly.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf4b769a
    • H
      mm: simplify page migration's anon_vma comment and flow · 03f15c86
      Hugh Dickins 提交于
      __unmap_and_move() contains a long stale comment on page_get_anon_vma()
      and PageSwapCache(), with an odd control flow that's hard to follow.
      Mostly this reflects our confusion about the lifetime of an anon_vma, in
      the early days of page migration, before we could take a reference to one.
       Nowadays this seems quite straightforward: cut it all down to essentials.
      
      I cannot see the relevance of swapcache here at all, so don't treat it any
      differently: I believe the old comment reflects in part our anon_vma
      confusions, and in part the original v2.6.16 page migration technique,
      which used actual swap to migrate anon instead of swap-like migration
      entries.  Why should a swapcache page not be migrated with the aid of
      migration entry ptes like everything else?  So lose that comment now, and
      enable migration entries for swapcache in the next patch.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      03f15c86
    • H
      mm: page migration remove_migration_ptes at lock+unlock level · 5c3f9a67
      Hugh Dickins 提交于
      Clean up page migration a little more by calling remove_migration_ptes()
      from the same level, on success or on failure, from __unmap_and_move() or
      from unmap_and_move_huge_page().
      
      Don't reset page->mapping of a PageAnon old page in move_to_new_page(),
      leave that to when the page is freed.  Except for here in page migration,
      it has been an invariant that a PageAnon (bit set in page->mapping) page
      stays PageAnon until it is freed, and I think we're safer to keep to that.
      
      And with the above rearrangement, it's necessary because zap_pte_range()
      wants to identify whether a migration entry represents a file or an anon
      page, to update the appropriate rss stats without waiting on it.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c3f9a67
    • H
      mm: page migration trylock newpage at same level as oldpage · 7db7671f
      Hugh Dickins 提交于
      Clean up page migration a little by moving the trylock of newpage from
      move_to_new_page() into __unmap_and_move(), where the old page has been
      locked.  Adjust unmap_and_move_huge_page() and balloon_page_migrate()
      accordingly.
      
      But make one kind-of-functional change on the way: whereas trylock of
      newpage used to BUG() if it failed, now simply return -EAGAIN if so.
      Cutting out BUG()s is good, right?  But, to be honest, this is really to
      extend the usefulness of the custom put_new_page feature, allowing a pool
      of new pages to be shared perhaps with racing uses.
      
      Use an "else" instead of that "skip_unmap" label.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7db7671f
    • H
      mm: page migration use the put_new_page whenever necessary · 2def7424
      Hugh Dickins 提交于
      I don't know of any problem from the way it's used in our current tree,
      but there is one defect in page migration's custom put_new_page feature.
      
      An unused newpage is expected to be released with the put_new_page(), but
      there was one MIGRATEPAGE_SUCCESS (0) path which released it with
      putback_lru_page(): which can be very wrong for a custom pool.
      
      Fixed more easily by resetting put_new_page once it won't be needed, than
      by adding a further flag to modify the rc test.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2def7424
    • H
      mm: correct a couple of page migration comments · 14e0f9bc
      Hugh Dickins 提交于
      It's migrate.c not migration,c, and nowadays putback_movable_pages() not
      putback_lru_pages().
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      14e0f9bc
    • H
      mm: rename mem_cgroup_migrate to mem_cgroup_replace_page · 45637bab
      Hugh Dickins 提交于
      After v4.3's commit 0610c25d ("memcg: fix dirty page migration")
      mem_cgroup_migrate() doesn't have much to offer in page migration: convert
      migrate_misplaced_transhuge_page() to set_page_memcg() instead.
      
      Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its
      remaining callers are replace_page_cache_page() and shmem_replace_page():
      both of whom passed lrucare true, so just eliminate that argument.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45637bab
    • H
      mm: page migration fix PageMlocked on migrated pages · 51afb12b
      Hugh Dickins 提交于
      Commit e6c509f8 ("mm: use clear_page_mlock() in page_remove_rmap()")
      in v3.7 inadvertently made mlock_migrate_page() impotent: page migration
      unmaps the page from userspace before migrating, and that commit clears
      PageMlocked on the final unmap, leaving mlock_migrate_page() with
      nothing to do.  Not a serious bug, the next attempt at reclaiming the
      page would fix it up; but a betrayal of page migration's intent - the
      new page ought to emerge as PageMlocked.
      
      I don't see how to fix it for mlock_migrate_page() itself; but easily
      fixed in remove_migration_pte(), by calling mlock_vma_page() when the vma
      is VM_LOCKED - under pte lock as in try_to_unmap_one().
      
      Delete mlock_migrate_page()?  Not quite, it does still serve a purpose for
      migrate_misplaced_transhuge_page(): where we could replace it by a test,
      clear_page_mlock(), mlock_vma_page() sequence; but would that be an
      improvement?  mlock_migrate_page() is fairly lean, and let's make it
      leaner by skipping the irq save/restore now clearly not needed.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51afb12b
    • V
      mm, migrate: count pages failing all retries in vmstat and tracepoint · f2f81fb2
      Vlastimil Babka 提交于
      Migration tries up to 10 times to migrate pages that return -EAGAIN until
      it gives up.  If some pages fail all retries, they are counted towards the
      number of failed pages that migrate_pages() returns.  They should also be
      counted in the /proc/vmstat pgmigrate_fail and in the mm_migrate_pages
      tracepoint.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f2f81fb2
  4. 02 10月, 2015 1 次提交
    • G
      memcg: fix dirty page migration · 0610c25d
      Greg Thelen 提交于
      The problem starts with a file backed dirty page which is charged to a
      memcg.  Then page migration is used to move oldpage to newpage.
      
      Migration:
       - copies the oldpage's data to newpage
       - clears oldpage.PG_dirty
       - sets newpage.PG_dirty
       - uncharges oldpage from memcg
       - charges newpage to memcg
      
      Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
      count.
      
      However, because newpage is not yet charged, setting newpage.PG_dirty
      does not increment the memcg's dirty page count.  After migration
      completes newpage.PG_dirty is eventually cleared, often in
      account_page_cleaned().  At this time newpage is charged to a memcg so
      the memcg's dirty page count is decremented which causes underflow
      because the count was not previously incremented by migration.  This
      underflow causes balance_dirty_pages() to see a very large unsigned
      number of dirty memcg pages which leads to aggressive throttling of
      buffered writes by processes in non root memcg.
      
      This issue:
       - can harm performance of non root memcg buffered writes.
       - can report too small (even negative) values in
         memory.stat[(total_)dirty] counters of all memcg, including the root.
      
      To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
      page_memcg() and set_page_memcg() helpers.
      
      Test:
          0) setup and enter limited memcg
          mkdir /sys/fs/cgroup/test
          echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
          echo $$ > /sys/fs/cgroup/test/cgroup.procs
      
          1) buffered writes baseline
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
          2) buffered writes with compaction antagonist to induce migration
          yes 1 > /proc/sys/vm/compact_memory &
          rm -rf /data/tmp/foo
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          kill %
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
          3) buffered writes without antagonist, should match baseline
          rm -rf /data/tmp/foo
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
                             (speed, dirty residue)
                   unpatched                       patched
          1) 841 MB/s 0 dirty pages          886 MB/s 0 dirty pages
          2) 611 MB/s -33427456 dirty pages  793 MB/s 0 dirty pages
          3) 114 MB/s -33427456 dirty pages  891 MB/s 0 dirty pages
      
          Notice that unpatched baseline performance (1) fell after
          migration (3): 841 -> 114 MB/s.  In the patched kernel, post
          migration performance matches baseline.
      
      Fixes: c4843a75 ("memcg: add per cgroup dirty page accounting")
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Reported-by: NDave Hansen <dave.hansen@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0610c25d
  5. 23 9月, 2015 1 次提交
  6. 11 9月, 2015 1 次提交
    • V
      mm: introduce idle page tracking · 33c3fc71
      Vladimir Davydov 提交于
      Knowing the portion of memory that is not used by a certain application or
      memory cgroup (idle memory) can be useful for partitioning the system
      efficiently, e.g.  by setting memory cgroup limits appropriately.
      Currently, the only means to estimate the amount of idle memory provided
      by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
      access bit for all pages mapped to a particular process by writing 1 to
      clear_refs, wait for some time, and then count smaps:Referenced.  However,
      this method has two serious shortcomings:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      To overcome these drawbacks, this patch introduces two new page flags,
      Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
      A page's Idle flag can only be set from userspace by setting bit in
      /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
      and it is cleared whenever the page is accessed either through page tables
      (it is cleared in page_referenced() in this case) or using the read(2)
      system call (mark_page_accessed()). Thus by setting the Idle flag for
      pages of a particular workload, which can be found e.g.  by reading
      /proc/PID/pagemap, waiting for some time to let the workload access its
      working set, and then reading the bitmap file, one can estimate the amount
      of pages that are not used by the workload.
      
      The Young page flag is used to avoid interference with the memory
      reclaimer.  A page's Young flag is set whenever the Access bit of a page
      table entry pointing to the page is cleared by writing to the bitmap file.
      If page_referenced() is called on a Young page, it will add 1 to its
      return value, therefore concealing the fact that the Access bit was
      cleared.
      
      Note, since there is no room for extra page flags on 32 bit, this feature
      uses extended page flags when compiled on 32 bit.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: kpageidle requires an MMU]
      [akpm@linux-foundation.org: decouple from page-flags rework]
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33c3fc71
  7. 09 9月, 2015 2 次提交
    • V
      mm: rename alloc_pages_exact_node() to __alloc_pages_node() · 96db800f
      Vlastimil Babka 提交于
      alloc_pages_exact_node() was introduced in commit 6484eb3e ("page
      allocator: do not check NUMA node ID when the caller knows the node is
      valid") as an optimized variant of alloc_pages_node(), that doesn't
      fallback to current node for nid == NUMA_NO_NODE.  Unfortunately the
      name of the function can easily suggest that the allocation is
      restricted to the given node and fails otherwise.  In truth, the node is
      only preferred, unless __GFP_THISNODE is passed among the gfp flags.
      
      The misleading name has lead to mistakes in the past, see for example
      commits 5265047a ("mm, thp: really limit transparent hugepage
      allocation to local node") and b360edb4 ("mm, mempolicy:
      migrate_to_node should only migrate to node").
      
      Another issue with the name is that there's a family of
      alloc_pages_exact*() functions where 'exact' means exact size (instead
      of page order), which leads to more confusion.
      
      To prevent further mistakes, this patch effectively renames
      alloc_pages_exact_node() to __alloc_pages_node() to better convey that
      it's an optimized variant of alloc_pages_node() not intended for general
      usage.  Both functions get described in comments.
      
      It has been also considered to really provide a convenience function for
      allocations restricted to a node, but the major opinion seems to be that
      __GFP_THISNODE already provides that functionality and we shouldn't
      duplicate the API needlessly.  The number of users would be small
      anyway.
      
      Existing callers of alloc_pages_exact_node() are simply converted to
      call __alloc_pages_node(), with the exception of sba_alloc_coherent()
      which open-codes the check for NUMA_NO_NODE, so it is converted to use
      alloc_pages_node() instead.  This means it no longer performs some
      VM_BUG_ON checks, and since the current check for nid in
      alloc_pages_node() uses a 'nid < 0' comparison (which includes
      NUMA_NO_NODE), it may hide wrong values which would be previously
      exposed.
      
      Both differences will be rectified by the next patch.
      
      To sum up, this patch makes no functional changes, except temporarily
      hiding potentially buggy callers.  Restricting the checks in
      alloc_pages_node() is left for the next patch which can in turn expose
      more existing buggy callers.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NRobin Holt <robinmholt@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cliff Whickman <cpw@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96db800f
    • W
      mm/hwpoison: fix race between soft_offline_page and unpoison_memory · da1b13cc
      Wanpeng Li 提交于
      Wanpeng Li reported a race between soft_offline_page() and
      unpoison_memory(), which causes the following kernel panic:
      
         BUG: Bad page state in process bash  pfn:97000
         page:ffffea00025c0000 count:0 mapcount:1 mapping:          (null) index:0x7f4fdbe00
         flags: 0x1fffff80080048(uptodate|active|swapbacked)
         page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
         bad because of flags:
         flags: 0x40(active)
         Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core
         CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45
         Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
         Call Trace:
           dump_stack+0x48/0x5c
           bad_page+0xe6/0x140
           free_pages_prepare+0x2f9/0x320
           ? uncharge_list+0xdd/0x100
           free_hot_cold_page+0x40/0x170
           __put_single_page+0x20/0x30
           put_page+0x25/0x40
           unmap_and_move+0x1a6/0x1f0
           migrate_pages+0x100/0x1d0
           ? kill_procs+0x100/0x100
           ? unlock_page+0x6f/0x90
           __soft_offline_page+0x127/0x2a0
           soft_offline_page+0xa6/0x200
      
      This race is explained like below:
      
        CPU0                    CPU1
      
        soft_offline_page
        __soft_offline_page
        TestSetPageHWPoison
                              unpoison_memory
                              PageHWPoison check (true)
                              TestClearPageHWPoison
                              put_page    -> release refcount held by get_hwpoison_page in unpoison_memory
                              put_page    -> release refcount held by isolate_lru_page in __soft_offline_page
        migrate_pages
      
      The second put_page() releases refcount held by isolate_lru_page() which
      will lead to unmap_and_move() releases the last refcount of page and w/
      mapcount still 1 since try_to_unmap() is not called if there is only one
      user map the page.  Anyway, the page refcount and mapcount will still
      mess if the page is mapped by multiple users.
      
      This race was introduced by commit 4491f712 ("mm/memory-failure: set
      PageHWPoison before migrate_pages()"), which focuses on preventing the
      reuse of successfully migrated page.  Before this commit we prevent the
      reuse by changing the migratetype to MIGRATE_ISOLATE during soft
      offlining, which has the following problems, so simply reverting the
      commit is not a best option:
      
        1) it doesn't eliminate the reuse completely, because
           set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the
           target page if the pageblock of the page contains one or more
           unmovable pages (i.e.  has_unmovable_pages() returns true).
      
        2) the original code changes migratetype to MIGRATE_ISOLATE
           forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline,
           regardless of the original migratetype state, which could impact
           other subsystems like memory hotplug or compaction.
      
      This patch moves PageSetHWPoison just after put_page() in
      unmap_and_move(), which closes up the reported race window and minimizes
      another race window b/w SetPageHWPoison and reallocation (which causes
      the reuse of soft-offlined page.) The latter race window still exists
      but it's acceptable, because it's rare and effectively the same as
      ordinary "containment failure" case even if it happens, so keep the
      window open is acceptable.
      
      Fixes: 4491f712 ("mm/memory-failure: set PageHWPoison before migrate_pages()")
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Tested-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da1b13cc
  8. 05 9月, 2015 1 次提交
  9. 07 8月, 2015 2 次提交
    • N
      mm/memory-failure: set PageHWPoison before migrate_pages() · 4491f712
      Naoya Horiguchi 提交于
      Now page freeing code doesn't consider PageHWPoison as a bad page, so by
      setting it before completing the page containment, we can prevent the
      error page from being reused just after successful page migration.
      
      I added TTU_IGNORE_HWPOISON for try_to_unmap() to make sure that the
      page table entry is transformed into migration entry, not to hwpoison
      entry.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dean Nelson <dnelson@redhat.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4491f712
    • N
      mm: check __PG_HWPOISON separately from PAGE_FLAGS_CHECK_AT_* · f4c18e6f
      Naoya Horiguchi 提交于
      The race condition addressed in commit add05cec ("mm: soft-offline:
      don't free target page in successful page migration") was not closed
      completely, because that can happen not only for soft-offline, but also
      for hard-offline.  Consider that a slab page is about to be freed into
      buddy pool, and then an uncorrected memory error hits the page just
      after entering __free_one_page(), then VM_BUG_ON_PAGE(page->flags &
      PAGE_FLAGS_CHECK_AT_PREP) is triggered, despite the fact that it's not
      necessary because the data on the affected page is not consumed.
      
      To solve it, this patch drops __PG_HWPOISON from page flag checks at
      allocation/free time.  I think it's justified because __PG_HWPOISON
      flags is defined to prevent the page from being reused, and setting it
      outside the page's alloc-free cycle is a designed behavior (not a bug.)
      
      For recent months, I was annoyed about BUG_ON when soft-offlined page
      remains on lru cache list for a while, which is avoided by calling
      put_page() instead of putback_lru_page() in page migration's success
      path.  This means that this patch reverts a major change from commit
      add05cec about the new refcounting rule of soft-offlined pages, so
      "reuse window" revives.  This will be closed by a subsequent patch.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dean Nelson <dnelson@redhat.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f4c18e6f
  10. 25 6月, 2015 2 次提交
    • A
      mm: clarify that the function operates on hugepage pte · 8809aa2d
      Aneesh Kumar K.V 提交于
      We have confusing functions to clear pmd, pmd_clear_* and pmd_clear.  Add
      _huge_ to pmdp_clear functions so that we are clear that they operate on
      hugepage pte.
      
      We don't bother about other functions like pmdp_set_wrprotect,
      pmdp_clear_flush_young, because they operate on PTE bits and hence
      indicate they are operating on hugepage ptes
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8809aa2d
    • N
      mm: soft-offline: don't free target page in successful page migration · add05cec
      Naoya Horiguchi 提交于
      Stress testing showed that soft offline events for a process iterating
      "mmap-pagefault-munmap" loop can trigger
      VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP) in __free_one_page():
      
        Soft offlining page 0x70fe1 at 0x70100008d000
        Soft offlining page 0x705fb at 0x70300008d000
        page:ffffea0001c3f840 count:0 mapcount:0 mapping:          (null) index:0x2
        flags: 0x1fffff80800000(hwpoison)
        page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1))
        ------------[ cut here ]------------
        kernel BUG at /src/linux-dev/mm/page_alloc.c:585!
        invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
        Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy
        CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139
        RIP: free_pcppages_bulk+0x52a/0x6f0
        Call Trace:
          drain_pages_zone+0x3d/0x50
          drain_local_pages+0x1d/0x30
          on_each_cpu_mask+0x46/0x80
          drain_all_pages+0x14b/0x1e0
          soft_offline_page+0x432/0x6e0
          SyS_madvise+0x73c/0x780
          system_call_fastpath+0x12/0x17
        Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 <0f> 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff
        RIP  [<ffffffff811a806a>] free_pcppages_bulk+0x52a/0x6f0
         RSP <ffff88007a117d28>
        ---[ end trace 53926436e76d1f35 ]---
      
      When soft offline successfully migrates page, the source page is supposed
      to be freed.  But there is a race condition where a source page looks
      isolated (i.e.  the refcount is 0 and the PageHWPoison is set) but
      somewhat linked to pcplist.  Then another soft offline event calls
      drain_all_pages() and tries to free such hwpoisoned page, which is
      forbidden.
      
      This odd page state seems to happen due to the race between put_page() in
      putback_lru_page() and __pagevec_lru_add_fn().  But I don't want to play
      with tweaking drain code as done in commit 9ab3b598 "mm: hwpoison:
      drop lru_add_drain_all() in __soft_offline_page()", or to change page
      freeing code for this soft offline's purpose.
      
      Instead, let's think about the difference between hard offline and soft
      offline.  There is an interesting difference in how to isolate the in-use
      page between these, that is, hard offline marks PageHWPoison of the target
      page at first, and doesn't free it by keeping its refcount 1.  OTOH, soft
      offline tries to free the target page then marks PageHWPoison.  This
      difference might be the source of complexity and result in bugs like the
      above.  So making soft offline isolate with keeping refcount can be a
      solution for this problem.
      
      We can pass to page migration code the "reason" which shows the caller, so
      let's use this more to avoid calling putback_lru_page() when called from
      soft offline, which effectively does the isolation for soft offline.  With
      this change, target pages of soft offline never be reused without changing
      migratetype, so this patch also removes the related code.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      add05cec
  11. 16 4月, 2015 1 次提交
  12. 15 4月, 2015 2 次提交
  13. 13 2月, 2015 2 次提交
    • M
      mm: convert p[te|md]_mknonnuma and remaining page table manipulations · 4d942466
      Mel Gorman 提交于
      With PROT_NONE, the traditional page table manipulation functions are
      sufficient.
      
      [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
      [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NAneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d942466
    • M
      mm: numa: do not dereference pmd outside of the lock during NUMA hinting fault · 5d833062
      Mel Gorman 提交于
      Automatic NUMA balancing depends on being able to protect PTEs to trap a
      fault and gather reference locality information.  Very broadly speaking
      it would mark PTEs as not present and use another bit to distinguish
      between NUMA hinting faults and other types of faults.  It was
      universally loved by everybody and caused no problems whatsoever.  That
      last sentence might be a lie.
      
      This series is very heavily based on patches from Linus and Aneesh to
      replace the existing PTE/PMD NUMA helper functions with normal change
      protections.  I did alter and add parts of it but I consider them
      relatively minor contributions.  At their suggestion, acked-bys are in
      there but I've no problem converting them to Signed-off-by if requested.
      
      AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh
      for that.  I tested trinity under kvm-tool and passed and ran a few
      other basic tests.  At the time of writing, only the short-lived tests
      have completed but testing of V2 indicated that long-term testing had no
      surprises.  In most cases I'm leaving out detail as it's not that
      interesting.
      
      specjbb single JVM: There was negligible performance difference in the
      	benchmark itself for short runs. However, system activity is
      	higher and interrupts are much higher over time -- possibly TLB
      	flushes. Migrations are also higher. Overall, this is more overhead
      	but considering the problems faced with the old approach I think
      	we just have to suck it up and find another way of reducing the
      	overhead.
      
      specjbb multi JVM: Negligible performance difference to the actual benchmark
      	but like the single JVM case, the system overhead is noticeably
      	higher.  Again, interrupts are a major factor.
      
      autonumabench: This was all over the place and about all that can be
      	reasonably concluded is that it's different but not necessarily
      	better or worse.
      
      autonumabench
                                           3.18.0-rc5            3.18.0-rc5
                                       mmotm-20141119         protnone-v3r3
      User    NUMA01               32380.24 (  0.00%)    21642.92 ( 33.16%)
      User    NUMA01_THEADLOCAL    22481.02 (  0.00%)    22283.22 (  0.88%)
      User    NUMA02                3137.00 (  0.00%)     3116.54 (  0.65%)
      User    NUMA02_SMT            1614.03 (  0.00%)     1543.53 (  4.37%)
      System  NUMA01                 322.97 (  0.00%)     1465.89 (-353.88%)
      System  NUMA01_THEADLOCAL       91.87 (  0.00%)       49.32 ( 46.32%)
      System  NUMA02                  37.83 (  0.00%)       14.61 ( 61.38%)
      System  NUMA02_SMT               7.36 (  0.00%)        7.45 ( -1.22%)
      Elapsed NUMA01                 716.63 (  0.00%)      599.29 ( 16.37%)
      Elapsed NUMA01_THEADLOCAL      553.98 (  0.00%)      539.94 (  2.53%)
      Elapsed NUMA02                  83.85 (  0.00%)       83.04 (  0.97%)
      Elapsed NUMA02_SMT              86.57 (  0.00%)       79.15 (  8.57%)
      CPU     NUMA01                4563.00 (  0.00%)     3855.00 ( 15.52%)
      CPU     NUMA01_THEADLOCAL     4074.00 (  0.00%)     4136.00 ( -1.52%)
      CPU     NUMA02                3785.00 (  0.00%)     3770.00 (  0.40%)
      CPU     NUMA02_SMT            1872.00 (  0.00%)     1959.00 ( -4.65%)
      
      System CPU usage of NUMA01 is worse but it's an adverse workload on this
      machine so I'm reluctant to conclude that it's a problem that matters.  On
      the other workloads that are sensible on this machine, system CPU usage is
      great.  Overall time to complete the benchmark is comparable
      
                3.18.0-rc5  3.18.0-rc5
              mmotm-20141119protnone-v3r3
      User        59612.50    48586.44
      System        460.22     1537.45
      Elapsed      1442.20     1304.29
      
      NUMA alloc hit                 5075182     5743353
      NUMA alloc miss                      0           0
      NUMA interleave hit                  0           0
      NUMA alloc local               5075174     5743339
      NUMA base PTE updates        637061448   443106883
      NUMA huge PMD updates          1243434      864747
      NUMA page range updates     1273699656   885857347
      NUMA hint faults               1658116     1214277
      NUMA hint local faults          959487      754113
      NUMA hint local percent             57          62
      NUMA pages migrated            5467056    61676398
      
      The NUMA pages migrated look terrible but when I looked at a graph of the
      activity over time I see that the massive spike in migration activity was
      during NUMA01.  This correlates with high system CPU usage and could be
      simply down to bad luck but any modifications that affect that workload
      would be related to scan rates and migrations, not the protection
      mechanism.  For all other workloads, migration activity was comparable.
      
      Overall, headline performance figures are comparable but the overhead is
      higher, mostly in interrupts.  To some extent, higher overhead from this
      approach was anticipated but not to this degree.  It's going to be
      necessary to reduce this again with a separate series in the future.  It's
      still worth going ahead with this series though as it's likely to avoid
      constant headaches with Xen and is probably easier to maintain.
      
      This patch (of 10):
      
      A transhuge NUMA hinting fault may find the page is migrating and should
      wait until migration completes.  The check is race-prone because the pmd
      is deferenced outside of the page lock and while the race is tiny, it'll
      be larger if the PMD is cleared while marking PMDs for hinting fault.
      This patch closes the race.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d833062
  14. 12 2月, 2015 1 次提交
    • N
      mm/hugetlb: take page table lock in follow_huge_pmd() · e66f17ff
      Naoya Horiguchi 提交于
      We have a race condition between move_pages() and freeing hugepages, where
      move_pages() calls follow_page(FOLL_GET) for hugepages internally and
      tries to get its refcount without preventing concurrent freeing.  This
      race crashes the kernel, so this patch fixes it by moving FOLL_GET code
      for hugepages into follow_huge_pmd() with taking the page table lock.
      
      This patch intentionally removes page==NULL check after pte_page.
      This is justified because pte_page() never returns NULL for any
      architectures or configurations.
      
      This patch changes the behavior of follow_huge_pmd() for tail pages and
      then tail pages can be pinned/returned.  So the caller must be changed to
      properly handle the returned tail pages.
      
      We could have a choice to add the similar locking to
      follow_huge_(addr|pud) for consistency, but it's not necessary because
      currently these functions don't support FOLL_GET flag, so let's leave it
      for future development.
      
      Here is the reproducer:
      
        $ cat movepages.c
        #include <stdio.h>
        #include <stdlib.h>
        #include <numaif.h>
      
        #define ADDR_INPUT      0x700000000000UL
        #define HPS             0x200000
        #define PS              0x1000
      
        int main(int argc, char *argv[]) {
                int i;
                int nr_hp = strtol(argv[1], NULL, 0);
                int nr_p  = nr_hp * HPS / PS;
                int ret;
                void **addrs;
                int *status;
                int *nodes;
                pid_t pid;
      
                pid = strtol(argv[2], NULL, 0);
                addrs  = malloc(sizeof(char *) * nr_p + 1);
                status = malloc(sizeof(char *) * nr_p + 1);
                nodes  = malloc(sizeof(char *) * nr_p + 1);
      
                while (1) {
                        for (i = 0; i < nr_p; i++) {
                                addrs[i] = (void *)ADDR_INPUT + i * PS;
                                nodes[i] = 1;
                                status[i] = 0;
                        }
                        ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                              MPOL_MF_MOVE_ALL);
                        if (ret == -1)
                                err("move_pages");
      
                        for (i = 0; i < nr_p; i++) {
                                addrs[i] = (void *)ADDR_INPUT + i * PS;
                                nodes[i] = 0;
                                status[i] = 0;
                        }
                        ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                              MPOL_MF_MOVE_ALL);
                        if (ret == -1)
                                err("move_pages");
                }
                return 0;
        }
      
        $ cat hugepage.c
        #include <stdio.h>
        #include <sys/mman.h>
        #include <string.h>
      
        #define ADDR_INPUT      0x700000000000UL
        #define HPS             0x200000
      
        int main(int argc, char *argv[]) {
                int nr_hp = strtol(argv[1], NULL, 0);
                char *p;
      
                while (1) {
                        p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                        if (p != (void *)ADDR_INPUT) {
                                perror("mmap");
                                break;
                        }
                        memset(p, 0, nr_hp * HPS);
                        munmap(p, nr_hp * HPS);
                }
        }
      
        $ sysctl vm.nr_hugepages=40
        $ ./hugepage 10 &
        $ ./movepages 10 $(pgrep -f hugepage)
      
      Fixes: e632a938 ("mm: migrate: add hugepage migration code to move_pages()")
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: NHugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e66f17ff
  15. 11 2月, 2015 1 次提交
  16. 17 12月, 2014 1 次提交
  17. 14 12月, 2014 1 次提交
    • H
      mm: unmapped page migration avoid unmap+remap overhead · 2ebba6b7
      Hugh Dickins 提交于
      Page migration's __unmap_and_move(), and rmap's try_to_unmap(), were
      created for use on pages almost certainly mapped into userspace.  But
      nowadays compaction often applies them to unmapped page cache pages: which
      may exacerbate contention on i_mmap_rwsem quite unnecessarily, since
      try_to_unmap_file() makes no preliminary page_mapped() check.
      
      Now check page_mapped() in __unmap_and_move(); and avoid repeating the
      same overhead in rmap_walk_file() - don't remove_migration_ptes() when we
      never inserted any.
      
      (The PageAnon(page) comment blocks now look even sillier than before, but
      clean that up on some other occasion.  And note in passing that
      try_to_unmap_one() does not use a migration entry when PageSwapCache, so
      remove_migration_ptes() will then not update that swap entry to newpage
      pte: not a big deal, but something else to clean up later.)
      
      Davidlohr remarked in "mm,fs: introduce helpers around the i_mmap_mutex"
      conversion to i_mmap_rwsem, that "The biggest winner of these changes is
      migration": a part of the reason might be all of that unnecessary taking
      of i_mmap_mutex in page migration; and it's rather a shame that I didn't
      get around to sending this patch in before his - this one is much less
      useful after Davidlohr's conversion to rwsem, but still good.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ebba6b7
  18. 10 10月, 2014 1 次提交
    • K
      mm/balloon_compaction: redesign ballooned pages management · d6d86c0a
      Konstantin Khlebnikov 提交于
      Sasha Levin reported KASAN splash inside isolate_migratepages_range().
      Problem is in the function __is_movable_balloon_page() which tests
      AS_BALLOON_MAP in page->mapping->flags.  This function has no protection
      against anonymous pages.  As result it tried to check address space flags
      inside struct anon_vma.
      
      Further investigation shows more problems in current implementation:
      
      * Special branch in __unmap_and_move() never works:
        balloon_page_movable() checks page flags and page_count.  In
        __unmap_and_move() page is locked, reference counter is elevated, thus
        balloon_page_movable() always fails.  As a result execution goes to the
        normal migration path.  virtballoon_migratepage() returns
        MIGRATEPAGE_BALLOON_SUCCESS instead of MIGRATEPAGE_SUCCESS,
        move_to_new_page() thinks this is an error code and assigns
        newpage->mapping to NULL.  Newly migrated page lose connectivity with
        balloon an all ability for further migration.
      
      * lru_lock erroneously required in isolate_migratepages_range() for
        isolation ballooned page.  This function releases lru_lock periodically,
        this makes migration mostly impossible for some pages.
      
      * balloon_page_dequeue have a tight race with balloon_page_isolate:
        balloon_page_isolate could be executed in parallel with dequeue between
        picking page from list and locking page_lock.  Race is rare because they
        use trylock_page() for locking.
      
      This patch fixes all of them.
      
      Instead of fake mapping with special flag this patch uses special state of
      page->_mapcount: PAGE_BALLOON_MAPCOUNT_VALUE = -256.  Buddy allocator uses
      PAGE_BUDDY_MAPCOUNT_VALUE = -128 for similar purpose.  Storing mark
      directly in struct page makes everything safer and easier.
      
      PagePrivate is used to mark pages present in page list (i.e.  not
      isolated, like PageLRU for normal pages).  It replaces special rules for
      reference counter and makes balloon migration similar to migration of
      normal pages.  This flag is protected by page_lock together with link to
      the balloon device.
      Signed-off-by: NKonstantin Khlebnikov <k.khlebnikov@samsung.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Link: http://lkml.kernel.org/p/53E6CEAA.9020105@oracle.com
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: <stable@vger.kernel.org>	[3.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6d86c0a
  19. 03 10月, 2014 1 次提交
    • M
      mm: migrate: Close race between migration completion and mprotect · d3cb8bf6
      Mel Gorman 提交于
      A migration entry is marked as write if pte_write was true at the time the
      entry was created. The VMA protections are not double checked when migration
      entries are being removed as mprotect marks write-migration-entries as
      read. It means that potentially we take a spurious fault to mark PTEs write
      again but it's straight-forward. However, there is a race between write
      migrations being marked read and migrations finishing. This potentially
      allows a PTE to be write that should have been read. Close this race by
      double checking the VMA permissions using maybe_mkwrite when migration
      completes.
      
      [torvalds@linux-foundation.org: use maybe_mkwrite]
      Cc: stable@vger.kernel.org
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3cb8bf6
  20. 09 8月, 2014 1 次提交
    • J
      mm: memcontrol: rewrite uncharge API · 0a31bc97
      Johannes Weiner 提交于
      The memcg uncharging code that is involved towards the end of a page's
      lifetime - truncation, reclaim, swapout, migration - is impressively
      complicated and fragile.
      
      Because anonymous and file pages were always charged before they had their
      page->mapping established, uncharges had to happen when the page type
      could still be known from the context; as in unmap for anonymous, page
      cache removal for file and shmem pages, and swap cache truncation for swap
      pages.  However, these operations happen well before the page is actually
      freed, and so a lot of synchronization is necessary:
      
      - Charging, uncharging, page migration, and charge migration all need
        to take a per-page bit spinlock as they could race with uncharging.
      
      - Swap cache truncation happens during both swap-in and swap-out, and
        possibly repeatedly before the page is actually freed.  This means
        that the memcg swapout code is called from many contexts that make
        no sense and it has to figure out the direction from page state to
        make sure memory and memory+swap are always correctly charged.
      
      - On page migration, the old page might be unmapped but then reused,
        so memcg code has to prevent untimely uncharging in that case.
        Because this code - which should be a simple charge transfer - is so
        special-cased, it is not reusable for replace_page_cache().
      
      But now that charged pages always have a page->mapping, introduce
      mem_cgroup_uncharge(), which is called after the final put_page(), when we
      know for sure that nobody is looking at the page anymore.
      
      For page migration, introduce mem_cgroup_migrate(), which is called after
      the migration is successful and the new page is fully rmapped.  Because
      the old page is no longer uncharged after migration, prevent double
      charges by decoupling the page's memcg association (PCG_USED and
      pc->mem_cgroup) from the page holding an actual charge.  The new bits
      PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
      to the new page during migration.
      
      mem_cgroup_migrate() is suitable for replace_page_cache() as well,
      which gets rid of mem_cgroup_replace_page_cache().  However, care
      needs to be taken because both the source and the target page can
      already be charged and on the LRU when fuse is splicing: grab the page
      lock on the charge moving side to prevent changing pc->mem_cgroup of a
      page under migration.  Also, the lruvecs of both pages change as we
      uncharge the old and charge the new during migration, and putback may
      race with us, so grab the lru lock and isolate the pages iff on LRU to
      prevent races and ensure the pages are on the right lruvec afterward.
      
      Swap accounting is massively simplified: because the page is no longer
      uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
      transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
      before the final put_page() in page reclaim.
      
      Finally, page_cgroup changes are now protected by whatever protection the
      page itself offers: anonymous pages are charged under the page table lock,
      whereas page cache insertions, swapin, and migration hold the page lock.
      Uncharging happens under full exclusion with no outstanding references.
      Charging and uncharging also ensure that the page is off-LRU, which
      serializes against charge migration.  Remove the very costly page_cgroup
      lock and set pc->flags non-atomically.
      
      [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
      [vdavydov@parallels.com: fix flags definition]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Tested-by: NJet Chen <jet.chen@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Tested-by: NFelipe Balbi <balbi@ti.com>
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a31bc97
  21. 27 7月, 2014 1 次提交
    • H
      mm: fix direct reclaim writeback regression · 8bdd6380
      Hugh Dickins 提交于
      Shortly before 3.16-rc1, Dave Jones reported:
      
        WARNING: CPU: 3 PID: 19721 at fs/xfs/xfs_aops.c:971
                 xfs_vm_writepage+0x5ce/0x630 [xfs]()
        CPU: 3 PID: 19721 Comm: trinity-c61 Not tainted 3.15.0+ #3
        Call Trace:
          xfs_vm_writepage+0x5ce/0x630 [xfs]
          shrink_page_list+0x8f9/0xb90
          shrink_inactive_list+0x253/0x510
          shrink_lruvec+0x563/0x6c0
          shrink_zone+0x3b/0x100
          shrink_zones+0x1f1/0x3c0
          try_to_free_pages+0x164/0x380
          __alloc_pages_nodemask+0x822/0xc90
          alloc_pages_vma+0xaf/0x1c0
          handle_mm_fault+0xa31/0xc50
        etc.
      
       970   if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
       971                   PF_MEMALLOC))
      
      I did not respond at the time, because a glance at the PageDirty block
      in shrink_page_list() quickly shows that this is impossible: we don't do
      writeback on file pages (other than tmpfs) from direct reclaim nowadays.
      Dave was hallucinating, but it would have been disrespectful to say so.
      
      However, my own /var/log/messages now shows similar complaints
      
        WARNING: CPU: 1 PID: 28814 at fs/ext4/inode.c:1881 ext4_writepage+0xa7/0x38b()
        WARNING: CPU: 0 PID: 27347 at fs/ext4/inode.c:1764 ext4_writepage+0xa7/0x38b()
      
      from stressing some mmotm trees during July.
      
      Could a dirty xfs or ext4 file page somehow get marked PageSwapBacked,
      so fail shrink_page_list()'s page_is_file_cache() test, and so proceed
      to mapping->a_ops->writepage()?
      
      Yes, 3.16-rc1's commit 68711a74 ("mm, migration: add destination
      page freeing callback") has provided such a way to compaction: if
      migrating a SwapBacked page fails, its newpage may be put back on the
      list for later use with PageSwapBacked still set, and nothing will clear
      it.
      
      Whether that can do anything worse than issue WARN_ON_ONCEs, and get
      some statistics wrong, is unclear: easier to fix than to think through
      the consequences.
      
      Fixing it here, before the put_new_page(), addresses the bug directly,
      but is probably the worst place to fix it.  Page migration is doing too
      many parts of the job on too many levels: fixing it in
      move_to_new_page() to complement its SetPageSwapBacked would be
      preferable, except why is it (and newpage->mapping and newpage->index)
      done there, rather than down in migrate_page_move_mapping(), once we are
      sure of success? Not a cleanup to get into right now, especially not
      with memcg cleanups coming in 3.17.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8bdd6380
  22. 24 6月, 2014 1 次提交
    • H
      mm: let mm_find_pmd fix buggy race with THP fault · f72e7dcd
      Hugh Dickins 提交于
      Trinity has reported:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
          IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
          CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G        W
                                  3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
          lock_acquire (arch/x86/include/asm/current.h:14
                        kernel/locking/lockdep.c:3602)
          _raw_spin_lock (include/linux/spinlock_api_smp.h:143
                          kernel/locking/spinlock.c:151)
          remove_migration_pte (mm/migrate.c:137)
          rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
          remove_migration_ptes (mm/migrate.c:224)
          migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
          migrate_misplaced_page (mm/migrate.c:1733)
          __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
          handle_mm_fault (mm/memory.c:3948)
          __get_user_pages (mm/memory.c:1851)
          __mlock_vma_pages_range (mm/mlock.c:255)
          __mm_populate (mm/mlock.c:711)
          SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
      
      I believe this comes about because, whereas collapsing and splitting THP
      functions take anon_vma lock in write mode (which excludes concurrent
      rmap walks), faulting THP functions (write protection and misplaced
      NUMA) do not - and mostly they do not need to.
      
      But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
      an instant (indeed, for a long instant, given the inter-CPU TLB flush in
      there), leaves *pmd neither present not trans_huge.
      
      Which can confuse a concurrent rmap walk, as when removing migration
      ptes, seen in the dumped trace.  Although that rmap walk has a 4k page
      to insert, anon_vmas containing THPs are in no way segregated from
      4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
      that instant when a trans_huge pmd is temporarily absent.
      
      I don't think we need strengthen the locking at the THP end: it's easily
      handled with an ACCESS_ONCE() before testing both conditions.
      
      And since mm_find_pmd() had only one caller who wanted a THP rather than
      a pmd, let's slightly repurpose it to fail when it hits a THP or
      non-present pmd, and open code split_huge_page_address() again.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Lameter <cl@gentwo.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f72e7dcd
  23. 05 6月, 2014 1 次提交