1. 04 5月, 2017 2 次提交
    • M
      mm: make ttu's return boolean · 666e5a40
      Minchan Kim 提交于
      try_to_unmap() returns SWAP_SUCCESS or SWAP_FAIL so it's suitable for
      boolean return.  This patch changes it.
      
      Link: http://lkml.kernel.org/r/1489555493-14659-8-git-send-email-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      666e5a40
    • S
      mm: delete unnecessary TTU_* flags · a128ca71
      Shaohua Li 提交于
      Patch series "mm: fix some MADV_FREE issues", v5.
      
      We are trying to use MADV_FREE in jemalloc.  Several issues are found.
      Without solving the issues, jemalloc can't use the MADV_FREE feature.
      
       - Doesn't support system without swap enabled. Because if swap is off,
         we can't or can't efficiently age anonymous pages. And since
         MADV_FREE pages are mixed with other anonymous pages, we can't
         reclaim MADV_FREE pages. In current implementation, MADV_FREE will
         fallback to MADV_DONTNEED without swap enabled. But in our
         environment, a lot of machines don't enable swap. This will prevent
         our setup using MADV_FREE.
      
       - Increases memory pressure. page reclaim bias file pages reclaim
         against anonymous pages. This doesn't make sense for MADV_FREE pages,
         because those pages could be freed easily and refilled with very
         slight penality. Even page reclaim doesn't bias file pages, there is
         still an issue, because MADV_FREE pages and other anonymous pages are
         mixed together. To reclaim a MADV_FREE page, we probably must scan a
         lot of other anonymous pages, which is inefficient. In our test, we
         usually see oom with MADV_FREE enabled and nothing without it.
      
       - Accounting. There are two accounting problems. We don't have a global
         accounting. If the system is abnormal, we don't know if it's a
         problem from MADV_FREE side. The other problem is RSS accounting.
         MADV_FREE pages are accounted as normal anon pages and reclaimed
         lazily, so application's RSS becomes bigger. This confuses our
         workloads. We have monitoring daemon running and if it finds
         applications' RSS becomes abnormal, the daemon will kill the
         applications even kernel can reclaim the memory easily.
      
      To address the first the two issues, we can either put MADV_FREE pages
      into a separate LRU list (Minchan's previous patches and V1 patches), or
      put them into LRU_INACTIVE_FILE list (suggested by Johannes).  The
      patchset use the second idea.  The reason is LRU_INACTIVE_FILE list is
      tiny nowadays and should be full of used once file pages.  So we can
      still efficiently reclaim MADV_FREE pages there without interference
      with other anon and active file pages.  Putting the pages into inactive
      file list also has an advantage which allows page reclaim to prioritize
      MADV_FREE pages and used once file pages.  MADV_FREE pages are put into
      the lru list and clear SwapBacked flag, so PageAnon(page) &&
      !PageSwapBacked(page) will indicate a MADV_FREE pages.  These pages will
      directly freed without pageout if they are clean, otherwise normal swap
      will reclaim them.
      
      For the third issue, the previous post adds global accounting and a
      separate RSS count for MADV_FREE pages.  The problem is we never get
      accurate accounting for MADV_FREE pages.  The pages are mapped to
      userspace, can be dirtied without notice from kernel side.  To get
      accurate accounting, we could write protect the page, but then there is
      extra page fault overhead, which people don't want to pay.  Jemalloc
      guys have concerns about the inaccurate accounting, so this post drops
      the accounting patches temporarily.  The info exported to
      /proc/pid/smaps for MADV_FREE pages are kept, which is the only place we
      can get accurate accounting right now.
      
      This patch (of 6):
      
      Johannes pointed out TTU_LZFREE is unnecessary.  It's true because we
      always have the flag set if we want to do an unmap.  For cases we don't
      do an unmap, the TTU_LZFREE part of code should never run.
      
      Also the TTU_UNMAP is unnecessary.  If no other flags set (for example,
      TTU_MIGRATION), an unmap is implied.
      
      The patch includes Johannes's cleanup and dead TTU_ACTION macro removal
      code
      
      Link: http://lkml.kernel.org/r/4be3ea1bc56b26fd98a54d0a6f70bec63f6d8980.1487965799.git.shli@fb.comSigned-off-by: NShaohua Li <shli@fb.com>
      Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a128ca71
  2. 02 3月, 2017 2 次提交
  3. 25 2月, 2017 1 次提交
  4. 26 12月, 2016 1 次提交
  5. 12 11月, 2016 1 次提交
  6. 29 7月, 2016 2 次提交
    • N
      mm: hwpoison: remove incorrect comments · 7c7fd825
      Naoya Horiguchi 提交于
      dequeue_hwpoisoned_huge_page() can be called without page lock hold, so
      let's remove incorrect comment.
      
      The reason why the page lock is not really needed is that
      dequeue_hwpoisoned_huge_page() checks page_huge_active() inside
      hugetlb_lock, which allows us to avoid trying to dequeue a hugepage that
      are just allocated but not linked to active list yet, even without
      taking page lock.
      
      Link: http://lkml.kernel.org/r/20160720092901.GA15995@www9186uo.sakura.ne.jpSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: NZhan Chen <zhanc1@andrew.cmu.edu>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c7fd825
    • M
      mm, vmscan: move LRU lists to node · 599d0c95
      Mel Gorman 提交于
      This moves the LRU lists from the zone to the node and related data such
      as counters, tracing, congestion tracking and writeback tracking.
      
      Unfortunately, due to reclaim and compaction retry logic, it is
      necessary to account for the number of LRU pages on both zone and node
      logic.  Most reclaim logic is based on the node counters but the retry
      logic uses the zone counters which do not distinguish inactive and
      active sizes.  It would be possible to leave the LRU counters on a
      per-zone basis but it's a heavier calculation across multiple cache
      lines that is much more frequent than the retry checks.
      
      Other than the LRU counters, this is mostly a mechanical patch but note
      that it introduces a number of anomalies.  For example, the scans are
      per-zone but using per-node counters.  We also mark a node as congested
      when a zone is congested.  This causes weird problems that are fixed
      later but is easier to review.
      
      In the event that there is excessive overhead on 32-bit systems due to
      the nodes being on LRU then there are two potential solutions
      
      1. Long-term isolation of highmem pages when reclaim is lowmem
      
         When pages are skipped, they are immediately added back onto the LRU
         list. If lowmem reclaim persisted for long periods of time, the same
         highmem pages get continually scanned. The idea would be that lowmem
         keeps those pages on a separate list until a reclaim for highmem pages
         arrives that splices the highmem pages back onto the LRU. It potentially
         could be implemented similar to the UNEVICTABLE list.
      
         That would reduce the skip rate with the potential corner case is that
         highmem pages have to be scanned and reclaimed to free lowmem slab pages.
      
      2. Linear scan lowmem pages if the initial LRU shrink fails
      
         This will break LRU ordering but may be preferable and faster during
         memory pressure than skipping LRU pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      599d0c95
  7. 21 5月, 2016 1 次提交
  8. 29 4月, 2016 1 次提交
  9. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  10. 18 3月, 2016 1 次提交
  11. 16 3月, 2016 1 次提交
  12. 16 1月, 2016 6 次提交
    • N
      mm: soft-offline: exit with failure for non anonymous thp · 98fd1ef4
      Naoya Horiguchi 提交于
      Currently memory_failure() doesn't handle non anonymous thp case,
      because we can hardly expect the error handling to be successful, and it
      can just hit some corner case which results in BUG_ON or something
      severe like that.  This is also the case for soft offline code, so let's
      make it in the same way.
      
      Orignal code has a MF_COUNT_INCREASED check before put_hwpoison_page(),
      but it's unnecessary because get_any_page() is already called when
      running on this code, which takes a refcount of the target page
      regardress of the flag.  So this patch also removes it.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98fd1ef4
    • N
      mm: soft-offline: clean up soft_offline_page() · acc14dc4
      Naoya Horiguchi 提交于
      soft_offline_page() has some deeply indented code, that's the sign of
      demand for cleanup.  So let's do this.  No functionality change.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      acc14dc4
    • N
      mm: hwpoison: adjust for new thp refcounting · 4e41a30c
      Naoya Horiguchi 提交于
      Some mm-related BUG_ON()s could trigger from hwpoison code due to recent
      changes in thp refcounting rule.  This patch fixes them up.
      
      In the new refcounting, we no longer use tail->_mapcount to keep tail's
      refcount, and thereby we can simplify get/put_hwpoison_page().
      
      And another change is that tail's refcount is not transferred to the raw
      page during thp split (more precisely, in new rule we don't take
      refcount on tail page any more.) So when we need thp split, we have to
      transfer the refcount properly to the 4kB soft-offlined page before
      migration.
      
      thp split code goes into core code only when precheck
      (total_mapcount(head) == page_count(head) - 1) passes to avoid useless
      split, where we assume that one refcount is held by the caller of thp
      split and the others are taken via mapping.  To meet this assumption,
      this patch moves thp split part in soft_offline_page() after
      get_any_page().
      
      [akpm@linux-foundation.org: remove unneeded #define, per Kirill]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NKirill A.  Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e41a30c
    • N
      mm: soft-offline: check return value in second __get_any_page() call · d96b339f
      Naoya Horiguchi 提交于
      I saw the following BUG_ON triggered in a testcase where a process calls
      madvise(MADV_SOFT_OFFLINE) on thps, along with a background process that
      calls migratepages command repeatedly (doing ping-pong among different
      NUMA nodes) for the first process:
      
         Soft offlining page 0x60000 at 0x700000600000
         __get_any_page: 0x60000 free buddy page
         page:ffffea0001800000 count:0 mapcount:-127 mapping:          (null) index:0x1
         flags: 0x1fffc0000000000()
         page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
         ------------[ cut here ]------------
         kernel BUG at /src/linux-dev/include/linux/mm.h:342!
         invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
         Modules linked in: cfg80211 rfkill crc32c_intel serio_raw virtio_balloon i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi
         CPU: 3 PID: 3035 Comm: test_alloc_gene Tainted: G           O    4.4.0-rc8-v4.4-rc8-160107-1501-00000-rc8+ #74
         Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
         task: ffff88007c63d5c0 ti: ffff88007c210000 task.ti: ffff88007c210000
         RIP: 0010:[<ffffffff8118998c>]  [<ffffffff8118998c>] put_page+0x5c/0x60
         RSP: 0018:ffff88007c213e00  EFLAGS: 00010246
         Call Trace:
           put_hwpoison_page+0x4e/0x80
           soft_offline_page+0x501/0x520
           SyS_madvise+0x6bc/0x6f0
           entry_SYSCALL_64_fastpath+0x12/0x6a
         Code: 8b fc ff ff 5b 5d c3 48 89 df e8 b0 fa ff ff 48 89 df 31 f6 e8 c6 7d ff ff 5b 5d c3 48 c7 c6 08 54 a2 81 48 89 df e8 a4 c5 01 00 <0f> 0b 66 90 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 8b 47
         RIP  [<ffffffff8118998c>] put_page+0x5c/0x60
          RSP <ffff88007c213e00>
      
      The root cause resides in get_any_page() which retries to get a refcount
      of the page to be soft-offlined.  This function calls
      put_hwpoison_page(), expecting that the target page is putback to LRU
      list.  But it can be also freed to buddy.  So the second check need to
      care about such case.
      
      Fixes: af8fae7c ("mm/memory-failure.c: clean up soft_offline_page()")
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[3.9+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d96b339f
    • K
      thp, mm: split_huge_page(): caller need to lock page · 4d2fa965
      Kirill A. Shutemov 提交于
      We're going to use migration entries instead of compound_lock() to
      stabilize page refcounts.  Setup and remove migration entries require
      page to be locked.
      
      Some of split_huge_page() callers already have the page locked.  Let's
      require everybody to lock the page before calling split_huge_page().
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d2fa965
    • K
      page-flags: define PG_locked behavior on compound pages · 48c935ad
      Kirill A. Shutemov 提交于
      lock_page() must operate on the whole compound page.  It doesn't make
      much sense to lock part of compound page.  Change code to use head
      page's PG_locked, if tail page is passed.
      
      This patch also gets rid of custom helper functions --
      __set_page_locked() and __clear_page_locked().  They are replaced with
      helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG.  Tail pages to these
      helper would trigger VM_BUG_ON().
      
      SLUB uses PG_locked as a bit spin locked.  IIUC, tail pages should never
      appear there.  VM_BUG_ON() is added to make sure that this assumption is
      correct.
      
      [akpm@linux-foundation.org: fix fs/cifs/file.c]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      48c935ad
  13. 07 11月, 2015 1 次提交
    • K
      mm: make compound_head() robust · 1d798ca3
      Kirill A. Shutemov 提交于
      Hugh has pointed that compound_head() call can be unsafe in some
      context. There's one example:
      
      	CPU0					CPU1
      
      isolate_migratepages_block()
        page_count()
          compound_head()
            !!PageTail() == true
      					put_page()
      					  tail->first_page = NULL
            head = tail->first_page
      					alloc_pages(__GFP_COMP)
      					   prep_compound_page()
      					     tail->first_page = head
      					     __SetPageTail(p);
            !!PageTail() == true
          <head == NULL dereferencing>
      
      The race is pure theoretical. I don't it's possible to trigger it in
      practice. But who knows.
      
      We can fix the race by changing how encode PageTail() and compound_head()
      within struct page to be able to update them in one shot.
      
      The patch introduces page->compound_head into third double word block in
      front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
      the rest bits are pointer to head page if bit zero is set.
      
      The patch moves page->pmd_huge_pte out of word, just in case if an
      architecture defines pgtable_t into something what can have the bit 0
      set.
      
      hugetlb_cgroup uses page->lru.next in the second tail page to store
      pointer struct hugetlb_cgroup. The patch switch it to use page->private
      in the second tail page instead. The space is free since ->first_page is
      removed from the union.
      
      The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
      limitation, since there's now space in first tail page to store struct
      hugetlb_cgroup pointer. But that's out of scope of the patch.
      
      That means page->compound_head shares storage space with:
      
       - page->lru.next;
       - page->next;
       - page->rcu_head.next;
      
      That's too long list to be absolutely sure, but looks like nobody uses
      bit 0 of the word.
      
      page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use
      call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future
      call_rcu_lazy() is not allowed as it makes use of the bit and we can
      get false positive PageTail().
      
      [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d798ca3
  14. 06 11月, 2015 1 次提交
  15. 11 9月, 2015 1 次提交
    • V
      hwpoison: use page_cgroup_ino for filtering by memcg · 94a59fb3
      Vladimir Davydov 提交于
      Hwpoison allows to filter pages by memory cgroup ino.  Currently, it
      calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
      then its ino using cgroup_ino, but now we have a helper method for
      that, page_cgroup_ino, so use it instead.
      
      This patch also loosens the hwpoison memcg filter dependency rules - it
      makes it depend on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP, because
      hwpoison memcg filter does not require anything (nor it used to) from
      CONFIG_MEMCG_SWAP side.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      94a59fb3
  16. 09 9月, 2015 9 次提交
    • V
      mm: rename alloc_pages_exact_node() to __alloc_pages_node() · 96db800f
      Vlastimil Babka 提交于
      alloc_pages_exact_node() was introduced in commit 6484eb3e ("page
      allocator: do not check NUMA node ID when the caller knows the node is
      valid") as an optimized variant of alloc_pages_node(), that doesn't
      fallback to current node for nid == NUMA_NO_NODE.  Unfortunately the
      name of the function can easily suggest that the allocation is
      restricted to the given node and fails otherwise.  In truth, the node is
      only preferred, unless __GFP_THISNODE is passed among the gfp flags.
      
      The misleading name has lead to mistakes in the past, see for example
      commits 5265047a ("mm, thp: really limit transparent hugepage
      allocation to local node") and b360edb4 ("mm, mempolicy:
      migrate_to_node should only migrate to node").
      
      Another issue with the name is that there's a family of
      alloc_pages_exact*() functions where 'exact' means exact size (instead
      of page order), which leads to more confusion.
      
      To prevent further mistakes, this patch effectively renames
      alloc_pages_exact_node() to __alloc_pages_node() to better convey that
      it's an optimized variant of alloc_pages_node() not intended for general
      usage.  Both functions get described in comments.
      
      It has been also considered to really provide a convenience function for
      allocations restricted to a node, but the major opinion seems to be that
      __GFP_THISNODE already provides that functionality and we shouldn't
      duplicate the API needlessly.  The number of users would be small
      anyway.
      
      Existing callers of alloc_pages_exact_node() are simply converted to
      call __alloc_pages_node(), with the exception of sba_alloc_coherent()
      which open-codes the check for NUMA_NO_NODE, so it is converted to use
      alloc_pages_node() instead.  This means it no longer performs some
      VM_BUG_ON checks, and since the current check for nid in
      alloc_pages_node() uses a 'nid < 0' comparison (which includes
      NUMA_NO_NODE), it may hide wrong values which would be previously
      exposed.
      
      Both differences will be rectified by the next patch.
      
      To sum up, this patch makes no functional changes, except temporarily
      hiding potentially buggy callers.  Restricting the checks in
      alloc_pages_node() is left for the next patch which can in turn expose
      more existing buggy callers.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NRobin Holt <robinmholt@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cliff Whickman <cpw@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96db800f
    • N
      mm/hwpoison: don't try to unpoison containment-failed pages · 230ac719
      Naoya Horiguchi 提交于
      memory_failure() can be called at any page at any time, which means that
      we can't eliminate the possibility of containment failure.  In such case
      the best option is to leak the page intentionally (and never touch it
      later.)
      
      We have an unpoison function for testing, and it cannot handle such
      containment-failed pages, which results in kernel panic (visible with
      various calltraces.) So this patch suggests that we limit the
      unpoisonable pages to properly contained pages and ignore any other
      ones.
      
      Testers are recommended to keep in mind that there're un-unpoisonable
      pages when writing test programs.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Tested-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      230ac719
    • W
      mm/hwpoison: fix race between soft_offline_page and unpoison_memory · da1b13cc
      Wanpeng Li 提交于
      Wanpeng Li reported a race between soft_offline_page() and
      unpoison_memory(), which causes the following kernel panic:
      
         BUG: Bad page state in process bash  pfn:97000
         page:ffffea00025c0000 count:0 mapcount:1 mapping:          (null) index:0x7f4fdbe00
         flags: 0x1fffff80080048(uptodate|active|swapbacked)
         page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
         bad because of flags:
         flags: 0x40(active)
         Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core
         CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45
         Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
         Call Trace:
           dump_stack+0x48/0x5c
           bad_page+0xe6/0x140
           free_pages_prepare+0x2f9/0x320
           ? uncharge_list+0xdd/0x100
           free_hot_cold_page+0x40/0x170
           __put_single_page+0x20/0x30
           put_page+0x25/0x40
           unmap_and_move+0x1a6/0x1f0
           migrate_pages+0x100/0x1d0
           ? kill_procs+0x100/0x100
           ? unlock_page+0x6f/0x90
           __soft_offline_page+0x127/0x2a0
           soft_offline_page+0xa6/0x200
      
      This race is explained like below:
      
        CPU0                    CPU1
      
        soft_offline_page
        __soft_offline_page
        TestSetPageHWPoison
                              unpoison_memory
                              PageHWPoison check (true)
                              TestClearPageHWPoison
                              put_page    -> release refcount held by get_hwpoison_page in unpoison_memory
                              put_page    -> release refcount held by isolate_lru_page in __soft_offline_page
        migrate_pages
      
      The second put_page() releases refcount held by isolate_lru_page() which
      will lead to unmap_and_move() releases the last refcount of page and w/
      mapcount still 1 since try_to_unmap() is not called if there is only one
      user map the page.  Anyway, the page refcount and mapcount will still
      mess if the page is mapped by multiple users.
      
      This race was introduced by commit 4491f712 ("mm/memory-failure: set
      PageHWPoison before migrate_pages()"), which focuses on preventing the
      reuse of successfully migrated page.  Before this commit we prevent the
      reuse by changing the migratetype to MIGRATE_ISOLATE during soft
      offlining, which has the following problems, so simply reverting the
      commit is not a best option:
      
        1) it doesn't eliminate the reuse completely, because
           set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the
           target page if the pageblock of the page contains one or more
           unmovable pages (i.e.  has_unmovable_pages() returns true).
      
        2) the original code changes migratetype to MIGRATE_ISOLATE
           forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline,
           regardless of the original migratetype state, which could impact
           other subsystems like memory hotplug or compaction.
      
      This patch moves PageSetHWPoison just after put_page() in
      unmap_and_move(), which closes up the reported race window and minimizes
      another race window b/w SetPageHWPoison and reallocation (which causes
      the reuse of soft-offlined page.) The latter race window still exists
      but it's acceptable, because it's rare and effectively the same as
      ordinary "containment failure" case even if it happens, so keep the
      window open is acceptable.
      
      Fixes: 4491f712 ("mm/memory-failure: set PageHWPoison before migrate_pages()")
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Tested-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da1b13cc
    • N
      mm/hwpoison: introduce num_poisoned_pages wrappers · 8e30456b
      Naoya Horiguchi 提交于
      num_poisoned_pages counter will be changed outside mm/memory-failure.c
      by a subsequent patch, so this patch prepares wrappers to manipulate it.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Tested-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e30456b
    • W
      mm/hwpoison: replace most of put_page in memory error handling by put_hwpoison_page · 665d9da7
      Wanpeng Li 提交于
      Replace most instances of put_page() in memory error handling with
      put_hwpoison_page().
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      665d9da7
    • W
      mm/hwpoison: introduce put_hwpoison_page to put refcount for memory error handling · 94bf4ec8
      Wanpeng Li 提交于
      Introduce put_hwpoison_page to put refcount for memory error handling.
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Suggested-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      94bf4ec8
    • W
      mm/hwpoison: fix PageHWPoison test/set race · 1e0e635b
      Wanpeng Li 提交于
      There is a race between madvise_hwpoison path and memory_failure:
      
       CPU0					CPU1
      
      madvise_hwpoison
      get_user_pages_fast
      PageHWPoison check (false)
      					memory_failure
      					TestSetPageHWPoison
      soft_offline_page
      PageHWPoison check (true)
      return -EBUSY (without put_page)
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Suggested-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e0e635b
    • W
      mm/hwpoison: fix failure to split thp w/ refcount held · 7d1900c7
      Wanpeng Li 提交于
      THP pages will get a refcount in madvise_hwpoison() w/
      MF_COUNT_INCREASED flag, however, the refcount is still held when fail
      to split THP pages.
      
      Fix it by reducing the refcount of THP pages when fail to split THP.
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d1900c7
    • M
      memcg: export struct mem_cgroup · 33398cf2
      Michal Hocko 提交于
      mem_cgroup structure is defined in mm/memcontrol.c currently which means
      that the code outside of this file has to use external API even for
      trivial access stuff.
      
      This patch exports mm_struct with its dependencies and makes some of the
      exported functions inlines.  This even helps to reduce the code size a bit
      (make defconfig + CONFIG_MEMCG=y)
      
        text		data    bss     dec     	 hex 	filename
        12355346        1823792 1089536 15268674         e8fb42 vmlinux.before
        12354970        1823792 1089536 15268298         e8f9ca vmlinux.after
      
      This is not much (370B) but better than nothing.
      
      We also save a function call in some hot paths like callers of
      mem_cgroup_count_vm_event which is used for accounting.
      
      The patch doesn't introduce any functional changes.
      
      [vdavykov@parallels.com: inline memcg_kmem_is_active]
      [vdavykov@parallels.com: do not expose type outside of CONFIG_MEMCG]
      [akpm@linux-foundation.org: memcontrol.h needs eventfd.h for eventfd_ctx]
      [akpm@linux-foundation.org: export mem_cgroup_from_task() to modules]
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
      Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33398cf2
  17. 15 8月, 2015 3 次提交
    • W
      mm/hwpoison: fix panic due to split huge zero page · 7f6bf39b
      Wanpeng Li 提交于
      Bug:
      
        ------------[ cut here ]------------
        kernel BUG at mm/huge_memory.c:1957!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 snd_hda_codec_realtek snd_hda_codec_generic nfsv4 dns_re
        CPU: 2 PID: 2576 Comm: test_huge Not tainted 4.2.0-rc5-mm1+ #27
        Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
        task: ffff880204e3d600 ti: ffff8800db16c000 task.ti: ffff8800db16c000
        RIP: split_huge_page_to_list+0xdb/0x120
        Call Trace:
          memory_failure+0x32e/0x7c0
          madvise_hwpoison+0x8b/0x160
          SyS_madvise+0x40/0x240
          ? do_page_fault+0x37/0x90
          entry_SYSCALL_64_fastpath+0x12/0x71
        Code: ff f0 41 ff 4c 24 30 74 0d 31 c0 48 83 c4 08 5b 41 5c 41 5d c9 c3 4c 89 e7 e8 e2 58 fd ff 48 83 c4 08 31 c0
        RIP  split_huge_page_to_list+0xdb/0x120
         RSP <ffff8800db16fde8>
        ---[ end trace aee7ce0df8e44076 ]---
      
      Testcase:
      
          #define _GNU_SOURCE
          #include <stdlib.h>
          #include <stdio.h>
          #include <sys/mman.h>
          #include <unistd.h>
          #include <fcntl.h>
          #include <sys/types.h>
          #include <errno.h>
          #include <string.h>
      
          #define MB 1024*1024
      
          int main(void)
          {
                  char *mem;
      
                  posix_memalign((void **)&mem, 2 * MB, 200 * MB);
      
                  madvise(mem, 200 * MB, MADV_HWPOISON);
      
                  free(mem);
      
                  return 0;
          }
      
      Huge zero page is allocated if page fault w/o FAULT_FLAG_WRITE flag.
      The get_user_pages_fast() which called in madvise_hwpoison() will get
      huge zero page if the page is not allocated before.  Huge zero page is a
      tranparent huge page, however, it is not an anonymous page.
      memory_failure will split the huge zero page and trigger
      BUG_ON(is_huge_zero_page(page));
      
      After commit 98ed2b00 ("mm/memory-failure: give up error handling
      for non-tail-refcounted thp"), memory_failure will not catch non anon
      thp from madvise_hwpoison path and this bug occur.
      
      Fix it by catching non anon thp in memory_failure in order to not split
      huge zero page in madvise_hwpoison path.
      
      After this patch:
      
        Injecting memory failure for page 0x202800 at 0x7fd8ae800000
        MCE: 0x202800: non anonymous thp
        [...]
      
      [akpm@linux-foundation.org: remove second split, per Wanpeng]
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f6bf39b
    • W
      mm/hwpoison: fix fail isolate hugetlbfs page w/ refcount held · 03613808
      Wanpeng Li 提交于
      Hugetlbfs pages will get a refcount in get_any_page() or
      madvise_hwpoison() if soft offlining through madvise.  The refcount which
      is held by the soft offline path should be released if we fail to isolate
      hugetlbfs pages.
      
      Fix it by reducing the refcount for both isolation success and failure.
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>	[3.9+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      03613808
    • W
      mm/hwpoison: fix page refcount of unknown non LRU page · 4f32be67
      Wanpeng Li 提交于
      After trying to drain pages from pagevec/pageset, we try to get reference
      count of the page again, however, the reference count of the page is not
      reduced if the page is still not on LRU list.
      
      Fix it by adding the put_page() to drop the page reference which is from
      __get_any_page().
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>	[3.9+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f32be67
  18. 07 8月, 2015 4 次提交
  19. 25 6月, 2015 1 次提交