1. 18 3月, 2016 3 次提交
    • K
      thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers · fec89c10
      Kirill A. Shutemov 提交于
      freeze_page() and unfreeze_page() helpers evolved in rather complex
      beasts.  It would be nice to cut complexity of this code.
      
      This patch rewrites freeze_page() using standard try_to_unmap().
      unfreeze_page() is rewritten with remove_migration_ptes().
      
      The result is much simpler.
      
      But the new variant is somewhat slower for PTE-mapped THPs.  Current
      helpers iterates over VMAs the compound page is mapped to, and then over
      ptes within this VMA.  New helpers iterates over small page, then over
      VMA the small page mapped to, and only then find relevant pte.
      
      We have short cut for PMD-mapped THP: we directly install migration
      entries on PMD split.
      
      I don't think the slowdown is critical, considering how much simpler
      result is and that split_huge_page() is quite rare nowadays.  It only
      happens due memory pressure or migration.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fec89c10
    • K
      rmap: extend try_to_unmap() to be usable by split_huge_page() · 2a52bcbc
      Kirill A. Shutemov 提交于
      Add support for two ttu_flags:
      
        - TTU_SPLIT_HUGE_PMD would split PMD if it's there, before trying to
          unmap page;
      
        - TTU_RMAP_LOCKED indicates that caller holds relevant rmap lock;
      
      Also, change rwc->done to !page_mapcount() instead of !page_mapped().
      try_to_unmap() works on pte level, so we are really interested in the
      mappedness of this small page rather than of the compound page it's a
      part of.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a52bcbc
    • K
      rmap: introduce rmap_walk_locked() · b9773199
      Kirill A. Shutemov 提交于
      This patchset rewrites freeze_page() and unfreeze_page() using
      try_to_unmap() and remove_migration_ptes().  Result is much simpler, but
      somewhat slower.
      
      Migration 8GiB worth of PMD-mapped THP:
      
        Baseline	20.21 +/- 0.393
        Patched	20.73 +/- 0.082
        Slowdown	1.03x
      
      It's 3% slower, comparing to 14% in v1.  I don't it should be a stopper.
      
      Splitting of PTE-mapped pages slowed more.  But this is not a common
      case.
      
      Migration 8GiB worth of PMD-mapped THP:
      
        Baseline	20.39 +/- 0.225
        Patched	22.43 +/- 0.496
        Slowdown	1.10x
      
      rmap_walk_locked() is the same as rmap_walk(), but the caller takes care
      of the relevant rmap lock.
      
      This is preparation for switching THP splitting from custom rmap walk in
      freeze_page()/unfreeze_page() to the generic one.
      
      There is no support for KSM pages for now: not clear which lock is
      implied.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9773199
  2. 16 3月, 2016 2 次提交
  3. 16 1月, 2016 10 次提交
    • K
      mm: fix locking order in mm_take_all_locks() · 88f306b6
      Kirill A. Shutemov 提交于
      Dmitry Vyukov has reported[1] possible deadlock (triggered by his
      syzkaller fuzzer):
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&hugetlbfs_i_mmap_rwsem_key);
                                     lock(&mapping->i_mmap_rwsem);
                                     lock(&hugetlbfs_i_mmap_rwsem_key);
        lock(&mapping->i_mmap_rwsem);
      
      Both traces points to mm_take_all_locks() as a source of the problem.
      It doesn't take care about ordering or hugetlbfs_i_mmap_rwsem_key (aka
      mapping->i_mmap_rwsem for hugetlb mapping) vs.  i_mmap_rwsem.
      
      huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key
      and allocator can take i_mmap_rwsem if it hit reclaim.  So we need to
      take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem from
      rest of VMAs.
      
      The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key.
      
      [1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88f306b6
    • M
      mm: support madvise(MADV_FREE) · 854e9ed0
      Minchan Kim 提交于
      Linux doesn't have an ability to free pages lazy while other OS already
      have been supported that named by madvise(MADV_FREE).
      
      The gain is clear that kernel can discard freed pages rather than
      swapping out or OOM if memory pressure happens.
      
      Without memory pressure, freed pages would be reused by userspace
      without another additional overhead(ex, page fault + allocation +
      zeroing).
      
      Jason Evans said:
      
      : Facebook has been using MAP_UNINITIALIZED
      : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
      : several years, but there are operational costs to maintaining this
      : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
      : in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
      : increased throughput for much of our workload by ~5%, and although the
      : benefit has decreased using newer hardware and kernels, there is still
      : enough benefit that we cannot reasonably retire it without a replacement.
      :
      : Aside from Facebook operations, there are numerous broadly used
      : applications that would benefit from MADV_FREE.  The ones that immediately
      : come to mind are redis, varnish, and MariaDB.  I don't have much insight
      : into Android internals and development process, but I would hope to see
      : MADV_FREE support eventually end up there as well to benefit applications
      : linked with the integrated jemalloc.
      :
      : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
      : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
      : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
      : (and AIX, but I'm not sure it even compiles on AIX).  The lack of
      : MADV_FREE on Linux forced me down a long series of increasingly
      : sophisticated heuristics for madvise() volume reduction, and even so this
      : remains a common performance issue for people using jemalloc on Linux.
      : Please integrate MADV_FREE; many people will benefit substantially.
      
      How it works:
      
      When madvise syscall is called, VM clears dirty bit of ptes of the
      range.  If memory pressure happens, VM checks dirty bit of page table
      and if it found still "clean", it means it's a "lazyfree pages" so VM
      could discard the page instead of swapping out.  Once there was store
      operation for the page before VM peek a page to reclaim, dirty bit is
      set so VM can swap out the page instead of discarding.
      
      One thing we should notice is that basically, MADV_FREE relies on dirty
      bit in page table entry to decide whether VM allows to discard the page
      or not.  IOW, if page table entry includes marked dirty bit, VM
      shouldn't discard the page.
      
      However, as a example, if swap-in by read fault happens, page table
      entry doesn't have dirty bit so MADV_FREE could discard the page
      wrongly.
      
      For avoiding the problem, MADV_FREE did more checks with PageDirty and
      PageSwapCache.  It worked out because swapped-in page lives on swap
      cache and since it is evicted from the swap cache, the page has PG_dirty
      flag.  So both page flags check effectively prevent wrong discarding by
      MADV_FREE.
      
      However, a problem in above logic is that swapped-in page has PG_dirty
      still after they are removed from swap cache so VM cannot consider the
      page as freeable any more even if madvise_free is called in future.
      
      Look at below example for detail.
      
          ptr = malloc();
          memset(ptr);
          ..
          ..
          .. heavy memory pressure so all of pages are swapped out
          ..
          ..
          var = *ptr; -> a page swapped-in and could be removed from
                         swapcache. Then, page table doesn't mark
                         dirty bit and page descriptor includes PG_dirty
          ..
          ..
          madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
          ..
          ..
          ..
          .. heavy memory pressure again.
          .. In this time, VM cannot discard the page because the page
          .. has *PG_dirty*
      
      To solve the problem, this patch clears PG_dirty if only the page is
      owned exclusively by current process when madvise is called because
      PG_dirty represents ptes's dirtiness in several processes so we could
      clear it only if we own it exclusively.
      
      Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
      and hope glibc supports it) and jemalloc/tcmalloc already have supported
      the feature for other OS(ex, FreeBSD)
      
        barrios@blaptop:~/benchmark/ebizzy$ lscpu
        Architecture:          x86_64
        CPU op-mode(s):        32-bit, 64-bit
        Byte Order:            Little Endian
        CPU(s):                12
        On-line CPU(s) list:   0-11
        Thread(s) per core:    1
        Core(s) per socket:    1
        Socket(s):             12
        NUMA node(s):          1
        Vendor ID:             GenuineIntel
        CPU family:            6
        Model:                 2
        Stepping:              3
        CPU MHz:               3200.185
        BogoMIPS:              6400.53
        Virtualization:        VT-x
        Hypervisor vendor:     KVM
        Virtualization type:   full
        L1d cache:             32K
        L1i cache:             32K
        L2 cache:              4096K
        NUMA node0 CPU(s):     0-11
        ebizzy benchmark(./ebizzy -S 10 -n 512)
      
        Higher avg is better.
      
         vanilla-jemalloc             MADV_free-jemalloc
      
        1 thread
        records: 10                   records: 10
        avg:   2961.90                avg:  12069.70
        std:     71.96(2.43%)         std:    186.68(1.55%)
        max:   3070.00                max:  12385.00
        min:   2796.00                min:  11746.00
      
        2 thread
        records: 10                   records: 10
        avg:   5020.00                avg:  17827.00
        std:    264.87(5.28%)         std:    358.52(2.01%)
        max:   5244.00                max:  18760.00
        min:   4251.00                min:  17382.00
      
        4 thread
        records: 10                   records: 10
        avg:   8988.80                avg:  27930.80
        std:   1175.33(13.08%)        std:   3317.33(11.88%)
        max:   9508.00                max:  30879.00
        min:   5477.00                min:  21024.00
      
        8 thread
        records: 10                   records: 10
        avg:  13036.50                avg:  33739.40
        std:    170.67(1.31%)         std:   5146.22(15.25%)
        max:  13371.00                max:  40572.00
        min:  12785.00                min:  24088.00
      
        16 thread
        records: 10                   records: 10
        avg:  11092.40                avg:  31424.20
        std:    710.60(6.41%)         std:   3763.89(11.98%)
        max:  12446.00                max:  36635.00
        min:   9949.00                min:  25669.00
      
        32 thread
        records: 10                   records: 10
        avg:  11067.00                avg:  34495.80
        std:    971.06(8.77%)         std:   2721.36(7.89%)
        max:  12010.00                max:  38598.00
        min:   9002.00                min:  30636.00
      
      In summary, MADV_FREE is about much faster than MADV_DONTNEED.
      
      This patch (of 12):
      
      Add core MADV_FREE implementation.
      
      [akpm@linux-foundation.org: small cleanups]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jason Evans <je@fb.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Shaohua Li" <shli@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      854e9ed0
    • V
      mm: add page_check_address_transhuge() helper · 8749cfea
      Vladimir Davydov 提交于
      page_referenced_one() and page_idle_clear_pte_refs_one() duplicate the
      code for looking up pte of a (possibly transhuge) page.  Move this code
      to a new helper function, page_check_address_transhuge(), and make the
      above mentioned functions use it.
      
      This is just a cleanup, no functional changes are intended.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Reviewed-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8749cfea
    • K
      mm: prepare page_referenced() and page_idle to new THP refcounting · b20ce5e0
      Kirill A. Shutemov 提交于
      Both page_referenced() and page_idle_clear_pte_refs_one() assume that
      THP can only be mapped with PMD, so there's no reason to look on PTEs
      for PageTransHuge() pages.  That's no true anymore: THP can be mapped
      with PTEs too.
      
      The patch removes PageTransHuge() test from the functions and opencode
      page table check.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b20ce5e0
    • K
      thp: allow mlocked THP again · e90309c9
      Kirill A. Shutemov 提交于
      Before THP refcounting rework, THP was not allowed to cross VMA
      boundary.  So, if we have THP and we split it, PG_mlocked can be safely
      transferred to small pages.
      
      With new THP refcounting and naive approach to mlocking we can end up
      with this scenario:
       1. we have a mlocked THP, which belong to one VM_LOCKED VMA.
       2. the process does munlock() on the *part* of the THP:
            - the VMA is split into two, one of them VM_LOCKED;
            - huge PMD split into PTE table;
            - THP is still mlocked;
       3. split_huge_page():
            - it transfers PG_mlocked to *all* small pages regrardless if it
      	blong to any VM_LOCKED VMA.
      
      We probably could munlock() all small pages on split_huge_page(), but I
      think we have accounting issue already on step two.
      
      Instead of forbidding mlocked pages altogether, we just avoid mlocking
      PTE-mapped THPs and munlock THPs on split_huge_pmd().
      
      This means PTE-mapped THPs will be on normal lru lists and will be split
      under memory pressure by vmscan.  After the split vmscan will detect
      unevictable small pages and mlock them.
      
      With this approach we shouldn't hit situation like described above.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e90309c9
    • K
      thp: introduce deferred_split_huge_page() · 9a982250
      Kirill A. Shutemov 提交于
      Currently we don't split huge page on partial unmap.  It's not an ideal
      situation.  It can lead to memory overhead.
      
      Furtunately, we can detect partial unmap on page_remove_rmap().  But we
      cannot call split_huge_page() from there due to locking context.
      
      It's also counterproductive to do directly from munmap() codepath: in
      many cases we will hit this from exit(2) and splitting the huge page
      just to free it up in small pages is not what we really want.
      
      The patch introduce deferred_split_huge_page() which put the huge page
      into queue for splitting.  The splitting itself will happen when we get
      memory pressure via shrinker interface.  The page will be dropped from
      list on freeing through compound page destructor.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a982250
    • K
      thp: reintroduce split_huge_page() · e9b61f19
      Kirill A. Shutemov 提交于
      This patch adds implementation of split_huge_page() for new
      refcountings.
      
      Unlike previous implementation, new split_huge_page() can fail if
      somebody holds GUP pin on the page.  It also means that pin on page
      would prevent it from bening split under you.  It makes situation in
      many places much cleaner.
      
      The basic scheme of split_huge_page():
      
        - Check that sum of mapcounts of all subpage is equal to page_count()
          plus one (caller pin). Foll off with -EBUSY. This way we can avoid
          useless PMD-splits.
      
        - Freeze the page counters by splitting all PMD and setup migration
          PTEs.
      
        - Re-check sum of mapcounts against page_count(). Page's counts are
          stable now. -EBUSY if page is pinned.
      
        - Split compound page.
      
        - Unfreeze the page by removing migration entries.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9b61f19
    • K
      mm: rework mapcount accounting to enable 4k mapping of THPs · 53f9263b
      Kirill A. Shutemov 提交于
      We're going to allow mapping of individual 4k pages of THP compound.  It
      means we need to track mapcount on per small page basis.
      
      Straight-forward approach is to use ->_mapcount in all subpages to track
      how many time this subpage is mapped with PMDs or PTEs combined.  But
      this is rather expensive: mapping or unmapping of a THP page with PMD
      would require HPAGE_PMD_NR atomic operations instead of single we have
      now.
      
      The idea is to store separately how many times the page was mapped as
      whole -- compound_mapcount.  This frees up ->_mapcount in subpages to
      track PTE mapcount.
      
      We use the same approach as with compound page destructor and compound
      order to store compound_mapcount: use space in first tail page,
      ->mapping this time.
      
      Any time we map/unmap whole compound page (THP or hugetlb) -- we
      increment/decrement compound_mapcount.  When we map part of compound
      page with PTE we operate on ->_mapcount of the subpage.
      
      page_mapcount() counts both: PTE and PMD mappings of the page.
      
      Basically, we have mapcount for a subpage spread over two counters.  It
      makes tricky to detect when last mapcount for a page goes away.
      
      We introduced PageDoubleMap() for this.  When we split THP PMD for the
      first time and there's other PMD mapping left we offset up ->_mapcount
      in all subpages by one and set PG_double_map on the compound page.
      These additional references go away with last compound_mapcount.
      
      This approach provides a way to detect when last mapcount goes away on
      per small page basis without introducing new overhead for most common
      cases.
      
      [akpm@linux-foundation.org: fix typo in comment]
      [mhocko@suse.com: ignore partial THP when moving task]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53f9263b
    • K
      mm, thp: remove infrastructure for handling splitting PMDs · 4b471e88
      Kirill A. Shutemov 提交于
      With new refcounting we don't need to mark PMDs splitting.  Let's drop
      code to handle this.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b471e88
    • K
      rmap: add argument to charge compound page · d281ee61
      Kirill A. Shutemov 提交于
      We're going to allow mapping of individual 4k pages of THP compound
      page.  It means we cannot rely on PageTransHuge() check to decide if
      map/unmap small page or THP.
      
      The patch adds new argument to rmap functions to indicate whether we
      want to operate on whole compound page or only the small page.
      
      [n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d281ee61
  4. 15 1月, 2016 2 次提交
    • J
      mm, shmem: add internal shmem resident memory accounting · eca56ff9
      Jerome Marchand 提交于
      Currently looking at /proc/<pid>/status or statm, there is no way to
      distinguish shmem pages from pages mapped to a regular file (shmem pages
      are mapped to /dev/zero), even though their implication in actual memory
      use is quite different.
      
      The internal accounting currently counts shmem pages together with
      regular files.  As a preparation to extend the userspace interfaces,
      this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
      shmem pages separately from MM_FILEPAGES.  The next patch will expose it
      to userspace - this patch doesn't change the exported values yet, by
      adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
      used before.  The only user-visible change after this patch is the OOM
      killer message that separates the reported "shmem-rss" from "file-rss".
      
      [vbabka@suse.cz: forward-porting, tweak changelog]
      Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eca56ff9
    • V
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov 提交于
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d097056
  5. 06 11月, 2015 4 次提交
    • H
      mm: page migration use migration entry for swapcache too · 470f119f
      Hugh Dickins 提交于
      Hitherto page migration has avoided using a migration entry for a
      swapcache page mapped into userspace, apparently for historical reasons.
      So any page blessed with swapcache would entail a minor fault when it's
      next touched, which page migration otherwise tries to avoid.  Swapcache in
      an mlocked area is rare, so won't often matter, but still better fixed.
      
      Just rearrange the block in try_to_unmap_one(), to handle TTU_MIGRATION
      before checking PageAnon, that's all (apart from some reindenting).
      
      Well, no, that's not quite all: doesn't this by the way fix a soft_dirty
      bug, that page migration of a file page was forgetting to transfer the
      soft_dirty bit?  Probably not a serious bug: if I understand correctly,
      soft_dirty afficionados usually have to handle file pages separately
      anyway; but we publish the bit in /proc/<pid>/pagemap on file mappings as
      well as anonymous, so page migration ought not to perturb it.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      470f119f
    • H
      mm: rmap use pte lock not mmap_sem to set PageMlocked · b87537d9
      Hugh Dickins 提交于
      KernelThreadSanitizer (ktsan) has shown that the down_read_trylock() of
      mmap_sem in try_to_unmap_one() (when going to set PageMlocked on a page
      found mapped in a VM_LOCKED vma) is ineffective against races with
      exit_mmap()'s munlock_vma_pages_all(), because mmap_sem is not held when
      tearing down an mm.
      
      But that's okay, those races are benign; and although we've believed for
      years in that ugly down_read_trylock(), it's unsuitable for the job, and
      frustrates the good intention of setting PageMlocked when it fails.
      
      It just doesn't matter if here we read vm_flags an instant before or after
      a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED: the
      syscalls (or exit) work their way up the address space (taking pt locks
      after updating vm_flags) to establish the final state.
      
      We do still need to be careful never to mark a page Mlocked (hence
      unevictable) by any race that will not be corrected shortly after.  The
      page lock protects from many of the races, but not all (a page is not
      necessarily locked when it's unmapped).  But the pte lock we just dropped
      is good to cover the rest (and serializes even with
      munlock_vma_pages_all(), so no special barriers required): now hold on to
      the pte lock while calling mlock_vma_page().  Is that lock ordering safe?
      Yes, that's how follow_page_pte() calls it, and how page_remove_rmap()
      calls the complementary clear_page_mlock().
      
      This fixes the following case (though not a case which anyone has
      complained of), which mmap_sem did not: truncation's preliminary
      unmap_mapping_range() is supposed to remove even the anonymous COWs of
      filecache pages, and that might race with try_to_unmap_one() on a
      VM_LOCKED vma, so that mlock_vma_page() sets PageMlocked just after
      zap_pte_range() unmaps the page, causing "Bad page state (mlocked)" when
      freed.  The pte lock protects against this.
      
      You could say that it also protects against the more ordinary case, racing
      with the preliminary unmapping of a filecache page itself: but in our
      current tree, that's independently protected by i_mmap_rwsem; and that
      race would be why "Bad page state (mlocked)" was seen before commit
      48ec833b ("Revert mm/memory.c: share the i_mmap_rwsem").
      
      Vlastimil Babka points out another race which this patch protects against.
       try_to_unmap_one() might reach its mlock_vma_page() TestSetPageMlocked a
      moment after munlock_vma_pages_all() did its Phase 1 TestClearPageMlocked:
      leaving PageMlocked and unevictable when it should be evictable.  mmap_sem
      is ineffective because exit_mmap() does not hold it; page lock ineffective
      because __munlock_pagevec() only takes it afterwards, in Phase 2; pte lock
      is effective because __munlock_pagevec_fill() takes it to get the page,
      after VM_LOCKED was cleared from vm_flags, so visible to try_to_unmap_one.
      
      Kirill Shutemov points out that if the compiler chooses to implement a
      "vma->vm_flags &= VM_WHATEVER" or "vma->vm_flags |= VM_WHATEVER" operation
      with an intermediate store of unrelated bits set, since I'm here foregoing
      its usual protection by mmap_sem, try_to_unmap_one() might catch sight of
      a spurious VM_LOCKED in vm_flags, and make the wrong decision.  This does
      not appear to be an immediate problem, but we may want to define vm_flags
      accessors in future, to guard against such a possibility.
      
      While we're here, make a related optimization in try_to_munmap_one(): if
      it's doing TTU_MUNLOCK, then there's no point at all in descending the
      page tables and getting the pt lock, unless the vma is VM_LOCKED.  Yes,
      that can change racily, but it can change racily even without the
      optimization: it's not critical.  Far better not to waste time here.
      
      Stopped short of separating try_to_munlock_one() from try_to_munmap_one()
      on this occasion, but that's probably the sensible next step - with a
      rename, given that try_to_munlock()'s business is to try to set Mlocked.
      
      Updated the unevictable-lru Documentation, to remove its reference to mmap
      semaphore, but found a few more updates needed in just that area.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b87537d9
    • A
      ksm: add cond_resched() to the rmap_walks · ad12695f
      Andrea Arcangeli 提交于
      While at it add it to the file and anon walks too.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Acked-by: NDavidlohr Bueso <dbueso@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad12695f
    • N
      mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status · 5d317b2b
      Naoya Horiguchi 提交于
      Currently there's no easy way to get per-process usage of hugetlb pages,
      which is inconvenient because userspace applications which use hugetlb
      typically want to control their processes on the basis of how much memory
      (including hugetlb) they use.  So this patch simply provides easy access
      to the info via /proc/PID/status.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NJoern Engel <joern@logfs.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d317b2b
  6. 11 9月, 2015 1 次提交
    • V
      mm: introduce idle page tracking · 33c3fc71
      Vladimir Davydov 提交于
      Knowing the portion of memory that is not used by a certain application or
      memory cgroup (idle memory) can be useful for partitioning the system
      efficiently, e.g.  by setting memory cgroup limits appropriately.
      Currently, the only means to estimate the amount of idle memory provided
      by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
      access bit for all pages mapped to a particular process by writing 1 to
      clear_refs, wait for some time, and then count smaps:Referenced.  However,
      this method has two serious shortcomings:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      To overcome these drawbacks, this patch introduces two new page flags,
      Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
      A page's Idle flag can only be set from userspace by setting bit in
      /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
      and it is cleared whenever the page is accessed either through page tables
      (it is cleared in page_referenced() in this case) or using the read(2)
      system call (mark_page_accessed()). Thus by setting the Idle flag for
      pages of a particular workload, which can be found e.g.  by reading
      /proc/PID/pagemap, waiting for some time to let the workload access its
      working set, and then reading the bitmap file, one can estimate the amount
      of pages that are not used by the workload.
      
      The Young page flag is used to avoid interference with the memory
      reclaimer.  A page's Young flag is set whenever the Access bit of a page
      table entry pointing to the page is cleared by writing to the bitmap file.
      If page_referenced() is called on a Young page, it will add 1 to its
      return value, therefore concealing the fact that the Access bit was
      cleared.
      
      Note, since there is no room for extra page flags on 32 bit, this feature
      uses extended page flags when compiled on 32 bit.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: kpageidle requires an MMU]
      [akpm@linux-foundation.org: decouple from page-flags rework]
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33c3fc71
  7. 05 9月, 2015 2 次提交
    • M
      mm: defer flush of writable TLB entries · d950c947
      Mel Gorman 提交于
      If a PTE is unmapped and it's dirty then it was writable recently.  Due to
      deferred TLB flushing, it's best to assume a writable TLB cache entry
      exists.  With that assumption, the TLB must be flushed before any IO can
      start or the page is freed to avoid lost writes or data corruption.  This
      patch defers flushing of potentially writable TLBs as long as possible.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d950c947
    • M
      mm: send one IPI per CPU to TLB flush all entries after unmapping pages · 72b252ae
      Mel Gorman 提交于
      An IPI is sent to flush remote TLBs when a page is unmapped that was
      potentially accesssed by other CPUs.  There are many circumstances where
      this happens but the obvious one is kswapd reclaiming pages belonging to a
      running process as kswapd and the task are likely running on separate
      CPUs.
      
      On small machines, this is not a significant problem but as machine gets
      larger with more cores and more memory, the cost of these IPIs can be
      high.  This patch uses a simple structure that tracks CPUs that
      potentially have TLB entries for pages being unmapped.  When the unmapping
      is complete, the full TLB is flushed on the assumption that a refill cost
      is lower than flushing individual entries.
      
      Architectures wishing to do this must give the following guarantee.
      
              If a clean page is unmapped and not immediately flushed, the
              architecture must guarantee that a write to that linear address
              from a CPU with a cached TLB entry will trap a page fault.
      
      This is essentially what the kernel already depends on but the window is
      much larger with this patch applied and is worth highlighting.  The
      architecture should consider whether the cost of the full TLB flush is
      higher than sending an IPI to flush each individual entry.  An additional
      architecture helper called flush_tlb_local is required.  It's a trivial
      wrapper with some accounting in the x86 case.
      
      The impact of this patch depends on the workload as measuring any benefit
      requires both mapped pages co-located on the LRU and memory pressure.  The
      case with the biggest impact is multiple processes reading mapped pages
      taken from the vm-scalability test suite.  The test case uses NR_CPU
      readers of mapped files that consume 10*RAM.
      
      Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed      159.62 (  0.00%)   120.68 ( 24.40%)
      Ops lru-file-mmap-read-time_range    30.59 (  0.00%)     2.80 ( 90.85%)
      Ops lru-file-mmap-read-time_stddv     6.70 (  0.00%)     0.64 ( 90.38%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User          581.00       611.43
      System       5804.93      4111.76
      Elapsed       161.03       122.12
      
      This is showing that the readers completed 24.40% faster with 29% less
      system CPU time.  From vmstats, it is known that the vanilla kernel was
      interrupted roughly 900K times per second during the steady phase of the
      test and the patched kernel was interrupts 180K times per second.
      
      The impact is lower on a single socket machine.
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed       25.33 (  0.00%)    20.38 ( 19.54%)
      Ops lru-file-mmap-read-time_range     0.91 (  0.00%)     1.44 (-58.24%)
      Ops lru-file-mmap-read-time_stddv     0.28 (  0.00%)     0.47 (-65.34%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User           58.09        57.64
      System        111.82        76.56
      Elapsed        27.29        22.55
      
      It's still a noticeable improvement with vmstat showing interrupts went
      from roughly 500K per second to 45K per second.
      
      The patch will have no impact on workloads with no memory pressure or have
      relatively few mapped pages.  It will have an unpredictable impact on the
      workload running on the CPU being flushed as it'll depend on how many TLB
      entries need to be refilled and how long that takes.  Worst case, the TLB
      will be completely cleared of active entries when the target PFNs were not
      resident at all.
      
      [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72b252ae
  8. 25 6月, 2015 2 次提交
  9. 02 6月, 2015 1 次提交
    • G
      memcg: add per cgroup dirty page accounting · c4843a75
      Greg Thelen 提交于
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg  task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat() which returns the memcg later
      needed by for mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: NSha Zhengju <handai.szj@gmail.com>
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      c4843a75
  10. 16 4月, 2015 1 次提交
  11. 26 3月, 2015 1 次提交
  12. 12 2月, 2015 1 次提交
  13. 11 2月, 2015 1 次提交
  14. 09 1月, 2015 1 次提交
  15. 18 12月, 2014 1 次提交
  16. 14 12月, 2014 4 次提交
  17. 11 12月, 2014 1 次提交
    • M
      mm, memcg: fix potential undefined behaviour in page stat accounting · e4bd6a02
      Michal Hocko 提交于
      Since commit d7365e78 ("mm: memcontrol: fix missed end-writeback
      page accounting") mem_cgroup_end_page_stat consumes locked and flags
      variables directly rather than via pointers which might trigger C
      undefined behavior as those variables are initialized only in the slow
      path of mem_cgroup_begin_page_stat.
      
      Although mem_cgroup_end_page_stat handles parameters correctly and
      touches them only when they hold a sensible value it is caller which
      loads a potentially uninitialized value which then might allow compiler
      to do crazy things.
      
      I haven't seen any warning from gcc and it seems that the current
      version (4.9) doesn't exploit this type undefined behavior but Sasha has
      reported the following:
      
        UBSan: Undefined behaviour in mm/rmap.c:1084:2
        load of value 255 is not a valid value for type '_Bool'
        CPU: 4 PID: 8304 Comm: rngd Not tainted 3.18.0-rc2-next-20141029-sasha-00039-g77ed13d-dirty #1427
        Call Trace:
          dump_stack (lib/dump_stack.c:52)
          ubsan_epilogue (lib/ubsan.c:159)
          __ubsan_handle_load_invalid_value (lib/ubsan.c:482)
          page_remove_rmap (mm/rmap.c:1084 mm/rmap.c:1096)
          unmap_page_range (./arch/x86/include/asm/atomic.h:27 include/linux/mm.h:463 mm/memory.c:1146 mm/memory.c:1258 mm/memory.c:1279 mm/memory.c:1303)
          unmap_single_vma (mm/memory.c:1348)
          unmap_vmas (mm/memory.c:1377 (discriminator 3))
          exit_mmap (mm/mmap.c:2837)
          mmput (kernel/fork.c:659)
          do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:462 kernel/exit.c:747)
          do_group_exit (include/linux/sched.h:775 kernel/exit.c:873)
          SyS_exit_group (kernel/exit.c:901)
          tracesys_phase2 (arch/x86/kernel/entry_64.S:529)
      
      Fix this by using pointer parameters for both locked and flags and be
      more robust for future compiler changes even though the current code is
      implemented correctly.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e4bd6a02
  18. 04 12月, 2014 1 次提交
    • D
      mm: fix anon_vma_clone() error treatment · c4ea95d7
      Daniel Forrest 提交于
      Andrew Morton noticed that the error return from anon_vma_clone() was
      being dropped and replaced with -ENOMEM (which is not itself a bug
      because the only error return value from anon_vma_clone() is -ENOMEM).
      
      I did an audit of callers of anon_vma_clone() and discovered an actual
      bug where the error return was being lost.  In __split_vma(), between
      Linux 3.11 and 3.12 the code was changed so the err variable is used
      before the call to anon_vma_clone() and the default initial value of
      -ENOMEM is overwritten.  So a failure of anon_vma_clone() will return
      success since err at this point is now zero.
      
      Below is a patch which fixes this bug and also propagates the error
      return value from anon_vma_clone() in all cases.
      
      Fixes: ef0855d3 ("mm: mempolicy: turn vma_set_policy() into vma_dup_policy()")
      Signed-off-by: NDaniel Forrest <dan.forrest@ssec.wisc.edu>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tim Hartrick <tim@edgecast.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4ea95d7
  19. 30 10月, 2014 1 次提交