1. 04 5月, 2017 5 次提交
  2. 01 4月, 2017 1 次提交
  3. 10 3月, 2017 2 次提交
  4. 02 3月, 2017 2 次提交
  5. 25 2月, 2017 6 次提交
  6. 13 12月, 2016 1 次提交
    • V
      mm, rmap: handle anon_vma_prepare() common case inline · d5a187da
      Vlastimil Babka 提交于
      anon_vma_prepare() is mostly a large "if (unlikely(...))" block, as the
      expected common case is that an anon_vma already exists.  We could turn
      the condition around and return 0, but it also makes sense to do it
      inline and avoid a call for the common case.
      
      Bloat-o-meter naturally shows that inlining the check has some code size
      costs:
      
      add/remove: 1/1 grow/shrink: 4/0 up/down: 475/-373 (102)
      function                                     old     new   delta
      __anon_vma_prepare                             -     359    +359
      handle_mm_fault                             2744    2796     +52
      hugetlb_cow                                 1146    1170     +24
      hugetlb_fault                               2123    2145     +22
      wp_page_copy                                1469    1487     +18
      anon_vma_prepare                             373       -    -373
      
      Checking the asm however confirms that the hot paths now avoid a call,
      which is moved away.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20161116074005.22768-1-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d5a187da
  7. 11 8月, 2016 2 次提交
  8. 29 7月, 2016 4 次提交
  9. 27 7月, 2016 3 次提交
  10. 15 7月, 2016 2 次提交
  11. 28 5月, 2016 1 次提交
  12. 20 5月, 2016 2 次提交
  13. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  14. 02 4月, 2016 1 次提交
    • N
      mm/rmap: batched invalidations should use existing api · 858eaaa7
      Nadav Amit 提交于
      The recently introduced batched invalidations mechanism uses its own
      mechanism for shootdown.  However, it does wrong accounting of
      interrupts (e.g., inc_irq_stat is called for local invalidations),
      trace-points (e.g., TLB_REMOTE_SHOOTDOWN for local invalidations) and
      may break some platforms as it bypasses the invalidation mechanisms of
      Xen and SGI UV.
      
      This patch reuses the existing TLB flushing mechnaisms instead.  We use
      NULL as mm to indicate a global invalidation is required.
      
      Fixes 72b252ae ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
      Signed-off-by: NNadav Amit <namit@vmware.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      858eaaa7
  15. 18 3月, 2016 3 次提交
    • K
      thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers · fec89c10
      Kirill A. Shutemov 提交于
      freeze_page() and unfreeze_page() helpers evolved in rather complex
      beasts.  It would be nice to cut complexity of this code.
      
      This patch rewrites freeze_page() using standard try_to_unmap().
      unfreeze_page() is rewritten with remove_migration_ptes().
      
      The result is much simpler.
      
      But the new variant is somewhat slower for PTE-mapped THPs.  Current
      helpers iterates over VMAs the compound page is mapped to, and then over
      ptes within this VMA.  New helpers iterates over small page, then over
      VMA the small page mapped to, and only then find relevant pte.
      
      We have short cut for PMD-mapped THP: we directly install migration
      entries on PMD split.
      
      I don't think the slowdown is critical, considering how much simpler
      result is and that split_huge_page() is quite rare nowadays.  It only
      happens due memory pressure or migration.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fec89c10
    • K
      rmap: extend try_to_unmap() to be usable by split_huge_page() · 2a52bcbc
      Kirill A. Shutemov 提交于
      Add support for two ttu_flags:
      
        - TTU_SPLIT_HUGE_PMD would split PMD if it's there, before trying to
          unmap page;
      
        - TTU_RMAP_LOCKED indicates that caller holds relevant rmap lock;
      
      Also, change rwc->done to !page_mapcount() instead of !page_mapped().
      try_to_unmap() works on pte level, so we are really interested in the
      mappedness of this small page rather than of the compound page it's a
      part of.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a52bcbc
    • K
      rmap: introduce rmap_walk_locked() · b9773199
      Kirill A. Shutemov 提交于
      This patchset rewrites freeze_page() and unfreeze_page() using
      try_to_unmap() and remove_migration_ptes().  Result is much simpler, but
      somewhat slower.
      
      Migration 8GiB worth of PMD-mapped THP:
      
        Baseline	20.21 +/- 0.393
        Patched	20.73 +/- 0.082
        Slowdown	1.03x
      
      It's 3% slower, comparing to 14% in v1.  I don't it should be a stopper.
      
      Splitting of PTE-mapped pages slowed more.  But this is not a common
      case.
      
      Migration 8GiB worth of PMD-mapped THP:
      
        Baseline	20.39 +/- 0.225
        Patched	22.43 +/- 0.496
        Slowdown	1.10x
      
      rmap_walk_locked() is the same as rmap_walk(), but the caller takes care
      of the relevant rmap lock.
      
      This is preparation for switching THP splitting from custom rmap walk in
      freeze_page()/unfreeze_page() to the generic one.
      
      There is no support for KSM pages for now: not clear which lock is
      implied.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9773199
  16. 16 3月, 2016 2 次提交
  17. 16 1月, 2016 2 次提交
    • K
      mm: fix locking order in mm_take_all_locks() · 88f306b6
      Kirill A. Shutemov 提交于
      Dmitry Vyukov has reported[1] possible deadlock (triggered by his
      syzkaller fuzzer):
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&hugetlbfs_i_mmap_rwsem_key);
                                     lock(&mapping->i_mmap_rwsem);
                                     lock(&hugetlbfs_i_mmap_rwsem_key);
        lock(&mapping->i_mmap_rwsem);
      
      Both traces points to mm_take_all_locks() as a source of the problem.
      It doesn't take care about ordering or hugetlbfs_i_mmap_rwsem_key (aka
      mapping->i_mmap_rwsem for hugetlb mapping) vs.  i_mmap_rwsem.
      
      huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key
      and allocator can take i_mmap_rwsem if it hit reclaim.  So we need to
      take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem from
      rest of VMAs.
      
      The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key.
      
      [1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88f306b6
    • M
      mm: support madvise(MADV_FREE) · 854e9ed0
      Minchan Kim 提交于
      Linux doesn't have an ability to free pages lazy while other OS already
      have been supported that named by madvise(MADV_FREE).
      
      The gain is clear that kernel can discard freed pages rather than
      swapping out or OOM if memory pressure happens.
      
      Without memory pressure, freed pages would be reused by userspace
      without another additional overhead(ex, page fault + allocation +
      zeroing).
      
      Jason Evans said:
      
      : Facebook has been using MAP_UNINITIALIZED
      : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
      : several years, but there are operational costs to maintaining this
      : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
      : in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
      : increased throughput for much of our workload by ~5%, and although the
      : benefit has decreased using newer hardware and kernels, there is still
      : enough benefit that we cannot reasonably retire it without a replacement.
      :
      : Aside from Facebook operations, there are numerous broadly used
      : applications that would benefit from MADV_FREE.  The ones that immediately
      : come to mind are redis, varnish, and MariaDB.  I don't have much insight
      : into Android internals and development process, but I would hope to see
      : MADV_FREE support eventually end up there as well to benefit applications
      : linked with the integrated jemalloc.
      :
      : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
      : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
      : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
      : (and AIX, but I'm not sure it even compiles on AIX).  The lack of
      : MADV_FREE on Linux forced me down a long series of increasingly
      : sophisticated heuristics for madvise() volume reduction, and even so this
      : remains a common performance issue for people using jemalloc on Linux.
      : Please integrate MADV_FREE; many people will benefit substantially.
      
      How it works:
      
      When madvise syscall is called, VM clears dirty bit of ptes of the
      range.  If memory pressure happens, VM checks dirty bit of page table
      and if it found still "clean", it means it's a "lazyfree pages" so VM
      could discard the page instead of swapping out.  Once there was store
      operation for the page before VM peek a page to reclaim, dirty bit is
      set so VM can swap out the page instead of discarding.
      
      One thing we should notice is that basically, MADV_FREE relies on dirty
      bit in page table entry to decide whether VM allows to discard the page
      or not.  IOW, if page table entry includes marked dirty bit, VM
      shouldn't discard the page.
      
      However, as a example, if swap-in by read fault happens, page table
      entry doesn't have dirty bit so MADV_FREE could discard the page
      wrongly.
      
      For avoiding the problem, MADV_FREE did more checks with PageDirty and
      PageSwapCache.  It worked out because swapped-in page lives on swap
      cache and since it is evicted from the swap cache, the page has PG_dirty
      flag.  So both page flags check effectively prevent wrong discarding by
      MADV_FREE.
      
      However, a problem in above logic is that swapped-in page has PG_dirty
      still after they are removed from swap cache so VM cannot consider the
      page as freeable any more even if madvise_free is called in future.
      
      Look at below example for detail.
      
          ptr = malloc();
          memset(ptr);
          ..
          ..
          .. heavy memory pressure so all of pages are swapped out
          ..
          ..
          var = *ptr; -> a page swapped-in and could be removed from
                         swapcache. Then, page table doesn't mark
                         dirty bit and page descriptor includes PG_dirty
          ..
          ..
          madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
          ..
          ..
          ..
          .. heavy memory pressure again.
          .. In this time, VM cannot discard the page because the page
          .. has *PG_dirty*
      
      To solve the problem, this patch clears PG_dirty if only the page is
      owned exclusively by current process when madvise is called because
      PG_dirty represents ptes's dirtiness in several processes so we could
      clear it only if we own it exclusively.
      
      Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
      and hope glibc supports it) and jemalloc/tcmalloc already have supported
      the feature for other OS(ex, FreeBSD)
      
        barrios@blaptop:~/benchmark/ebizzy$ lscpu
        Architecture:          x86_64
        CPU op-mode(s):        32-bit, 64-bit
        Byte Order:            Little Endian
        CPU(s):                12
        On-line CPU(s) list:   0-11
        Thread(s) per core:    1
        Core(s) per socket:    1
        Socket(s):             12
        NUMA node(s):          1
        Vendor ID:             GenuineIntel
        CPU family:            6
        Model:                 2
        Stepping:              3
        CPU MHz:               3200.185
        BogoMIPS:              6400.53
        Virtualization:        VT-x
        Hypervisor vendor:     KVM
        Virtualization type:   full
        L1d cache:             32K
        L1i cache:             32K
        L2 cache:              4096K
        NUMA node0 CPU(s):     0-11
        ebizzy benchmark(./ebizzy -S 10 -n 512)
      
        Higher avg is better.
      
         vanilla-jemalloc             MADV_free-jemalloc
      
        1 thread
        records: 10                   records: 10
        avg:   2961.90                avg:  12069.70
        std:     71.96(2.43%)         std:    186.68(1.55%)
        max:   3070.00                max:  12385.00
        min:   2796.00                min:  11746.00
      
        2 thread
        records: 10                   records: 10
        avg:   5020.00                avg:  17827.00
        std:    264.87(5.28%)         std:    358.52(2.01%)
        max:   5244.00                max:  18760.00
        min:   4251.00                min:  17382.00
      
        4 thread
        records: 10                   records: 10
        avg:   8988.80                avg:  27930.80
        std:   1175.33(13.08%)        std:   3317.33(11.88%)
        max:   9508.00                max:  30879.00
        min:   5477.00                min:  21024.00
      
        8 thread
        records: 10                   records: 10
        avg:  13036.50                avg:  33739.40
        std:    170.67(1.31%)         std:   5146.22(15.25%)
        max:  13371.00                max:  40572.00
        min:  12785.00                min:  24088.00
      
        16 thread
        records: 10                   records: 10
        avg:  11092.40                avg:  31424.20
        std:    710.60(6.41%)         std:   3763.89(11.98%)
        max:  12446.00                max:  36635.00
        min:   9949.00                min:  25669.00
      
        32 thread
        records: 10                   records: 10
        avg:  11067.00                avg:  34495.80
        std:    971.06(8.77%)         std:   2721.36(7.89%)
        max:  12010.00                max:  38598.00
        min:   9002.00                min:  30636.00
      
      In summary, MADV_FREE is about much faster than MADV_DONTNEED.
      
      This patch (of 12):
      
      Add core MADV_FREE implementation.
      
      [akpm@linux-foundation.org: small cleanups]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jason Evans <je@fb.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Shaohua Li" <shli@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      854e9ed0