1. 05 Sep, 2015 16 commits
    • A
      userfaultfd: avoid mmap_sem read recursion in mcopy_atomic · b6ebaedb
      Committed by Andrea Arcangeli
      If the rwsem starves writers it wasn't strictly a bug but lockdep
      doesn't like it and this avoids depending on lowlevel implementation
      details of the lock.
      
      [akpm@linux-foundation.org: delete weird BUILD_BUG_ON()]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation · c1a4de99
      Committed by Andrea Arcangeli
      This implements mcopy_atomic and mfill_zeropage that are the lowlevel
      VM methods that are invoked respectively by the UFFDIO_COPY and
      UFFDIO_ZEROPAGE userfaultfd commands.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      userfaultfd: prevent khugepaged to merge if userfaultfd is armed · c1294d05
      Committed by Andrea Arcangeli
      If userfaultfd is armed on a certain vma we can't "fill" the holes with
      zeroes or we'll break the userland on demand paging.  The holes if the
      userfault is armed, are really missing information (not zeroes) that the
      userland has to load from network or elsewhere.
      
      The same issue happens for wrprotected ptes that we can't just convert
      into a single writable pmd_trans_huge.
      
      We could however in theory still merge across zeropages if only
      VM_UFFD_MISSING is set (so if VM_UFFD_WP is not set)...  that could be
      slightly improved but it'd be much more complex code for a tiny corner
      case.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx · 19a809af
      Committed by Andrea Arcangeli
      vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
      must be aware of so that we can merge vmas back like they were
      originally before arming the userfaultfd on some memory range.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      userfaultfd: call handle_userfault() for userfaultfd_missing() faults · 6b251fc9
      Committed by Andrea Arcangeli
      This is where the page faults must be modified to call
      handle_userfault() if userfaultfd_missing() is true (so if the
      vma->vm_flags had VM_UFFD_MISSING set).
      
      handle_userfault() then takes care of blocking the page fault and
      delivering it to userland.
      
      The fault flags must also be passed as parameter so the "read|write"
      kind of fault can be passed to userland.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm/slab.h: fix argument order in cache_from_obj's error message · 2d16e0fd
      Committed by Daniel Borkmann
      While debugging a networking issue, I hit a condition that triggered an
      object to be freed into the wrong kmem cache, and thus triggered the
      warning in cache_from_obj().
      
      The arguments in the error message are in the wrong order: the location
      of the object's kmem cache is in cachep, not s.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm/slub: don't wait for high-order page allocation · 45eb00cd
      Committed by Joonsoo Kim
      Description is almost copied from commit fb05e7a8 ("net: don't wait
      for order-3 page allocation").
      
      I saw excessive direct memory reclaim/compaction triggered by slub.  This
      causes performance issues and adds latency.  Slub uses high-order
      allocation to reduce internal fragmentation and management overhead.  But
      direct memory reclaim/compaction has high overhead, and the benefit of
      high-order allocation can't compensate for the overhead of both.

      This patch makes the auxiliary high-order allocation atomic.  If there is
      no memory pressure and memory isn't fragmented, the allocation will still
      succeed, so we don't sacrifice high-order allocation's benefit here.  If
      the atomic allocation fails, direct memory reclaim/compaction will not be
      triggered and the allocation falls back to low order immediately, hence
      the direct memory reclaim/compaction overhead is avoided.  In the
      allocation failure case, kswapd is woken up and tries to make high-order
      free pages, so the allocation could succeed next time.
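
The fallback order described above can be sketched in userspace C. This is only an illustration under stated assumptions, not the SLUB code: `toy_alloc_fn`, `toy_fragmented_alloc` and `toy_alloc_slab` are hypothetical names, and the `may_reclaim` flag merely stands in for the difference between an atomic request and one that is allowed to enter direct reclaim/compaction.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page-allocator callback; returns NULL on failure.
 * may_reclaim == false models an atomic (non-blocking) request. */
typedef void *(*toy_alloc_fn)(unsigned int order, bool may_reclaim);

/* Toy backend modelling fragmented memory: a high-order request only
 * succeeds if the caller is willing to reclaim/compact. */
static void *toy_fragmented_alloc(unsigned int order, bool may_reclaim)
{
    static char backing[1];

    if (order > 0 && !may_reclaim)
        return NULL;
    return backing;
}

/* Try the preferred high order opportunistically, then fall back to the
 * minimum order right away instead of reclaiming for the high order. */
static void *toy_alloc_slab(toy_alloc_fn alloc, unsigned int high_order,
                            unsigned int min_order, unsigned int *got_order)
{
    void *pages = alloc(high_order, false);

    if (pages) {
        *got_order = high_order;
        return pages;
    }
    *got_order = min_order;
    return alloc(min_order, true);
}
```

With the fragmented toy backend, a request for order 3 with a minimum of order 0 comes back as an order-0 allocation without the high-order attempt ever being allowed to reclaim.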
      
      Following is the test to measure effect of this patch.
      
      System: QEMU, CPU 8, 512 MB
      Mem: 25% memory is allocated at random position to make fragmentation.
       Memory-hogger occupies 150 MB memory.
      Workload: hackbench -g 20 -l 1000
      
      Average result over 10 runs (Base vs Patched)
      
      elapsed_time(s): 4.3468 vs 2.9838
      compact_stall: 461.7 vs 73.6
      pgmigrate_success: 28315.9 vs 7256.1
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      mm/slub: fix slab double-free in case of duplicate sysfs filename · 80da026a
      Committed by Konstantin Khlebnikov
      sysfs_slab_add() shouldn't call kobject_put() on the error path: this
      drops the last reference to the kmem-cache kobject and frees it.  The
      kmem cache will then be freed a second time on the error path in
      kmem_cache_create().

      For example, this happens when slub debug was enabled at runtime and
      somebody creates a new kmem cache:
      
      # echo 1 | tee /sys/kernel/slab/*/sanity_checks
      # modprobe configfs
      
      "configfs_dir_cache" cannot be merged because the existing slabs have
      debugging enabled, and a new slab cannot be created because the unique
      name ":t-0000096" is already taken.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • T
      mm/slub: move slab initialization into irq enabled region · 588f8ba9
      Committed by Thomas Gleixner
      Initializing a new slab can introduce rather large latencies because most
      of the initialization always runs with interrupts disabled.
      
      There is no point in doing so.  The newly allocated slab is not visible
      yet, so there is no reason to protect it against concurrent alloc/free.
      
      Move the expensive parts of the initialization into allocate_slab(), so
      for all allocations with GFP_WAIT set, interrupts are enabled.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      slub: add support for kmem_cache_debug in bulk calls · 3eed034d
      Committed by Jesper Dangaard Brouer
      Per request of Joonsoo Kim adding kmem debug support.
      
      I've tested that when debugging is disabled, then there is almost no
      performance impact as this code basically gets removed by the compiler.
      
      Need some guidance in enabling and testing this.
      
      bulk- PREVIOUS                  - THIS-PATCH
        1 -  43 cycles(tsc) 10.811 ns -  44 cycles(tsc) 11.236 ns  improved  -2.3%
        2 -  27 cycles(tsc)  6.867 ns -  28 cycles(tsc)  7.019 ns  improved  -3.7%
        3 -  21 cycles(tsc)  5.496 ns -  22 cycles(tsc)  5.526 ns  improved  -4.8%
        4 -  24 cycles(tsc)  6.038 ns -  19 cycles(tsc)  4.786 ns  improved  20.8%
        8 -  17 cycles(tsc)  4.280 ns -  18 cycles(tsc)  4.572 ns  improved  -5.9%
       16 -  17 cycles(tsc)  4.483 ns -  18 cycles(tsc)  4.658 ns  improved  -5.9%
       30 -  18 cycles(tsc)  4.531 ns -  18 cycles(tsc)  4.568 ns  improved   0.0%
       32 -  58 cycles(tsc) 14.586 ns -  65 cycles(tsc) 16.454 ns  improved -12.1%
       34 -  53 cycles(tsc) 13.391 ns -  63 cycles(tsc) 15.932 ns  improved -18.9%
       48 -  65 cycles(tsc) 16.268 ns -  50 cycles(tsc) 12.506 ns  improved  23.1%
       64 -  53 cycles(tsc) 13.440 ns -  63 cycles(tsc) 15.929 ns  improved -18.9%
      128 -  79 cycles(tsc) 19.899 ns -  86 cycles(tsc) 21.583 ns  improved  -8.9%
      158 -  90 cycles(tsc) 22.732 ns -  90 cycles(tsc) 22.552 ns  improved   0.0%
      250 -  95 cycles(tsc) 23.916 ns -  98 cycles(tsc) 24.589 ns  improved  -3.2%
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      slub: initial bulk free implementation · fbd02630
      Committed by Jesper Dangaard Brouer
      This implements the SLUB-specific kmem_cache_free_bulk().  The SLUB
      allocator now has both bulk alloc and free implemented.
      
      Choose to reenable local IRQs while calling slowpath __slab_free().  In
      worst case, where all objects hit slowpath call, the performance should
      still be faster than fallback function __kmem_cache_free_bulk(), because
      local_irq_{disable+enable} is very fast (7-cycles), while the fallback
      invokes this_cpu_cmpxchg() which is slightly slower (9-cycles).
      Nitpicking, this should be faster for N>=4, due to the entry cost of
      local_irq_{disable+enable}.
      
      Do notice that the save+restore variant is very expensive, this is key to
      why this optimization works.
      
      CPU: i7-4790K CPU @ 4.00GHz
       * local_irq_{disable,enable}:  7 cycles(tsc) - 1.821 ns
       * local_irq_{save,restore}  : 37 cycles(tsc) - 9.443 ns
      
      Measurements on CPU i7-4790K @ 4.00GHz
      Baseline normal fastpath (alloc+free cost): 43 cycles(tsc) 10.834 ns
      
      Bulk- fallback                   - this-patch
        1 -  58 cycles(tsc) 14.542 ns  -  43 cycles(tsc) 10.811 ns  improved 25.9%
        2 -  50 cycles(tsc) 12.659 ns  -  27 cycles(tsc)  6.867 ns  improved 46.0%
        3 -  48 cycles(tsc) 12.168 ns  -  21 cycles(tsc)  5.496 ns  improved 56.2%
        4 -  47 cycles(tsc) 11.987 ns  -  24 cycles(tsc)  6.038 ns  improved 48.9%
        8 -  46 cycles(tsc) 11.518 ns  -  17 cycles(tsc)  4.280 ns  improved 63.0%
       16 -  45 cycles(tsc) 11.366 ns  -  17 cycles(tsc)  4.483 ns  improved 62.2%
       30 -  45 cycles(tsc) 11.433 ns  -  18 cycles(tsc)  4.531 ns  improved 60.0%
       32 -  75 cycles(tsc) 18.983 ns  -  58 cycles(tsc) 14.586 ns  improved 22.7%
       34 -  71 cycles(tsc) 17.940 ns  -  53 cycles(tsc) 13.391 ns  improved 25.4%
       48 -  80 cycles(tsc) 20.077 ns  -  65 cycles(tsc) 16.268 ns  improved 18.8%
       64 -  71 cycles(tsc) 17.799 ns  -  53 cycles(tsc) 13.440 ns  improved 25.4%
      128 -  91 cycles(tsc) 22.980 ns  -  79 cycles(tsc) 19.899 ns  improved 13.2%
      158 - 100 cycles(tsc) 25.241 ns  -  90 cycles(tsc) 22.732 ns  improved 10.0%
      250 - 102 cycles(tsc) 25.583 ns  -  95 cycles(tsc) 23.916 ns  improved  6.9%
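
The "improved" column in the table above appears to be the relative reduction in cycles(tsc) from the fallback to the patched code. A minimal sketch of that arithmetic, assuming this reading is right (the helper name `improvement_pct` is hypothetical and not part of the patch):

```c
/* Relative improvement, in percent, of the patched cycle count over the
 * fallback cycle count.  Negative values mean the patched code is slower. */
static double improvement_pct(double fallback_cycles, double patched_cycles)
{
    return (fallback_cycles - patched_cycles) / fallback_cycles * 100.0;
}
```

For the first row, `improvement_pct(58, 43)` evaluates to roughly 25.9, and for the second row `improvement_pct(50, 27)` gives 46.0, which matches the reported figures.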
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      slub: improve bulk alloc strategy · ebe909e0
      Committed by Jesper Dangaard Brouer
      Call slowpath __slab_alloc() from within the bulk loop, as the side-effect
      of this call likely repopulates c->freelist.
      
      Choose to reenable local IRQs while calling slowpath.
      
      Saving some optimizations for later.  E.g. it is possible to extract
      parts of __slab_alloc() and avoid the unnecessary and expensive (37
      cycles) local_irq_{save,restore}.  For now, be happy calling
      __slab_alloc(): this lowers the icache impact of this function, and I
      don't have to worry about correctness.
      
      Measurements on CPU i7-4790K @ 4.00GHz
      Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.601 ns
      
      Bulk- fallback                   - this-patch
        1 -  58 cycles(tsc) 14.516 ns  -  49 cycles(tsc) 12.459 ns  improved 15.5%
        2 -  51 cycles(tsc) 12.930 ns  -  38 cycles(tsc)  9.605 ns  improved 25.5%
        3 -  49 cycles(tsc) 12.274 ns  -  34 cycles(tsc)  8.525 ns  improved 30.6%
        4 -  48 cycles(tsc) 12.058 ns  -  32 cycles(tsc)  8.036 ns  improved 33.3%
        8 -  46 cycles(tsc) 11.609 ns  -  31 cycles(tsc)  7.756 ns  improved 32.6%
       16 -  45 cycles(tsc) 11.451 ns  -  32 cycles(tsc)  8.148 ns  improved 28.9%
       30 -  79 cycles(tsc) 19.865 ns  -  68 cycles(tsc) 17.164 ns  improved 13.9%
       32 -  76 cycles(tsc) 19.212 ns  -  66 cycles(tsc) 16.584 ns  improved 13.2%
       34 -  74 cycles(tsc) 18.600 ns  -  63 cycles(tsc) 15.954 ns  improved 14.9%
       48 -  88 cycles(tsc) 22.092 ns  -  77 cycles(tsc) 19.373 ns  improved 12.5%
       64 -  80 cycles(tsc) 20.043 ns  -  68 cycles(tsc) 17.188 ns  improved 15.0%
      128 -  99 cycles(tsc) 24.818 ns  -  89 cycles(tsc) 22.404 ns  improved 10.1%
      158 -  99 cycles(tsc) 24.977 ns  -  92 cycles(tsc) 23.089 ns  improved  7.1%
      250 - 106 cycles(tsc) 26.552 ns  -  99 cycles(tsc) 24.785 ns  improved  6.6%
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      slub bulk alloc: extract objects from the per cpu slab · 994eb764
      Committed by Jesper Dangaard Brouer
      First piece: acceleration of retrieval of per cpu objects
      
      If we are allocating lots of objects then it is advantageous to disable
      interrupts and avoid the this_cpu_cmpxchg() operation to get these objects
      faster.
      
      Note that we cannot do the fast operation if debugging is enabled, because
      we would have to add extra code to do all the debugging checks.  And it
      would not be fast anyway.
      
      Note also that the requirement of having interrupts disabled avoids having
      to do processor flag operations.
      
      Allocate as many objects as possible in the fast way and then fall back to
      the generic implementation for the rest of the objects.
      
      Measurements on CPU i7-4790K @ 4.00GHz
      Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.554 ns
      
      Bulk- fallback                   - this-patch
        1 -  57 cycles(tsc) 14.432 ns  -  48 cycles(tsc) 12.155 ns  improved 15.8%
        2 -  50 cycles(tsc) 12.746 ns  -  37 cycles(tsc)  9.390 ns  improved 26.0%
        3 -  48 cycles(tsc) 12.180 ns  -  33 cycles(tsc)  8.417 ns  improved 31.2%
        4 -  48 cycles(tsc) 12.015 ns  -  32 cycles(tsc)  8.045 ns  improved 33.3%
        8 -  46 cycles(tsc) 11.526 ns  -  30 cycles(tsc)  7.699 ns  improved 34.8%
       16 -  45 cycles(tsc) 11.418 ns  -  32 cycles(tsc)  8.205 ns  improved 28.9%
       30 -  80 cycles(tsc) 20.246 ns  -  73 cycles(tsc) 18.328 ns  improved  8.8%
       32 -  79 cycles(tsc) 19.946 ns  -  72 cycles(tsc) 18.208 ns  improved  8.9%
       34 -  78 cycles(tsc) 19.659 ns  -  71 cycles(tsc) 17.987 ns  improved  9.0%
       48 -  86 cycles(tsc) 21.516 ns  -  82 cycles(tsc) 20.566 ns  improved  4.7%
       64 -  93 cycles(tsc) 23.423 ns  -  89 cycles(tsc) 22.480 ns  improved  4.3%
      128 - 100 cycles(tsc) 25.170 ns  -  99 cycles(tsc) 24.871 ns  improved  1.0%
      158 - 102 cycles(tsc) 25.549 ns  - 101 cycles(tsc) 25.375 ns  improved  1.0%
      250 - 101 cycles(tsc) 25.344 ns  - 100 cycles(tsc) 25.182 ns  improved  1.0%
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      slab: infrastructure for bulk object allocation and freeing · 484748f0
      Committed by Christoph Lameter
      Add the basic infrastructure for alloc/free operations on pointer arrays.
      It includes a generic function in the common slab code that is used in
      this infrastructure patch to create the unoptimized functionality for slab
      bulk operations.
      
      Allocators can then provide optimized allocation functions for situations
      in which large numbers of objects are needed.  These optimizations may
      avoid taking locks repeatedly and bypass metadata creation if all objects
      in slab pages can be used to provide the objects required.
      
      Allocators can extend the skeletons provided and add their own code to the
      bulk alloc and free functions.  They can keep the generic allocation and
      freeing and just fall back to those if optimizations would not work (like
      for example when debugging is on).
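
To make the idea of an unoptimized, generic bulk operation on a pointer array concrete, here is a userspace sketch. It is an assumption-laden illustration, not the kernel implementation: `struct toy_cache`, `toy_alloc_bulk` and `toy_free_bulk` are hypothetical names, `malloc`/`free` stand in for the per-object slab calls, and the all-or-nothing behaviour on failure is a design choice of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a slab cache: just an object size. */
struct toy_cache {
    size_t object_size;
};

/* Generic bulk free: release nr objects from the pointer array p,
 * one ordinary free call per object. */
static void toy_free_bulk(struct toy_cache *c, size_t nr, void **p)
{
    (void)c;
    for (size_t i = 0; i < nr; i++) {
        free(p[i]);
        p[i] = NULL;
    }
}

/* Generic bulk alloc: fill p[0..nr-1] with one ordinary allocation per
 * object.  On failure, undo the partial work and report false. */
static bool toy_alloc_bulk(struct toy_cache *c, size_t nr, void **p)
{
    for (size_t i = 0; i < nr; i++) {
        p[i] = malloc(c->object_size);
        if (!p[i]) {
            toy_free_bulk(c, i, p);
            return false;
        }
    }
    return true;
}
```

An optimized allocator could replace the loop body with a faster path and still call the generic versions whenever that path does not apply, which is the extension model the message describes.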
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      slub: fix spelling succedd to succeed · 2ae44005
      Committed by Jesper Dangaard Brouer
      With this patchset the SLUB allocator now has both bulk alloc and free
      implemented.
      
      This patchset mostly optimizes the "fastpath" where objects are available
      on the per CPU fastpath page.  This mostly amortizes the less-heavy
      non-locked cmpxchg_double used on the fastpath.
      
      The "fallback" bulking (e.g __kmem_cache_free_bulk) provides a good basis
      for comparison.  Measurements[1] of the fallback functions
      __kmem_cache_{free,alloc}_bulk have been copied from slab_common.c and
      forced "noinline" to force a function call like slab_common.c.
      
      Measurements on CPU i7-4790K @ 4.00GHz
      Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.601 ns
      
      Measurements last-patch with disabled debugging:
      
      Bulk- fallback                   - this-patch
        1 -  57 cycles(tsc) 14.448 ns  -  44 cycles(tsc) 11.236 ns  improved 22.8%
        2 -  51 cycles(tsc) 12.768 ns  -  28 cycles(tsc)  7.019 ns  improved 45.1%
        3 -  48 cycles(tsc) 12.232 ns  -  22 cycles(tsc)  5.526 ns  improved 54.2%
        4 -  48 cycles(tsc) 12.025 ns  -  19 cycles(tsc)  4.786 ns  improved 60.4%
        8 -  46 cycles(tsc) 11.558 ns  -  18 cycles(tsc)  4.572 ns  improved 60.9%
       16 -  45 cycles(tsc) 11.458 ns  -  18 cycles(tsc)  4.658 ns  improved 60.0%
       30 -  45 cycles(tsc) 11.499 ns  -  18 cycles(tsc)  4.568 ns  improved 60.0%
       32 -  79 cycles(tsc) 19.917 ns  -  65 cycles(tsc) 16.454 ns  improved 17.7%
       34 -  78 cycles(tsc) 19.655 ns  -  63 cycles(tsc) 15.932 ns  improved 19.2%
       48 -  68 cycles(tsc) 17.049 ns  -  50 cycles(tsc) 12.506 ns  improved 26.5%
       64 -  80 cycles(tsc) 20.009 ns  -  63 cycles(tsc) 15.929 ns  improved 21.3%
      128 -  94 cycles(tsc) 23.749 ns  -  86 cycles(tsc) 21.583 ns  improved  8.5%
      158 -  97 cycles(tsc) 24.299 ns  -  90 cycles(tsc) 22.552 ns  improved  7.2%
      250 - 102 cycles(tsc) 25.681 ns  -  98 cycles(tsc) 24.589 ns  improved  3.9%
      
      Benchmarking shows impressive improvements in the "fastpath" with a small
      number of objects in the working set.  Once the working set increases,
      resulting in activating the "slowpath" (that contains the heavier locked
      cmpxchg_double) the improvement decreases.
      
      I'm currently working on also optimizing the "slowpath" (as network stack
      use-case hits this), but this patchset should provide a good foundation
      for further improvements.  Rest of my patch queue in this area needs some
      more work, but preliminary results are good.  I'm attending Netfilter
      Workshop[2] next week, and I'll hopefully return working on further
      improvements in this area.
      
      This patch (of 6):
      
      s/succedd/succeed/
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • T
      memory-hotplug: add hot-added memory ranges to memblock before allocate node_data for a node. · 7f36e3e5
      Committed by Tang Chen
      Commit f9126ab9 ("memory-hotplug: fix wrong edge when hot add a new
      node") hot-added memory range to memblock, after creating pgdat for new
      node.
      
      But there is a problem:
      
        add_memory()
        |--> hotadd_new_pgdat()
             |--> free_area_init_node()
                  |--> get_pfn_range_for_nid()
                       |--> find start_pfn and end_pfn in memblock
        |--> ......
        |--> memblock_add_node(start, size, nid)    --------    Here, just too late.
      
      get_pfn_range_for_nid() will find that start_pfn and end_pfn are both 0.
      As a result, when adding memory, dmesg will give the following wrong
      message.
      
        Initmem setup node 5 [mem 0x0000000000000000-0xffffffffffffffff]
        On node 5 totalpages: 0
        Built 5 zonelists in Node order, mobility grouping on.  Total pages: 32588823
        Policy zone: Normal
        init_memory_mapping: [mem 0x60000000000-0x607ffffffff]
      
      The solution is simple, just add the memory range to memblock a little
      earlier, before hotadd_new_pgdat().
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[4.2.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 01 Sep, 2015 1 commit
  3. 22 Aug, 2015 2 commits
    • A
      x86/kasan, mm: Introduce generic kasan_populate_zero_shadow() · 69786cdb
      Committed by Andrey Ryabinin
      Introduce generic kasan_populate_zero_shadow(shadow_start,
      shadow_end). This function maps kasan_zero_page to the
      [shadow_start, shadow_end] addresses.
      
      This replaces x86_64 specific populate_zero_shadow() and will
      be used for ARM64 in follow on patches.
      
      The main changes from original version are:
      
       * Use p?d_populate*() instead of set_p?d()
       * Use memblock allocator directly instead of vmemmap_alloc_block()
       * __pa() instead of __pa_nodebug(). __pa() causes trouble
         iff we use it before kasan_early_init(). kasan_populate_zero_shadow()
         will be used later, so we are OK with __pa() here.
      Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexey Klimov <klimov.linux@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David Keitel <dkeitel@codeaurora.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yury <yury.norov@gmail.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/1439444244-26057-3-git-send-email-ryabinin.a.a@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • M
      mm: make page pfmemalloc check more robust · 2f064f34
      Committed by Michal Hocko
      Commit c48a11c7 ("netvm: propagate page->pfmemalloc to skb") added
      checks for page->pfmemalloc to __skb_fill_page_desc():
      
              if (page->pfmemalloc && !page->mapping)
                      skb->pfmemalloc = true;
      
      It assumes page->mapping == NULL implies that page->pfmemalloc can be
      trusted.  However, __delete_from_page_cache() can set page->mapping
      to NULL and leave the page->index value alone.  Because the two fields
      share a union, a non-zero page->index will be interpreted as a true
      page->pfmemalloc.

      So the assumption is invalid if the networking code can see such a page.
      And it seems it can.  We have encountered this with an NFS over loopback
      setup when such a page is attached to a new skbuf.  There is no copying
      going on in this case, so the page confuses __skb_fill_page_desc(), which
      interprets the index as the pfmemalloc flag.  The network stack drops
      packets that have been allocated using the reserves unless they are to
      be queued on sockets handling the swapping, which is the case here, and
      that leads to hangs when the NFS client waits for a response from the
      server which has been dropped and thus never arrives.
      
      The struct page is already heavily packed so rather than finding another
      hole to put the flag in, let's do a trick instead.  We can reuse the
      index again but define pfmemalloc as an impossible value (-1UL).  This
      is the page index so it should never hold a value that large.  Replace
      all direct users of page->pfmemalloc with page_is_pfmemalloc(), which
      will hide this nastiness from unspoiled eyes.
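
      A minimal userspace sketch of the sentinel trick follows.  It is not
      the kernel code: struct page_sketch is a simplified stand-in for struct
      page, and the set_page_pfmemalloc() and clear_page_pfmemalloc() helper
      names are assumptions added for illustration.  Only page_is_pfmemalloc
      is named in the message above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct page; only the fields the trick needs. */
struct page_sketch {
	void *mapping;
	unsigned long index;
};

/* A real page index should never be this large, so it can act as a marker. */
#define PFMEMALLOC_INDEX (-1UL)

/* True only when the index holds the reserved marker value. */
static inline bool page_is_pfmemalloc(const struct page_sketch *page)
{
	return page->index == PFMEMALLOC_INDEX;
}

/* Assumed helper: mark a freshly allocated page as coming from reserves. */
static inline void set_page_pfmemalloc(struct page_sketch *page)
{
	page->index = PFMEMALLOC_INDEX;
}

/* Assumed helper: drop the marker by resetting the index. */
static inline void clear_page_pfmemalloc(struct page_sketch *page)
{
	page->index = 0;
}
```

      With this encoding, a stale but ordinary index value left behind after
      the page leaves the page cache no longer reads as a set flag; only the
      exact marker value does.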
      
      Obviously the information will get lost if somebody wants to use
      page->index, but that was the case before, and the original code
      expected the information to be persisted somewhere else if it is
      really needed (e.g. as SLAB and SLUB do).
      
      [akpm@linux-foundation.org: fix blooper in slub]
      Fixes: c48a11c7 ("netvm: propagate page->pfmemalloc to skb")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Debugged-by: Vlastimil Babka <vbabka@suse.com>
      Debugged-by: Jiri Bohac <jbohac@suse.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[3.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f064f34
  4. 15 Aug, 2015 6 commits
  5. 14 Aug, 2015 1 commit
  6. 07 Aug, 2015 13 commits
  7. 05 Aug, 2015 1 commit
    • M
      mm, vmscan: Do not wait for page writeback for GFP_NOFS allocations · ecf5fc6e
      Michal Hocko committed
      Nikolay has reported a hang when a memcg reclaim got stuck with the
      following backtrace:
      
      PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
        #0 __schedule at ffffffff815ab152
        #1 schedule at ffffffff815ab76e
        #2 schedule_timeout at ffffffff815ae5e5
        #3 io_schedule_timeout at ffffffff815aad6a
        #4 bit_wait_io at ffffffff815abfc6
        #5 __wait_on_bit at ffffffff815abda5
        #6 wait_on_page_bit at ffffffff8111fd4f
        #7 shrink_page_list at ffffffff81135445
        #8 shrink_inactive_list at ffffffff81135845
        #9 shrink_lruvec at ffffffff81135ead
       #10 shrink_zone at ffffffff811360c3
       #11 shrink_zones at ffffffff81136eff
       #12 do_try_to_free_pages at ffffffff8113712f
       #13 try_to_free_mem_cgroup_pages at ffffffff811372be
       #14 try_charge at ffffffff81189423
       #15 mem_cgroup_try_charge at ffffffff8118c6f5
       #16 __add_to_page_cache_locked at ffffffff8112137d
       #17 add_to_page_cache_lru at ffffffff81121618
       #18 pagecache_get_page at ffffffff8112170b
       #19 grow_dev_page at ffffffff811c8297
       #20 __getblk_slow at ffffffff811c91d6
       #21 __getblk_gfp at ffffffff811c92c1
       #22 ext4_ext_grow_indepth at ffffffff8124565c
       #23 ext4_ext_create_new_leaf at ffffffff81246ca8
       #24 ext4_ext_insert_extent at ffffffff81246f09
       #25 ext4_ext_map_blocks at ffffffff8124a848
       #26 ext4_map_blocks at ffffffff8121a5b7
       #27 mpage_map_one_extent at ffffffff8121b1fa
       #28 mpage_map_and_submit_extent at ffffffff8121f07b
       #29 ext4_writepages at ffffffff8121f6d5
       #30 do_writepages at ffffffff8112c490
       #31 __filemap_fdatawrite_range at ffffffff81120199
       #32 filemap_flush at ffffffff8112041c
       #33 ext4_alloc_da_blocks at ffffffff81219da1
       #34 ext4_rename at ffffffff81229b91
       #35 ext4_rename2 at ffffffff81229e32
       #36 vfs_rename at ffffffff811a08a5
       #37 SYSC_renameat2 at ffffffff811a3ffc
       #38 sys_renameat2 at ffffffff811a408e
       #39 sys_rename at ffffffff8119e51e
       #40 system_call_fastpath at ffffffff815afa89
      
      Dave Chinner has correctly pointed out that this is a deadlock in the
      reclaim code because ext4 doesn't submit pages which are marked with
      PG_writeback right away.
      
      The heuristic was introduced by commit e62e384e ("memcg: prevent OOM
      with too many dirty pages") and it was applied only when may_enter_fs
      was specified.  The code was changed by c3b94f44 ("memcg: further
      prevent OOM with too many dirty pages"), which removed the __GFP_FS
      restriction on the reasoning that we do not get into the fs code.
      Apparently this is not sufficient, because the fs doesn't necessarily
      submit pages marked PG_writeback for IO right away.
      
      ext4_bio_write_page calls io_submit_add_bh, but that doesn't
      necessarily submit the bio.  Instead it tries to map more pages into
      the bio, and mpage_map_one_extent might trigger a memcg charge, which
      might end up waiting on a page that is marked PG_writeback but hasn't
      been submitted yet.  We would then end up waiting for something that
      never finishes.
      
      Fix this issue by replacing the __GFP_IO check with a may_enter_fs
      check (for case 2) before we go to wait on the writeback.  The page
      fault path, which is the only path that triggers the memcg OOM killer
      since 3.12, shouldn't require GFP_NOFS, so we shouldn't reintroduce the
      premature OOM killer issue which was originally addressed by the
      heuristic.
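
      The effect of the change can be modeled with a small decision function.
      This is a sketch under stated assumptions, not the code in
      shrink_page_list: struct reclaim_sketch, enum wb_action and both
      function names are hypothetical, the real function handles further
      cases that are omitted here, and may_io / may_enter_fs are passed in
      as plain booleans instead of being derived from a gfp mask.

```c
#include <stdbool.h>

/* Hypothetical outcome for a page that is still under writeback. */
enum wb_action {
	WB_SKIP,  /* mark the page for reclaim later and move on */
	WB_WAIT   /* block until writeback of the page completes */
};

/* Hypothetical, reduced view of the reclaim context. */
struct reclaim_sketch {
	bool global_reclaim;  /* false models memcg reclaim */
	bool may_io;          /* models __GFP_IO being allowed */
	bool may_enter_fs;    /* models the may_enter_fs condition */
};

/* Before the fix: waiting was gated on IO being allowed. */
static enum wb_action wb_action_before(const struct reclaim_sketch *sc,
				       bool page_marked_reclaim)
{
	if (sc->global_reclaim || !page_marked_reclaim || !sc->may_io)
		return WB_SKIP;
	return WB_WAIT;
}

/* After the fix: waiting is gated on being allowed to enter the fs. */
static enum wb_action wb_action_after(const struct reclaim_sketch *sc,
				      bool page_marked_reclaim)
{
	if (sc->global_reclaim || !page_marked_reclaim || !sc->may_enter_fs)
		return WB_SKIP;
	return WB_WAIT;
}
```

      In a GFP_NOFS-like context (IO allowed, fs entry not allowed) during
      memcg reclaim, the old condition would wait on the page while the new
      one skips it, which is the behavior the fix is after.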
      
      According to David Chinner, xfs has been doing a similar thing since
      2.6.15, so ext4 is not the only affected filesystem.  Moreover, he notes:
      
      : For example: IO completion might require unwritten extent conversion
      : which executes filesystem transactions and GFP_NOFS allocations. The
      : writeback flag on the pages can not be cleared until unwritten
      : extent conversion completes. Hence memory reclaim cannot wait on
      : page writeback to complete in GFP_NOFS context because it is not
      : safe to do so, memcg reclaim or otherwise.
      
      Cc: stable@vger.kernel.org # 3.9+
      [tytso@mit.edu: corrected the control flow]
      Fixes: c3b94f44 ("memcg: further prevent OOM with too many dirty pages")
      Reported-by: Nikolay Borisov <kernel@kyup.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ecf5fc6e