1. 01 Aug 2012 (1 commit)
    • mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages · 072bb0aa
      Committed by Mel Gorman
      When a user or administrator requires swap for their application, they
      create a swap partition or file, format it with mkswap and activate it
      with swapon.  Swap over the network is considered as an option in diskless
      systems.  The two likely scenarios are blade servers used as part of a
      cluster, where the form factor or maintenance costs do not allow the use
      of disks, and thin clients.
      
      The Linux Terminal Server Project recommends the use of the Network Block
      Device (NBD) for swap according to the manual at
      https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
      There is also documentation and tutorials on how to set up swap over NBD
      at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP,
      and the nbd-client documentation also covers the use of NBD as swap.
      Despite this, the fact is that a machine using NBD for swap can deadlock
      within minutes if swap is used intensively.  This patch series addresses
      the problem.
      
      The core issue is that network block devices do not use mempools like
      normal block devices do.  As the host cannot control where they receive
      packets from, they cannot reliably work out in advance how much memory
      they might need.  Some years ago, Peter Zijlstra developed a series of
      patches that supported swap over NFS, which at least one distribution is
      carrying within its kernels.  This patch series borrows very heavily
      from Peter's work to support swapping over NBD as a pre-requisite to
      supporting swap-over-NFS.  The bulk of the complexity is concerned with
      preserving memory that is allocated from the PFMEMALLOC reserves for use
      by the network layer which is needed for both NBD and NFS.
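
      For reference, the mempool idea that normal block devices rely on boils
      down to a small pre-allocated reserve that is consulted only when the
      regular allocator fails.  A minimal userspace sketch of that idea follows;
      the toy_mempool type and names are invented for illustration and are not
      the kernel's mempool API.

      #include <stdlib.h>

      /* Toy mempool: a fixed stack of pre-allocated elements kept as a
       * fallback for when the regular allocator fails under pressure. */
      struct toy_mempool {
          void  *reserve[8];
          int    nr_free;
          size_t elem_size;
      };

      int toy_mempool_init(struct toy_mempool *p, size_t elem_size)
      {
          p->elem_size = elem_size;
          for (p->nr_free = 0; p->nr_free < 8; p->nr_free++) {
              p->reserve[p->nr_free] = malloc(elem_size);
              if (!p->reserve[p->nr_free])
                  return -1;
          }
          return 0;
      }

      /* Try the normal allocator first; dip into the reserve only when it
       * fails.  A block driver can size the reserve for its worst case; a
       * network device cannot, which is the problem described above. */
      void *toy_mempool_alloc(struct toy_mempool *p)
      {
          void *obj = malloc(p->elem_size);
          if (obj)
              return obj;
          return p->nr_free ? p->reserve[--p->nr_free] : NULL;
      }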
      
      Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
      	preserve access to pages allocated under low memory situations
      	to callers that are freeing memory.
      
      Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
      
      Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
      	reserves without setting PFMEMALLOC.
      
      Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
      	for later use by network packet processing.
      
      Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
      
      Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
      
      Patches 7-12 allow network processing to use PFMEMALLOC reserves when
      	the socket has been marked as being used by the VM to clean pages. If
      	packets are received and stored in pages that were allocated under
      	low-memory situations and are unrelated to the VM, the packets
      	are dropped.
      
      	Patch 11 reintroduces __skb_alloc_page which the networking
      	folk may object to but is needed in some cases to propagate
      	pfmemalloc from a newly allocated page to an skb. If there is a
      	strong objection, this patch can be dropped with the impact being
      	that swap-over-network will be slower in some cases but it should
      	not fail.
      
      Patch 13 is a micro-optimisation to avoid a function call in the
      	common case.
      
      Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
      	PFMEMALLOC if necessary.
      
      Patch 15 notes that it is still possible for the PFMEMALLOC reserve
      	to be depleted. To prevent this, direct reclaimers get throttled on
      	a waitqueue if 50% of the PFMEMALLOC reserves are depleted.  It is
      	expected that kswapd and the direct reclaimers already running
      	will clean enough pages for the low watermark to be reached and
      	the throttled processes are woken up.
      
      Patch 16 adds a statistic to track how often processes get throttled
      
      Some basic performance testing was run using kernel builds, netperf on
      loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
      sysbench.  Each of them was expected to use the sl*b allocators
      reasonably heavily, but there did not appear to be any significant
      performance variance.
      
      For testing swap-over-NBD, a machine was booted with 2G of RAM with a
      swapfile backed by NBD.  8*NUM_CPU processes were started that create
      anonymous memory mappings and read them linearly in a loop.  The total
      size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
      memory pressure.
      
      With SLUB, the machine locks up within minutes without the patches and
      runs to completion with them applied.  With SLAB, the story is different,
      as an unpatched kernel runs to completion; however, the patched kernel
      completed the test 45% faster.
      
      MICRO
                                               3.5.0-rc2  3.5.0-rc2
                                                 vanilla    swapnbd
      Unrecognised test vmscan-anon-mmap-write
      MMTests Statistics: duration
      Sys Time Running Test (seconds)              197.80     173.07
      User+Sys Time Running Test (seconds)         206.96     182.03
      Total Elapsed Time (seconds)                3240.70    1762.09
      
      This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
      
      Allocations of pages below the min watermark run a risk of the machine
      hanging due to a lack of memory.  To prevent this, only callers who have
      PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
      allowed to allocate with ALLOC_NO_WATERMARKS.  Once they are allocated to
      a slab though, nothing prevents other callers consuming free objects
      within those slabs.  This patch limits access to slab pages that were
      allocated from the PFMEMALLOC reserves.
      
      When this patch is applied, pages allocated from below the low watermark
      are returned with page->pfmemalloc set and it is up to the caller to
      determine how the page should be protected.  SLAB restricts access to any
      page with page->pfmemalloc set to callers which are known to be able to
      access the PFMEMALLOC reserve.  If one is not available, an attempt is
      made to allocate a new page rather than use a reserve.  SLUB is a bit more
      relaxed in that it only records if the current per-CPU page was allocated
      from PFMEMALLOC reserve and uses another partial slab if the caller does
      not have the necessary GFP or process flags.  This was found to be
      sufficient in tests to avoid hangs due to SLUB generally maintaining
      smaller lists than SLAB.
      
      In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
      a slab allocation even though free objects are available because they are
      being preserved for callers that are freeing pages.
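
      The decision being described can be modelled in a few lines of plain C.
      This is a hedged userspace sketch of the gating rule only; the flag
      values, struct layouts and function names are invented for illustration
      and are not the slab code itself.  Objects backed by a pfmemalloc page
      are handed out only to callers that may themselves touch the reserves;
      everyone else is pointed at a fresh page instead.

      #include <stdbool.h>

      /* Illustrative flag values, not the kernel's definitions. */
      #define PF_MEMALLOC       0x1   /* task is freeing memory for the VM */
      #define __GFP_MEMALLOC    0x2   /* explicit request for the reserves */
      #define __GFP_NOMEMALLOC  0x4   /* never touch the reserves          */

      struct toy_page { bool pfmemalloc; };     /* allocated below watermarks? */
      struct toy_task { unsigned int flags; };

      /* May this caller consume objects backed by PFMEMALLOC reserve pages? */
      bool pfmemalloc_allowed(const struct toy_task *t, unsigned int gfp)
      {
          if (gfp & __GFP_NOMEMALLOC)
              return false;
          return (gfp & __GFP_MEMALLOC) || (t->flags & PF_MEMALLOC);
      }

      /* Simplified decision for an object sitting in 'slab_page'. */
      const char *slab_alloc_decision(const struct toy_task *t, unsigned int gfp,
                                      const struct toy_page *slab_page)
      {
          if (!slab_page->pfmemalloc || pfmemalloc_allowed(t, gfp))
              return "serve an object from this slab";
          return "skip this slab and try to allocate a fresh page";
      }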
      
      [a.p.zijlstra@chello.nl: Original implementation]
      [sebastian@breakpoint.cc: Correct order of page flag clearing]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      072bb0aa
  2. 11 Jul 2012 (1 commit)
  3. 09 Jul 2012 (5 commits)
  4. 20 Jun 2012 (3 commits)
    • slub: refactoring unfreeze_partials() · 43d77867
      Committed by Joonsoo Kim
      The current implementation of unfreeze_partials() is rather complicated,
      but the benefit from it is insignificant.  In addition, the amount of code
      in the do {} while loop has a bad influence on the failure rate of
      cmpxchg_double_slab.  With the current implementation, which tests the
      status of the cpu partial slab and acquires list_lock inside the
      do {} while loop, we avoid taking list_lock and gain a little benefit
      when the front of the cpu partial slab is to be discarded, but this is a
      rare case.  And when add_partial has been performed and
      cmpxchg_double_slab then fails, remove_partial has to be called to back
      it out, case by case.

      I think these are disadvantages of the current implementation, so
      refactor unfreeze_partials().
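
      The effect being exploited is generic and can be shown with a plain C11
      compare-and-swap retry loop; nothing here is slub-specific and the
      function and variable names are invented.  Everything executed between
      the load and the exchange widens the window in which another CPU can
      change the value and force a retry, so trimming the loop body directly
      lowers the redo count.

      #include <stdatomic.h>

      /* Generic retry loop: the work done between the load and the
       * compare-and-swap is the window in which a concurrent update can
       * invalidate 'old' and force another iteration. */
      int update_with_retries(_Atomic long *v, long (*compute)(long))
      {
          long old, new;
          int retries = 0;

          do {
              old = atomic_load(v);
              new = compute(old);   /* keep this minimal: every extra cycle
                                       here raises the chance of a redo */
              retries++;
          } while (!atomic_compare_exchange_weak(v, &old, new));

          return retries;
      }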
      
      Minimizing the code in the do {} while loop reduces the failure rate of
      cmpxchg_double_slab.  Below is the output of 'slabinfo -r kmalloc-256'
      when './perf stat -r 33 hackbench 50 process 4000 > /dev/null' is done.
      
      ** before **
      Cmpxchg_double Looping
      ------------------------
      Locked Cmpxchg Double redos   182685
      Unlocked Cmpxchg Double redos 0
      
      ** after **
      Cmpxchg_double Looping
      ------------------------
      Locked Cmpxchg Double redos   177995
      Unlocked Cmpxchg Double redos 1
      
      We can see that the cmpxchg_double_slab failure rate is slightly improved.

      Below is the output of './perf stat -r 30 hackbench 50 process 4000 > /dev/null'.
      
      ** before **
       Performance counter stats for './hackbench 50 process 4000' (30 runs):
      
           108517.190463 task-clock                #    7.926 CPUs utilized            ( +-  0.24% )
               2,919,550 context-switches          #    0.027 M/sec                    ( +-  3.07% )
                 100,774 CPU-migrations            #    0.929 K/sec                    ( +-  4.72% )
                 124,201 page-faults               #    0.001 M/sec                    ( +-  0.15% )
         401,500,234,387 cycles                    #    3.700 GHz                      ( +-  0.24% )
         <not supported> stalled-cycles-frontend
         <not supported> stalled-cycles-backend
         250,576,913,354 instructions              #    0.62  insns per cycle          ( +-  0.13% )
          45,934,956,860 branches                  #  423.297 M/sec                    ( +-  0.14% )
             188,219,787 branch-misses             #    0.41% of all branches          ( +-  0.56% )
      
            13.691837307 seconds time elapsed                                          ( +-  0.24% )
      
      ** after **
       Performance counter stats for './hackbench 50 process 4000' (30 runs):
      
           107784.479767 task-clock                #    7.928 CPUs utilized            ( +-  0.22% )
               2,834,781 context-switches          #    0.026 M/sec                    ( +-  2.33% )
                  93,083 CPU-migrations            #    0.864 K/sec                    ( +-  3.45% )
                 123,967 page-faults               #    0.001 M/sec                    ( +-  0.15% )
         398,781,421,836 cycles                    #    3.700 GHz                      ( +-  0.22% )
         <not supported> stalled-cycles-frontend
         <not supported> stalled-cycles-backend
         250,189,160,419 instructions              #    0.63  insns per cycle          ( +-  0.09% )
          45,855,370,128 branches                  #  425.436 M/sec                    ( +-  0.10% )
             169,881,248 branch-misses             #    0.37% of all branches          ( +-  0.43% )
      
            13.596272341 seconds time elapsed                                          ( +-  0.22% )
      
      No regression is found; rather, we can see a slightly better result.
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      43d77867
    • slub: use __cmpxchg_double_slab() at interrupt disabled place · d24ac77f
      Committed by Joonsoo Kim
      get_freelist() and unfreeze_partials() are only called with interrupts
      disabled, so __cmpxchg_double_slab() is suitable.
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      d24ac77f
    • slab/mempolicy: always use local policy from interrupt context · e7b691b0
      Committed by Andi Kleen
      slab_node() could access current->mempolicy from interrupt context.
      However there's a race condition during exit where the mempolicy
      is first freed and then the pointer zeroed.
      
      Using this from interrupts seems bogus anyway.  The interrupt
      will interrupt a random process and therefore get a random
      mempolicy.  Many times this will be idle's, which no one can change.

      Just disable this here and always use the local node for slab
      allocations from interrupts.  I also cleaned up the callers of
      slab_node() a bit, which always passed the same argument.
      
      I believe the original mempolicy code did that in fact,
      so it's likely a regression.
      
      v2: send version with correct logic
      v3: simplify. fix typo.
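
      A stripped-down model of the rule being introduced follows.
      in_interrupt() and numa_node_id() are real kernel helpers; the stubs and
      the struct below are invented so the sketch stands alone.  From interrupt
      context, never dereference the current task's mempolicy, just fall back
      to the local node.

      /* Toy stand-ins so the sketch compiles outside the kernel. */
      static int irq_nesting;
      int in_interrupt(void)  { return irq_nesting > 0; }
      int numa_node_id(void)  { return 0; }

      struct toy_mempolicy { int preferred_node; };

      /* Pick the node a slab allocation should come from.  From interrupt
       * context the "current" mempolicy belongs to whatever task happened to
       * be running and may be mid-free in a racing exit path, so ignore it
       * and use the local node. */
      int slab_node_model(const struct toy_mempolicy *current_policy)
      {
          if (in_interrupt() || !current_policy)
              return numa_node_id();
          return current_policy->preferred_node;
      }
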
      Reported-by: Arun Sharma <asharma@fb.com>
      Cc: penberg@kernel.org
      Cc: cl@linux.com
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      [tdmackey@twitter.com: Rework control flow based on feedback from
      cl@linux.com, fix logic, and cleanup current task_struct reference]
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: David Mackey <tdmackey@twitter.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      e7b691b0
  5. 14 Jun 2012 (1 commit)
    • mm, sl[aou]b: Extract common fields from struct kmem_cache · 3b0efdfa
      Committed by Christoph Lameter
      Define a struct that describes common fields used in all slab allocators.
      A slab allocator either uses the common definition (like SLOB) or is
      required to provide members of kmem_cache with the definition given.
      
      After that it will be possible to share code that
      only operates on those fields of kmem_cache.
      
      The patch basically takes the slob definition of kmem_cache and
      uses its field names for the other allocators.
      
      It also standardizes the names used for basic object lengths in
      allocators:
      
      object_size	Struct size specified at kmem_cache_create. Basically
      		the payload expected to be used by the subsystem.
      
      size		The size of memory allocated for each object. This size
      		is larger than object_size and includes padding, alignment
      		and extra metadata for each object (f.e. for debugging
      		and rcu).
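
      The shape of such a shared definition is roughly the following.  This is
      an abridged, paraphrased sketch based on the field names described above,
      not the exact kernel structure; the real one carries more members, e.g.
      list linkage and a reference count.

      /* Common fields every slab allocator is expected to provide, using
       * the naming standardized above.  Illustrative only. */
      struct kmem_cache_common {
          unsigned int object_size;   /* payload size passed to kmem_cache_create() */
          unsigned int size;          /* object_size plus padding, alignment and
                                         per-object metadata (debugging, rcu, ...)  */
          unsigned int align;         /* calculated alignment                        */
          unsigned long flags;        /* cache creation flags                        */
          const char *name;           /* cache name, e.g. shown under /sys/kernel/slab */
          void (*ctor)(void *);       /* optional constructor for new objects        */
      };
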
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      3b0efdfa
  6. 01 Jun 2012 (9 commits)
  7. 18 May 2012 (3 commits)
    • slub: use __SetPageSlab function to set PG_slab flag · c03f94cc
      Committed by Joonsoo Kim
      To set a page flag, using the SetPageXXXX() and __SetPageXXXX() helpers
      is more understandable and maintainable, so change the code to use them.
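
      For context, the difference between the two helpers is the usual atomic
      versus double-underscore page-flag convention, modelled here with C11
      atomics; the names are invented and this is not the kernel
      implementation.  The plain setter is an atomic read-modify-write, while
      the __ variant is a cheaper plain store that is only safe while the page
      is exclusively owned, e.g. right after it has been allocated.

      #include <stdatomic.h>

      #define PG_SLAB_BIT (1UL << 7)      /* illustrative bit, not the kernel's */

      struct toy_page { _Atomic unsigned long flags; };

      /* Atomic variant: safe against concurrent flag updates. */
      void SetPageSlab_model(struct toy_page *p)
      {
          atomic_fetch_or(&p->flags, PG_SLAB_BIT);
      }

      /* Non-atomic variant: a plain load/modify/store, valid only while no
       * one else can touch the flags word concurrently. */
      void __SetPageSlab_model(struct toy_page *p)
      {
          unsigned long f = atomic_load_explicit(&p->flags, memory_order_relaxed);
          atomic_store_explicit(&p->flags, f | PG_SLAB_BIT, memory_order_relaxed);
      }
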
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      c03f94cc
    • slub: fix a memory leak in get_partial_node() · 02d7633f
      Committed by Joonsoo Kim
      In the case below,

      1. acquire a slab for the cpu partial list
      2. an object is freed to it by a remote cpu
      3. page->freelist = t

      a memory leak occurs.

      Change acquire_slab() not to zap the freelist when it works for the cpu
      partial list.  I think this is a sufficient solution for fixing the
      memory leak.
      
      Below is the output of 'slabinfo -r kmalloc-256'
      after running './perf stat -r 30 hackbench 50 process 4000 > /dev/null'.
      
      ***Vanilla***
      Sizes (bytes)     Slabs              Debug                Memory
      ------------------------------------------------------------------------
      Object :     256  Total  :     468   Sanity Checks : Off  Total: 3833856
      SlabObj:     256  Full   :     111   Redzoning     : Off  Used : 2004992
      SlabSiz:    8192  Partial:     302   Poisoning     : Off  Loss : 1828864
      Loss   :       0  CpuSlab:      55   Tracking      : Off  Lalig:       0
      Align  :       8  Objects:      32   Tracing       : Off  Lpadd:       0
      
      ***Patched***
      Sizes (bytes)     Slabs              Debug                Memory
      ------------------------------------------------------------------------
      Object :     256  Total  :     300   Sanity Checks : Off  Total: 2457600
      SlabObj:     256  Full   :     204   Redzoning     : Off  Used : 2348800
      SlabSiz:    8192  Partial:      33   Poisoning     : Off  Loss :  108800
      Loss   :       0  CpuSlab:      63   Tracking      : Off  Lalig:       0
      Align  :       8  Objects:      32   Tracing       : Off  Lpadd:       0
      
      The Total and Loss numbers show the impact of this patch.
      
      Cc: <stable@vger.kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      02d7633f
    • slub: missing test for partial pages flush work in flush_all() · 02e1a9cd
      Committed by majianpeng
      I found some kernel messages such as:
      
          SLUB raid5-md127: kmem_cache_destroy called for cache that still has objects.
          Pid: 6143, comm: mdadm Tainted: G           O 3.4.0-rc6+        #75
          Call Trace:
          kmem_cache_destroy+0x328/0x400
          free_conf+0x2d/0xf0 [raid456]
          stop+0x41/0x60 [raid456]
          md_stop+0x1a/0x60 [md_mod]
          do_md_stop+0x74/0x470 [md_mod]
          md_ioctl+0xff/0x11f0 [md_mod]
          blkdev_ioctl+0xd8/0x7a0
          block_ioctl+0x3b/0x40
          do_vfs_ioctl+0x96/0x560
          sys_ioctl+0x91/0xa0
          system_call_fastpath+0x16/0x1b
      
      Then using kmemleak I found these messages:
      
          unreferenced object 0xffff8800b6db7380 (size 112):
            comm "mdadm", pid 5783, jiffies 4294810749 (age 90.589s)
            hex dump (first 32 bytes):
              01 01 db b6 ad 4e ad de ff ff ff ff ff ff ff ff  .....N..........
              ff ff ff ff ff ff ff ff 98 40 4a 82 ff ff ff ff  .........@J.....
            backtrace:
              kmemleak_alloc+0x21/0x50
              kmem_cache_alloc+0xeb/0x1b0
              kmem_cache_open+0x2f1/0x430
              kmem_cache_create+0x158/0x320
              setup_conf+0x649/0x770 [raid456]
              run+0x68b/0x840 [raid456]
              md_run+0x529/0x940 [md_mod]
              do_md_run+0x18/0xc0 [md_mod]
              md_ioctl+0xba8/0x11f0 [md_mod]
              blkdev_ioctl+0xd8/0x7a0
              block_ioctl+0x3b/0x40
              do_vfs_ioctl+0x96/0x560
              sys_ioctl+0x91/0xa0
              system_call_fastpath+0x16/0x1b
      
      This bug was introduced by commit a8364d55 ("slub: only IPI CPUs that
      have per cpu obj to flush"), which did not include checks for per cpu
      partial pages being present on a cpu.
      Signed-off-by: majianpeng <majianpeng@gmail.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Tested-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      02e1a9cd
  8. 16 May 2012 (2 commits)
  9. 08 May 2012 (1 commit)
  10. 29 Mar 2012 (1 commit)
    • slub: only IPI CPUs that have per cpu obj to flush · a8364d55
      Committed by Gilad Ben-Yossef
      flush_all() is called for each kmem_cache_destroy().  So every cache being
      destroyed dynamically ends up sending an IPI to each CPU in the system,
      regardless of whether the cache has ever been used there.
      
      For example, if you close the InfiniBand ipath driver char device file,
      the close file op calls kmem_cache_destroy().  So running some InfiniBand
      config tool on a single CPU dedicated to system tasks might interrupt the
      other 127 CPUs dedicated to some CPU-intensive or latency-sensitive task.
      
      I suspect there is a good chance that every line in the output of "git
      grep kmem_cache_destroy linux/ | grep '\->'" has a similar scenario.
      
      This patch attempts to rectify this issue by sending an IPI to flush the
      per cpu objects back to the free lists only to CPUs that seem to have such
      objects.
      
      The check of which CPUs to IPI is racy, but we don't care, since asking a
      CPU without per cpu objects to flush does no damage, and as far as I can
      tell flush_all() by itself is racy against allocs on remote CPUs anyway,
      so if you required flush_all() to be deterministic, you had to arrange for
      locking regardless.
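
      Conceptually the change replaces an unconditional "flush on every CPU"
      with a "flush only where a predicate holds" pattern.  A hedged sketch of
      that pattern in plain C follows; the per-cpu structure, the iteration and
      the names are stand-ins, not the kernel API.  Note that the predicate
      also has to consider per-cpu partial slabs, which is exactly the omission
      fixed by the flush_all() patch further up this log.

      #include <stdbool.h>

      #define NR_CPUS 128

      /* Stand-in for the per-cpu slab state. */
      struct toy_cpu_slab {
          void *page;      /* currently active cpu slab       */
          void *partial;   /* per-cpu partial slab list head  */
      };

      struct toy_cpu_slab toy_cpu_slab[NR_CPUS];

      /* Does this CPU hold anything worth flushing?  The check is racy,
       * but a spurious flush request merely flushes nothing, so that is fine. */
      bool has_cpu_slab_model(int cpu)
      {
          return toy_cpu_slab[cpu].page || toy_cpu_slab[cpu].partial;
      }

      /* Send the flush request only where the predicate says there is work;
       * the kernel sends an IPI at this point. */
      void flush_all_cond_model(void (*flush_fn)(int cpu))
      {
          for (int cpu = 0; cpu < NR_CPUS; cpu++)
              if (has_cpu_slab_model(cpu))
                  flush_fn(cpu);
      }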
      
      Without this patch the following artificial test case:
      
      $ cd /sys/kernel/slab
      $ for DIR in *; do cat $DIR/alloc_calls > /dev/null; done
      
      produces 166 IPIs on a cpuset-isolated CPU.  With it, it produces none.
      
      The memory-allocation-failure code path for the CPUMASK_OFFSTACK=y
      config was tested using the fault injection framework.
      Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Sasha Levin <levinsasha928@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Michal Nazarewicz <mina86@mina86.org>
      Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com>
      Cc: Milton Miller <miltonm@bga.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8364d55
  11. 22 Mar 2012 (1 commit)
    • cpuset: mm: reduce large amounts of memory barrier related damage v3 · cc9a6c87
      Committed by Mel Gorman
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") wins a super prize for the largest number of
      memory barriers entered into fast paths for one commit.
      
      [get|put]_mems_allowed is incredibly heavy with pairs of full memory
      barriers inserted into a number of hot paths.  This was detected while
      investigating at large page allocator slowdown introduced some time
      after 2.6.32.  The largest portion of this overhead was shown by
      oprofile to be at an mfence introduced by this commit into the page
      allocator hot path.
      
      For extra style points, the commit introduced the use of yield() in an
      implementation of what looks like a spinning mutex.
      
      This patch replaces the full memory barriers on both read and write
      sides with a sequence counter with just read barriers on the fast path
      side.  This is much cheaper on some architectures, including x86.  The
      main bulk of the patch is the retry logic if the nodemask changes in a
      manner that can cause a false failure.
      
      While updating the nodemask, a check is made to see if a false failure
      is a risk.  If it is, the sequence number gets bumped and parallel
      allocators will briefly stall while the nodemask update takes place.
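
      The read side of such a scheme is the classic sequence-counter retry
      loop, sketched here with C11 atomics.  This is a simplified model, not
      the kernel's seqcount code; the mask is a single atomic word here purely
      to keep the model free of data races, whereas the real nodemask can span
      several words, which is why the counter is needed at all.

      #include <stdatomic.h>

      _Atomic unsigned int  mems_seq;       /* odd while an update is in flight */
      _Atomic unsigned long mems_allowed;   /* toy one-word nodemask            */

      /* Writer: bracket the update with two counter increments. */
      void update_nodemask(unsigned long new_mask)
      {
          atomic_fetch_add(&mems_seq, 1);           /* now odd: update started */
          atomic_store(&mems_allowed, new_mask);
          atomic_fetch_add(&mems_seq, 1);           /* even again: update done */
      }

      /* Reader: cheap on the fast path, retries only if a writer ran
       * concurrently, instead of paying for full barriers on every call. */
      unsigned long read_nodemask(void)
      {
          unsigned int  seq;
          unsigned long mask;

          do {
              seq  = atomic_load(&mems_seq);
              mask = atomic_load(&mems_allowed);
          } while ((seq & 1) || atomic_load(&mems_seq) != seq);

          return mask;
      }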
      
      In a page fault test microbenchmark, oprofile samples from
      __alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
      actual results were
      
                                   3.3.0-rc3          3.3.0-rc3
                                   rc3-vanilla        nobarrier-v2r1
          Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
          Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
          Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
          Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
          Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
          Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
          Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
          Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
          Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
          Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
          Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
          Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
          Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
          Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
          Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
          MMTests Statistics: duration
          Sys Time Running Test (seconds)             135.68    132.17
          User+Sys Time Running Test (seconds)         164.2    160.13
          Total Elapsed Time (seconds)                123.46    120.87
      
      The overall improvement is small but the System CPU time is much
      improved and roughly in correlation to what oprofile reported (these
      performance figures are without profiling so skew is expected).  The
      actual number of page faults is noticeably improved.
      
      For benchmarks like kernel builds, the overall benefit is marginal but
      the system CPU time is slightly reduced.
      
      To test the actual bug the commit fixed I opened two terminals.  The
      first ran within a cpuset and continually ran a small program that
      faulted 100M of anonymous data.  In a second window, the nodemask of the
      cpuset was continually randomised in a loop.
      
      Without the commit, the program would fail every so often (usually
      within 10 seconds) and obviously with the commit everything worked fine.
      With this patch applied, it also worked fine so the fix should be
      functionally equivalent.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc9a6c87
  12. 18 Feb 2012 (1 commit)
  13. 10 Feb 2012 (1 commit)
  14. 06 Feb 2012 (1 commit)
  15. 25 Jan 2012 (1 commit)
    • slub: prefetch next freelist pointer in slab_alloc() · 0ad9500e
      Committed by Eric Dumazet
      Recycling a page is a problem, since the freelist link chain is hot on the
      cpu(s) which freed the objects, and possibly very cold on the cpu
      currently owning the slab.

      Adding a prefetch of the cache line containing the pointer to the next
      object in slab_alloc() helps a lot in many workloads, in particular on
      asymmetric ones (allocations done on one cpu, frees on other cpus).  The
      added cost is only three machine instructions.
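
      The technique itself is generic: while handing out the current object,
      touch the cache line holding the next object's freelist pointer so the
      following allocation does not stall on it.  A simplified model using the
      GCC/Clang builtin follows; the freelist layout and names below are
      assumptions for illustration, and the kernel uses its own prefetch()
      wrapper inside slab_alloc().

      /* Toy freelist: the first word of each free object points at the next. */
      void *freelist_pop_with_prefetch(void **freelist_head)
      {
          void *object = *freelist_head;

          if (object) {
              void *next = *(void **)object;      /* next free object        */
              *freelist_head = next;
              if (next)
                  __builtin_prefetch(next);       /* warm the line the next
                                                     allocation dereferences */
          }
          return object;
      }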
      
      Examples on my dual socket quad core ht machine (Intel CPU E5540
      @2.53GHz) (16 logical cpus, 2 memory nodes), 64bit kernel.
      
      Before patch :
      
      # perf stat -r 32 hackbench 50 process 4000 >/dev/null
      
       Performance counter stats for 'hackbench 50 process 4000' (32 runs):
      
           327577,471718 task-clock                #   15,821 CPUs utilized            ( +-  0,64% )
              28 866 491 context-switches          #    0,088 M/sec                    ( +-  1,80% )
               1 506 929 CPU-migrations            #    0,005 M/sec                    ( +-  3,24% )
                 127 151 page-faults               #    0,000 M/sec                    ( +-  0,16% )
         829 399 813 448 cycles                    #    2,532 GHz                      ( +-  0,64% )
         580 664 691 740 stalled-cycles-frontend   #   70,01% frontend cycles idle     ( +-  0,71% )
         197 431 700 448 stalled-cycles-backend    #   23,80% backend  cycles idle     ( +-  1,03% )
         503 548 648 975 instructions              #    0,61  insns per cycle
                                                   #    1,15  stalled cycles per insn  ( +-  0,46% )
          95 780 068 471 branches                  #  292,389 M/sec                    ( +-  0,48% )
           1 426 407 916 branch-misses             #    1,49% of all branches          ( +-  1,35% )
      
            20,705679994 seconds time elapsed                                          ( +-  0,64% )
      
      After patch :
      
      # perf stat -r 32 hackbench 50 process 4000 >/dev/null
      
       Performance counter stats for 'hackbench 50 process 4000' (32 runs):
      
           286236,542804 task-clock                #   15,786 CPUs utilized            ( +-  1,32% )
              19 703 372 context-switches          #    0,069 M/sec                    ( +-  4,99% )
               1 658 249 CPU-migrations            #    0,006 M/sec                    ( +-  6,62% )
                 126 776 page-faults               #    0,000 M/sec                    ( +-  0,12% )
         724 636 593 213 cycles                    #    2,532 GHz                      ( +-  1,32% )
         499 320 714 837 stalled-cycles-frontend   #   68,91% frontend cycles idle     ( +-  1,47% )
         156 555 126 809 stalled-cycles-backend    #   21,60% backend  cycles idle     ( +-  2,22% )
         463 897 792 661 instructions              #    0,64  insns per cycle
                                                   #    1,08  stalled cycles per insn  ( +-  0,94% )
          87 717 352 563 branches                  #  306,451 M/sec                    ( +-  0,99% )
             941 738 280 branch-misses             #    1,07% of all branches          ( +-  3,35% )
      
            18,132070670 seconds time elapsed                                          ( +-  1,30% )
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      CC: Matt Mackall <mpm@selenic.com>
      CC: David Rientjes <rientjes@google.com>
      CC: "Alex,Shi" <alex.shi@intel.com>
      CC: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      0ad9500e
  16. 13 Jan 2012 (2 commits)
  17. 11 Jan 2012 (2 commits)
  18. 04 Jan 2012 (1 commit)
    • x86: Fix and improve cmpxchg_double{,_local}() · cdcd6298
      Committed by Jan Beulich
      Just like the per-CPU ones they had several
      problems/shortcomings:
      
      Only the first memory operand was mentioned in the asm()
      operands, and the 2x64-bit version didn't have a memory clobber
      while the 2x32-bit one did. The former allowed the compiler to
      not recognize the need to re-load the data in case it had it
      cached in some register, while the latter was overly
      destructive.
      
      The types of the local copies of the old and new values were
      incorrect (the types of the pointed-to variables should be used
      here, to make sure the respective old/new variable types are
      compatible).
      
      The __dummy/__junk variables were pointless, given that local
      copies of the inputs already existed (and can hence be used for
      discarded outputs).
      
      The 32-bit variant of cmpxchg_double_local() referenced
      cmpxchg16b_local().
      
      At once also:
      
       - change the return value type to what it really is: 'bool'
       - unify 32- and 64-bit variants
       - abstract out the common part of the 'normal' and 'local' variants
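
      For reference, the primitive being cleaned up amounts to a double-width
      compare-and-exchange that reports success as bool.  Below is a sketch
      using the compiler's 128-bit support instead of hand-written asm (64-bit
      GCC/Clang; depending on the toolchain this may need -mcx16 or -latomic to
      lower to cmpxchg16b).  It is a reference model, not the kernel macro.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdint.h>

      /* Treat two adjacent 64-bit words as one 128-bit unit and swap them
       * atomically; return true on success, like the reworked macro. */
      bool cmpxchg_double_model(_Atomic unsigned __int128 *target,
                                uint64_t old_lo, uint64_t old_hi,
                                uint64_t new_lo, uint64_t new_hi)
      {
          unsigned __int128 expected =
              ((unsigned __int128)old_hi << 64) | old_lo;
          unsigned __int128 desired  =
              ((unsigned __int128)new_hi << 64) | new_lo;

          return atomic_compare_exchange_strong(target, &expected, desired);
      }
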
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/4F01F12A020000780006A19B@nat28.tlf.novell.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cdcd6298
  19. 23 Dec 2011 (1 commit)
  20. 14 Dec 2011 (2 commits)