1. 26 9月, 2012 2 次提交
  2. 25 9月, 2012 4 次提交
  3. 19 9月, 2012 1 次提交
  4. 12 9月, 2012 1 次提交
    • M
      slab: fix the DEADLOCK issue on l3 alien lock · 947ca185
      Michael Wang 提交于
      DEADLOCK will be report while running a kernel with NUMA and LOCKDEP enabled,
      the process of this fake report is:
      
      	   kmem_cache_free()	//free obj in cachep
      	-> cache_free_alien()	//acquire cachep's l3 alien lock
      	-> __drain_alien_cache()
      	-> free_block()
      	-> slab_destroy()
      	-> kmem_cache_free()	//free slab in cachep->slabp_cache
      	-> cache_free_alien()	//acquire cachep->slabp_cache's l3 alien lock
      
      Since the cachep and cachep->slabp_cache's l3 alien are in the same lock class,
      fake report generated.
      
      This should not happen since we already have init_lock_keys() which will
      reassign the lock class for both l3 list and l3 alien.
      
      However, init_lock_keys() was invoked at a wrong position which is before we
      invoke enable_cpucache() on each cache.
      
      Since until set slab_state to be FULL, we won't invoke enable_cpucache()
      on caches to build their l3 alien while creating them, so although we invoked
      init_lock_keys(), the l3 alien lock class won't change since we don't have
      them until invoked enable_cpucache() later.
      
      This patch will invoke init_lock_keys() after we done enable_cpucache()
      instead of before to avoid the fake DEADLOCK report.
      
      Michael traced the problem back to a commit in release 3.0.0:
      
      commit 30765b92
      Author: Peter Zijlstra <peterz@infradead.org>
      Date:   Thu Jul 28 23:22:56 2011 +0200
      
          slab, lockdep: Annotate the locks before using them
      
          Fernando found we hit the regular OFF_SLAB 'recursion' before we
          annotate the locks, cure this.
      
          The relevant portion of the stack-trace:
      
          > [    0.000000]  [<c085e24f>] rt_spin_lock+0x50/0x56
          > [    0.000000]  [<c04fb406>] __cache_free+0x43/0xc3
          > [    0.000000]  [<c04fb23f>] kmem_cache_free+0x6c/0xdc
          > [    0.000000]  [<c04fb2fe>] slab_destroy+0x4f/0x53
          > [    0.000000]  [<c04fb396>] free_block+0x94/0xc1
          > [    0.000000]  [<c04fc551>] do_tune_cpucache+0x10b/0x2bb
          > [    0.000000]  [<c04fc8dc>] enable_cpucache+0x7b/0xa7
          > [    0.000000]  [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61
          > [    0.000000]  [<c0bba687>] start_kernel+0x24c/0x363
          > [    0.000000]  [<c0bba0ba>] i386_start_kernel+0xa9/0xaf
      Reported-by: NFernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
      Acked-by: NPekka Enberg <penberg@kernel.org>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
          Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptopSigned-off-by: NIngo Molnar <mingo@elte.hu>
      
      The commit moved init_lock_keys() before we build up the alien, so we
      failed to reclass it.
      
      Cc: <stable@vger.kernel.org> # 3.0+
      Acked-by: NChristoph Lameter <cl@linux.com>
      Tested-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NMichael Wang <wangyun@linux.vnet.ibm.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      947ca185
  5. 17 8月, 2012 1 次提交
  6. 16 8月, 2012 1 次提交
  7. 01 8月, 2012 3 次提交
    • M
      mm: micro-optimise slab to avoid a function call · 381760ea
      Mel Gorman 提交于
      Getting and putting objects in SLAB currently requires a function call but
      the bulk of the work is related to PFMEMALLOC reserves which are only
      consumed when network-backed storage is critical.  Use an inline function
      to determine if the function call is required.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      381760ea
    • M
      mm: introduce __GFP_MEMALLOC to allow access to emergency reserves · b37f1dd0
      Mel Gorman 提交于
      __GFP_MEMALLOC will allow the allocation to disregard the watermarks, much
      like PF_MEMALLOC.  It allows one to pass along the memalloc state in
      object related allocation flags as opposed to task related flags, such as
      sk->sk_allocation.  This removes the need for ALLOC_PFMEMALLOC as callers
      using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARK flag which is now
      enough to identify allocations related to page reclaim.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b37f1dd0
    • M
      mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages · 072bb0aa
      Mel Gorman 提交于
      When a user or administrator requires swap for their application, they
      create a swap partition and file, format it with mkswap and activate it
      with swapon.  Swap over the network is considered as an option in diskless
      systems.  The two likely scenarios are when blade servers are used as part
      of a cluster where the form factor or maintenance costs do not allow the
      use of disks and thin clients.
      
      The Linux Terminal Server Project recommends the use of the Network Block
      Device (NBD) for swap according to the manual at
      https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
      There is also documentation and tutorials on how to setup swap over NBD at
      places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
      nbd-client also documents the use of NBD as swap.  Despite this, the fact
      is that a machine using NBD for swap can deadlock within minutes if swap
      is used intensively.  This patch series addresses the problem.
      
      The core issue is that network block devices do not use mempools like
      normal block devices do.  As the host cannot control where they receive
      packets from, they cannot reliably work out in advance how much memory
      they might need.  Some years ago, Peter Zijlstra developed a series of
      patches that supported swap over an NFS that at least one distribution is
      carrying within their kernels.  This patch series borrows very heavily
      from Peter's work to support swapping over NBD as a pre-requisite to
      supporting swap-over-NFS.  The bulk of the complexity is concerned with
      preserving memory that is allocated from the PFMEMALLOC reserves for use
      by the network layer which is needed for both NBD and NFS.
      
      Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
      	preserve access to pages allocated under low memory situations
      	to callers that are freeing memory.
      
      Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
      
      Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
      	reserves without setting PFMEMALLOC.
      
      Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
      	for later use by network packet processing.
      
      Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
      
      Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
      
      Patches 7-12 allows network processing to use PFMEMALLOC reserves when
      	the socket has been marked as being used by the VM to clean pages. If
      	packets are received and stored in pages that were allocated under
      	low-memory situations and are unrelated to the VM, the packets
      	are dropped.
      
      	Patch 11 reintroduces __skb_alloc_page which the networking
      	folk may object to but is needed in some cases to propogate
      	pfmemalloc from a newly allocated page to an skb. If there is a
      	strong objection, this patch can be dropped with the impact being
      	that swap-over-network will be slower in some cases but it should
      	not fail.
      
      Patch 13 is a micro-optimisation to avoid a function call in the
      	common case.
      
      Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
      	PFMEMALLOC if necessary.
      
      Patch 15 notes that it is still possible for the PFMEMALLOC reserve
      	to be depleted. To prevent this, direct reclaimers get throttled on
      	a waitqueue if 50% of the PFMEMALLOC reserves are depleted.  It is
      	expected that kswapd and the direct reclaimers already running
      	will clean enough pages for the low watermark to be reached and
      	the throttled processes are woken up.
      
      Patch 16 adds a statistic to track how often processes get throttled
      
      Some basic performance testing was run using kernel builds, netperf on
      loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
      sysbench.  Each of them were expected to use the sl*b allocators
      reasonably heavily but there did not appear to be significant performance
      variances.
      
      For testing swap-over-NBD, a machine was booted with 2G of RAM with a
      swapfile backed by NBD.  8*NUM_CPU processes were started that create
      anonymous memory mappings and read them linearly in a loop.  The total
      size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
      memory pressure.
      
      Without the patches and using SLUB, the machine locks up within minutes
      and runs to completion with them applied.  With SLAB, the story is
      different as an unpatched kernel run to completion.  However, the patched
      kernel completed the test 45% faster.
      
      MICRO
                                               3.5.0-rc2 3.5.0-rc2
      					 vanilla     swapnbd
      Unrecognised test vmscan-anon-mmap-write
      MMTests Statistics: duration
      Sys Time Running Test (seconds)             197.80    173.07
      User+Sys Time Running Test (seconds)        206.96    182.03
      Total Elapsed Time (seconds)               3240.70   1762.09
      
      This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
      
      Allocations of pages below the min watermark run a risk of the machine
      hanging due to a lack of memory.  To prevent this, only callers who have
      PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
      allowed to allocate with ALLOC_NO_WATERMARKS.  Once they are allocated to
      a slab though, nothing prevents other callers consuming free objects
      within those slabs.  This patch limits access to slab pages that were
      alloced from the PFMEMALLOC reserves.
      
      When this patch is applied, pages allocated from below the low watermark
      are returned with page->pfmemalloc set and it is up to the caller to
      determine how the page should be protected.  SLAB restricts access to any
      page with page->pfmemalloc set to callers which are known to able to
      access the PFMEMALLOC reserve.  If one is not available, an attempt is
      made to allocate a new page rather than use a reserve.  SLUB is a bit more
      relaxed in that it only records if the current per-CPU page was allocated
      from PFMEMALLOC reserve and uses another partial slab if the caller does
      not have the necessary GFP or process flags.  This was found to be
      sufficient in tests to avoid hangs due to SLUB generally maintaining
      smaller lists than SLAB.
      
      In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
      a slab allocation even though free objects are available because they are
      being preserved for callers that are freeing pages.
      
      [a.p.zijlstra@chello.nl: Original implementation]
      [sebastian@breakpoint.cc: Correct order of page flag clearing]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      072bb0aa
  8. 09 7月, 2012 4 次提交
  9. 02 7月, 2012 4 次提交
  10. 20 6月, 2012 1 次提交
  11. 14 6月, 2012 4 次提交
  12. 22 3月, 2012 1 次提交
    • M
      cpuset: mm: reduce large amounts of memory barrier related damage v3 · cc9a6c87
      Mel Gorman 提交于
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") wins a super prize for the largest number of
      memory barriers entered into fast paths for one commit.
      
      [get|put]_mems_allowed is incredibly heavy with pairs of full memory
      barriers inserted into a number of hot paths.  This was detected while
      investigating at large page allocator slowdown introduced some time
      after 2.6.32.  The largest portion of this overhead was shown by
      oprofile to be at an mfence introduced by this commit into the page
      allocator hot path.
      
      For extra style points, the commit introduced the use of yield() in an
      implementation of what looks like a spinning mutex.
      
      This patch replaces the full memory barriers on both read and write
      sides with a sequence counter with just read barriers on the fast path
      side.  This is much cheaper on some architectures, including x86.  The
      main bulk of the patch is the retry logic if the nodemask changes in a
      manner that can cause a false failure.
      
      While updating the nodemask, a check is made to see if a false failure
      is a risk.  If it is, the sequence number gets bumped and parallel
      allocators will briefly stall while the nodemask update takes place.
      
      In a page fault test microbenchmark, oprofile samples from
      __alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
      actual results were
      
                                   3.3.0-rc3          3.3.0-rc3
                                   rc3-vanilla        nobarrier-v2r1
          Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
          Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
          Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
          Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
          Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
          Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
          Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
          Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
          Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
          Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
          Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
          Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
          Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
          Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
          Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
          MMTests Statistics: duration
          Sys Time Running Test (seconds)             135.68    132.17
          User+Sys Time Running Test (seconds)         164.2    160.13
          Total Elapsed Time (seconds)                123.46    120.87
      
      The overall improvement is small but the System CPU time is much
      improved and roughly in correlation to what oprofile reported (these
      performance figures are without profiling so skew is expected).  The
      actual number of page faults is noticeably improved.
      
      For benchmarks like kernel builds, the overall benefit is marginal but
      the system CPU time is slightly reduced.
      
      To test the actual bug the commit fixed I opened two terminals.  The
      first ran within a cpuset and continually ran a small program that
      faulted 100M of anonymous data.  In a second window, the nodemask of the
      cpuset was continually randomised in a loop.
      
      Without the commit, the program would fail every so often (usually
      within 10 seconds) and obviously with the commit everything worked fine.
      With this patch applied, it also worked fine so the fix should be
      functionally equivalent.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc9a6c87
  13. 10 3月, 2012 1 次提交
  14. 23 1月, 2012 1 次提交
  15. 10 1月, 2012 1 次提交
  16. 05 12月, 2011 1 次提交
  17. 17 11月, 2011 1 次提交
  18. 11 11月, 2011 2 次提交
  19. 28 9月, 2011 1 次提交
    • V
      mm: restrict access to slab files under procfs and sysfs · ab067e99
      Vasiliy Kulikov 提交于
      Historically /proc/slabinfo and files under /sys/kernel/slab/* have
      world read permissions and are accessible to the world.  slabinfo
      contains rather private information related both to the kernel and
      userspace tasks.  Depending on the situation, it might reveal either
      private information per se or information useful to make another
      targeted attack.  Some examples of what can be learned by
      reading/watching for /proc/slabinfo entries:
      
      1) dentry (and different *inode*) number might reveal other processes fs
      activity.  The number of dentry "active objects" doesn't strictly show
      file count opened/touched by a process, however, there is a good
      correlation between them.  The patch "proc: force dcache drop on
      unauthorized access" relies on the privacy of dentry count.
      
      2) different inode entries might reveal the same information as (1), but
      these are more fine granted counters.  If a filesystem is mounted in a
      private mount point (or even a private namespace) and fs type differs from
      other mounted fs types, fs activity in this mount point/namespace is
      revealed.  If there is a single ecryptfs mount point, the whole fs
      activity of a single user is revealed.  Number of files in ecryptfs
      mount point is a private information per se.
      
      3) fuse_* reveals number of files / fs activity of a user in a user
      private mount point.  It is approx. the same severity as ecryptfs
      infoleak in (2).
      
      4) sysfs_dir_cache similar to (2) reveals devices' addition/removal,
      which can be otherwise hidden by "chmod 0700 /sys/".  With 0444 slabinfo
      the precise number of sysfs files is known to the world.
      
      5) buffer_head might reveal some kernel activity.  With other
      information leaks an attacker might identify what specific kernel
      routines generate buffer_head activity.
      
      6) *kmalloc* infoleaks are very situational.  Attacker should watch for
      the specific kmalloc size entry and filter the noise related to the unrelated
      kernel activity.  If an attacker has relatively silent victim system, he
      might get rather precise counters.
      
      Additional information sources might significantly increase the slabinfo
      infoleak benefits.  E.g. if an attacker knows that the processes
      activity on the system is very low (only core daemons like syslog and
      cron), he may run setxid binaries / trigger local daemon activity /
      trigger network services activity / await sporadic cron jobs activity
      / etc. and get rather precise counters for fs and network activity of
      these privileged tasks, which is unknown otherwise.
      
      Also hiding slabinfo and /sys/kernel/slab/* is a one step to complicate
      exploitation of kernel heap overflows (and possibly, other bugs).  The
      related discussion:
      
      http://thread.gmane.org/gmane.linux.kernel/1108378
      
      To keep compatibility with old permission model where non-root
      monitoring daemon could watch for kernel memleaks though slabinfo one
      should do:
      
          groupadd slabinfo
          usermod -a -G slabinfo $MONITOR_USER
      
      And add the following commands to init scripts (to mountall.conf in
      Ubuntu's upstart case):
      
          chmod g+r /proc/slabinfo /sys/kernel/slab/*/*
          chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/*
      Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
      Reviewed-by: NKees Cook <kees@ubuntu.com>
      Reviewed-by: NDave Hansen <dave@linux.vnet.ibm.com>
      Acked-by: NChristoph Lameter <cl@gentwo.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      CC: Valdis.Kletnieks@vt.edu
      CC: Linus Torvalds <torvalds@linux-foundation.org>
      CC: Alan Cox <alan@linux.intel.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      ab067e99
  20. 04 8月, 2011 2 次提交
  21. 01 8月, 2011 1 次提交
    • S
      slab: use print_hex_dump · fdde6abb
      Sebastian Andrzej Siewior 提交于
      Less code and the advantage of ascii dump.
      
      before:
      | Slab corruption: names_cache start=c5788000, len=4096
      | 000: 6b 6b 01 00 00 00 56 00 00 00 24 00 00 00 2a 00
      | 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      | 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff
      | 030: ff ff ff ff e2 b4 17 18 c7 e4 08 06 00 01 08 00
      | 040: 06 04 00 01 e2 b4 17 18 c7 e4 0a 00 00 01 00 00
      | 050: 00 00 00 00 0a 00 00 02 6b 6b 6b 6b 6b 6b 6b 6b
      
      after:
      | Slab corruption: size-4096 start=c38a9000, len=4096
      | 000: 6b 6b 01 00 00 00 56 00 00 00 24 00 00 00 2a 00  kk....V...$...*.
      | 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      | 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff  ................
      | 030: ff ff ff ff d2 56 5f aa db 9c 08 06 00 01 08 00  .....V_.........
      | 040: 06 04 00 01 d2 56 5f aa db 9c 0a 00 00 01 00 00  .....V_.........
      | 050: 00 00 00 00 0a 00 00 02 6b 6b 6b 6b 6b 6b 6b 6b  ........kkkkkkkk
      Acked-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      fdde6abb
  22. 31 7月, 2011 1 次提交
  23. 28 7月, 2011 1 次提交