1. 23 Mar 2011 (18 commits)
    • mm: truncate: change remove_from_page_cache · 5adc7b51
      Minchan Kim authored
      This patch series changes remove_from_page_cache()'s page ref counting
      rule.  Page cache ref count is decreased in delete_from_page_cache().  So
      we don't need to decrease the page reference in callers.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5adc7b51
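      A minimal before/after sketch of the caller-side change this series makes
      (illustrative only; the actual callers are the truncate/shmem paths
      touched by the series):

          /* Before: the caller drops the page cache reference itself. */
          remove_from_page_cache(page);
          page_cache_release(page);     /* pagecache ref */

          /* After: delete_from_page_cache() drops that reference internally. */
          delete_from_page_cache(page);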
    • mm: shmem: change remove_from_page_cache · 4c73b1bc
      Minchan Kim authored
      This patch series changes remove_from_page_cache()'s page ref counting
      rule.  Page cache ref count is decreased in delete_from_page_cache().  So
      we don't need to decrease the page reference in callers.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c73b1bc
    • mm: introduce delete_from_page_cache() · 97cecb5a
      Minchan Kim authored
      Presently we increase the page refcount in add_to_page_cache() but don't
      decrease it in remove_from_page_cache().  Such asymmetry adds confusion,
      requiring that callers notice it and add a comment explaining why they
      release a page reference.  It's not a good API.
      
      A long time ago, Hugh tried it (http://lkml.org/lkml/2004/10/24/140) but
      gave up because reiser4's drop_page() had to unlock the page between
      removing it from the page cache and doing the page_cache_release().  But
      now the situation has changed: nothing in current mainline appears to
      have such an obstacle.  The remaining concern is out-of-mainline
      filesystems; if they do what reiser4 did, this patch could be a problem
      for them, but they will discover it at compile time since
      remove_from_page_cache() is removed.
      
      This patch:
      
      This function is just a wrapper around remove_from_page_cache(); the
      difference is that it drops the page reference itself, so callers have to
      make sure they hold a page reference before calling it.
      
      This patch is ready for removing remove_from_page_cache().
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Edward Shishkin <edward.shishkin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97cecb5a
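      A simplified sketch of the wrapper described above; it assumes the era's
      __remove_from_page_cache() helper and mapping->tree_lock, and omits the
      debugging checks and memcg uncharging the real function carries:

          void delete_from_page_cache(struct page *page)
          {
                  struct address_space *mapping = page->mapping;

                  BUG_ON(!PageLocked(page));

                  spin_lock_irq(&mapping->tree_lock);
                  __remove_from_page_cache(page);
                  spin_unlock_irq(&mapping->tree_lock);

                  /* Unlike remove_from_page_cache(), drop the pagecache
                   * reference here so callers no longer have to. */
                  page_cache_release(page);
          }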
    • mm: add replace_page_cache_page() function · ef6a3c63
      Miklos Szeredi authored
      This function basically does:
      
           remove_from_page_cache(old);
           page_cache_release(old);
           add_to_page_cache_locked(new);
      
      Except it does this atomically, so there's no possibility for the "add" to
      fail because of a race.
      
      If memory cgroups are enabled, then the memory cgroup charge is also moved
      from the old page to the new.
      
      This function is currently used by fuse to move pages into the page cache
      on read, instead of copying the page contents.
      
      [minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef6a3c63
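      A usage sketch of the new call, assuming it returns 0 on success (the
      internal add-to-page-cache step can still fail with -ENOMEM); error
      handling shown is illustrative:

          /* One atomic call instead of the racy three-step sequence above. */
          error = replace_page_cache_page(old_page, new_page, GFP_KERNEL);
          if (error) {
                  /* old_page is still in the page cache; fall back to
                   * copying the data into it instead of replacing it. */
          }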
    • mm: allow GUP to fail instead of waiting on a page · 318b275f
      Gleb Natapov authored
      A GUP user may want to acquire a reference to a page only if it is
      already in memory, but not if I/O is needed to bring it in.  For example,
      KVM may tell the vcpu to schedule another guest process if the current
      one is trying to access a swapped-out page.  Meanwhile the page will be
      swapped in, and the guest process that depends on it will be able to run
      again.
      
      This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and
      FOLL_NOWAIT follow_page flags.  FAULT_FLAG_RETRY_NOWAIT, when used in
      conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that
      it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY
      instead.
      
      [akpm@linux-foundation.org: improve FOLL_NOWAIT comment]
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      318b275f
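      A simplified sketch of how a page-lock wait in the fault path can honour
      the new flag; this is modelled loosely on the lock-or-retry helper of
      that era and is not the exact kernel code:

          if (!trylock_page(page)) {
                  if (fault_flags & FAULT_FLAG_ALLOW_RETRY) {
                          if (!(fault_flags & FAULT_FLAG_RETRY_NOWAIT)) {
                                  /* Normal retry: drop mmap_sem and wait. */
                                  up_read(&mm->mmap_sem);
                                  wait_on_page_locked(page);
                          }
                          /* With NOWAIT: return at once, mmap_sem still held. */
                          return VM_FAULT_RETRY;
                  }
                  lock_page(page);
          }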
    • mm: notifier_from_errno() cleanup · 5fda1bd5
      Prarit Bhargava authored
      While looking at some other notifier callbacks I noticed this code could
      use a simple cleanup.
      
      The caller no longer needs the if (ret)/else conditional; that same
      check is now done inside notifier_from_errno().
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5fda1bd5
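      A before/after sketch of the callback cleanup: notifier_from_errno(0)
      already returns NOTIFY_OK, so the caller's branch is redundant.

          /* Before: */
          if (ret)
                  return notifier_from_errno(ret);
          return NOTIFY_OK;

          /* After: notifier_from_errno() does the err == 0 check itself. */
          return notifier_from_errno(ret);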
    • oom: suppress nodes that are not allowed from meminfo on page alloc failure · cbf978bf
      David Rientjes authored
      Displaying extremely verbose meminfo for all nodes on the system is
      overkill for page allocation failures when the context restricts that
      allocation to only a subset of nodes.  We don't particularly care about
      the state of nodes that are not allowed in the current context; they may
      have an abundance of memory, but we cannot allocate from that part of
      memory anyway.
      
      This patch suppresses disallowed nodes from the meminfo dump on a page
      allocation failure if the context requires it.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cbf978bf
    • oom: suppress show_mem() for many nodes in irq context on page alloc failure · 29423e77
      David Rientjes authored
      When a page allocation failure occurs, show_mem() is called to dump the
      state of the VM so users may understand what happened to get into that
      condition.
      
      This output, however, can be extremely verbose.  In irq context, it may
      result in significant delays that incur NMI watchdog timeouts when the
      machine is large (we use CONFIG_NODES_SHIFT > 8 here to define a "large"
      machine since the length of the show_mem() output is proportional to the
      number of possible nodes).
      
      This patch suppresses the show_mem() call in irq context when the kernel
      has CONFIG_NODES_SHIFT > 8.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29423e77
    • oom: suppress nodes that are not allowed from meminfo on oom kill · ddd588b5
      David Rientjes authored
      The oom killer is extremely verbose for machines with a large number of
      cpus and/or nodes.  This verbosity can often be harmful if it causes other
      important messages to be scrolled from the kernel log and incurs a
      significant time delay, specifically for kernels with CONFIG_NODES_SHIFT >
      8.
      
      This patch causes only memory information to be displayed for nodes that
      are allowed by current's cpuset when dumping the VM state.  Information
      for all other nodes is irrelevant to the oom condition; we don't care if
      there's an abundance of memory elsewhere if we can't access it.
      
      This only affects the behavior of dumping memory information when an oom
      is triggered.  Other dumps, such as for sysrq+m, still display the
      unfiltered form when using the existing show_mem() interface.
      
      Additionally, the per-cpu pageset statistics are extremely verbose in oom
      killer output, so they are now suppressed.  This removes
      
      	nodes_weight(current->mems_allowed) * (1 + nr_cpus)
      
      lines from the oom killer output.
      
      Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed
      nodes.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ddd588b5
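      A sketch of the node-filtering idea shared by the three oom/show_mem
      entries above; apart from SHOW_MEM_FILTER_NODES, the helper name and
      exact check are illustrative assumptions:

          /* Skip a node's stats when filtering is requested and the node is
           * not in the current task's allowed set.  (illustrative helper) */
          static bool skip_free_areas_node(unsigned int flags, int nid)
          {
                  if (!(flags & SHOW_MEM_FILTER_NODES))
                          return false;
                  return !node_isset(nid, cpuset_current_mems_allowed);
          }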
    • mm/compaction: check migrate_pages's return value instead of list_empty() · 9d502c1c
      Minchan Kim authored
      Many of migrate_pages()'s callers check its return value instead of
      list_empty(), since cf608ac1 ("mm: compaction: fix COMPACTPAGEFAILED
      counting").  This patch makes compaction's use of migrate_pages()
      consistent with the others.  It should not change the old behavior.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9d502c1c
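      A sketch of the consistency change, assuming the 2.6.38-era
      migrate_pages() signature and compaction_alloc()/putback_lru_pages()
      helpers of that time:

          err = migrate_pages(&cc->migratepages, compaction_alloc,
                              (unsigned long)cc, false, cc->sync);

          /* Check the return value, not list_empty(&cc->migratepages). */
          if (err) {
                  /* Put back whatever could not be migrated. */
                  putback_lru_pages(&cc->migratepages);
                  cc->nr_migratepages = 0;
          }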
    • mm: compaction: prevent kswapd compacting memory to reduce CPU usage · d527caf2
      Andrea Arcangeli authored
      This patch reverts 5a03b051 ("thp: use compaction in kswapd for GFP_ATOMIC
      order > 0") due to reports stating that kswapd CPU usage was higher and
      IRQs were being disabled more frequently.  This was reported at
      http://www.spinics.net/linux/fedora/alsa-user/msg09885.html.
      
      Without this patch applied, CPU usage by kswapd hovers around the 20% mark
      according to the tester (Arthur Marsh:
      http://www.spinics.net/linux/fedora/alsa-user/msg09899.html).  With this
      patch applied, it's around 2%.
      
      The problem is not related to THP which specifies __GFP_NO_KSWAPD but is
      triggered by high-order allocations hitting the low watermark for their
      order and waking kswapd on kernels with CONFIG_COMPACTION set.  The most
      common trigger for this is network cards configured for jumbo frames but
      it's also possible it'll be triggered by fork-heavy workloads (order-1)
      and some wireless cards which depend on order-1 allocations.
      
      The symptoms for the user will be high CPU usage by kswapd in low-memory
      situations which could be confused with another writeback problem.  While
      a patch like 5a03b051 may be reintroduced in the future, this patch plays
      it safe for now and reverts it.
      
      [mel@csn.ul.ie: Beefed up the changelog]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
      Tested-by: Arthur Marsh <arthur.marsh@internode.on.net>
      Cc: <stable@kernel.org>		[2.6.38.1]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d527caf2
    • mm: vmap area cache · 89699605
      Nick Piggin authored
      Provide a free area cache for the vmalloc virtual address allocator, based
      on the algorithm used by the user virtual memory allocator.
      
      This reduces the number of rbtree operations and linear traversals over
      the vmap extents in order to find a free area, by starting off at the last
      point that a free area was found.
      
      The free area cache is reset if areas are freed behind it, or if we are
      searching for a smaller area or alignment than last time.  So allocation
      patterns are not changed (verified by corner-case and random test cases in
      userspace testing).
      
      This solves a regression caused by lazy vunmap TLB purging introduced in
      db64fe02 (mm: rewrite vmap layer).  That patch will leave extents in the
      vmap allocator after they are vunmapped, and until a significant number
      accumulate that can be flushed in a single batch.  So in a workload that
      vmallocs/vfrees frequently, a chain of extents will build up from the
      VMALLOC_START address, and it has to be iterated over each time (giving
      O(n)-type behaviour).
      
      After this patch, the search will start from where it left off, giving
      closer to an amortized O(1).
      
      This is verified to solve the regressions reported by Steven in GFS2 and
      by Avi in KVM.
      
      Hugh's update:
      
      : I tried out the recent mmotm, and on one machine was fortunate to hit
      : the BUG_ON(first->va_start < addr) which seems to have been stalling
      : your vmap area cache patch ever since May.
      
      : I can get you addresses etc, I did dump a few out; but once I stared
      : at them, it was easier just to look at the code: and I cannot see how
      : you would be so sure that first->va_start < addr, once you've done
      : that addr = ALIGN(max(...), align) above, if align is over 0x1000
      : (align was 0x8000 or 0x4000 in the cases I hit: ioremaps like Steve).
      
      : I originally got around it by just changing the
      : 		if (first->va_start < addr) {
      : to
      : 		while (first->va_start < addr) {
      : without thinking about it any further; but that seemed unsatisfactory,
      : why would we want to loop here when we've got another very similar
      : loop just below it?
      
      : I am never going to admit how long I've spent trying to grasp your
      : "while (n)" rbtree loop just above this, the one with the peculiar
      : 		if (!first && tmp->va_start < addr + size)
      : in.  That's unfamiliar to me, I'm guessing it's designed to save a
      : subsequent rb_next() in a few circumstances (at risk of then setting
      : a wrong cached_hole_size?); but they did appear few to me, and I didn't
      : feel I could sign off something with that in when I don't grasp it,
      : and it seems responsible for extra code and mistaken BUG_ON below it.
      
      : I've reverted to the familiar rbtree loop that find_vma() does (but
      : with va_end >= addr as you had, to respect the additional guard page):
      : and then (given that cached_hole_size starts out 0) I don't see the
      : need for any complications below it.  If you do want to keep that loop
      : as you had it, please add a comment to explain what it's trying to do,
      : and where addr is relative to first when you emerge from it.
      
      : Aren't your tests "size <= cached_hole_size" and
      : "addr + size > first->va_start" forgetting the guard page we want
      : before the next area?  I've changed those.
      
      : I have not changed your many "addr + size - 1 < addr" overflow tests,
      : but have since come to wonder, shouldn't they be "addr + size < addr"
      : tests - won't the vend checks go wrong if addr + size is 0?
      
      : I have added a few comments - Wolfgang Wander's 2.6.13 description of
      : 1363c3cd Avoiding mmap fragmentation
      : helped me a lot, perhaps a pointer to that would be good too.  And I found
      : it easier to understand when I renamed cached_start slightly and moved the
      : overflow label down.
      
      : This patch would go after your mm-vmap-area-cache.patch in mmotm.
      : Trivially, nobody is going to get that BUG_ON with this patch, and it
      : appears to work fine on my machines; but I have not given it anything like
      : the testing you did on your original, and may have broken all the
      : performance you were aiming for.  Please take a look and test it out
      : integrate with yours if you're satisfied - thanks.
      
      [akpm@linux-foundation.org: add locking comment]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reported-and-tested-by: Steven Whitehouse <swhiteho@redhat.com>
      Reported-and-tested-by: Avi Kivity <avi@redhat.com>
      Tested-by: "Barry J. Marson" <bmarson@redhat.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89699605
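      A conceptual sketch of when the free-area cache has to be invalidated
      before a search, following the rules stated above; the variable names
      mirror the description and are not necessarily the final code:

          if (!free_vmap_cache ||
              size < cached_hole_size ||   /* smaller request than last time */
              vstart < cached_vstart ||    /* search window starts earlier   */
              align < cached_align) {      /* smaller alignment than before  */
                  cached_hole_size = 0;
                  free_vmap_cache = NULL;  /* fall back to a full rbtree walk */
          }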
    • oom: avoid deferring oom killer if exiting task is being traced · edd45544
      David Rientjes authored
      The oom killer naturally defers killing anything if it finds an eligible
      task that is already exiting and has yet to detach its ->mm.  This avoids
      unnecessarily killing tasks when one is already in the exit path and may
      free enough memory that the oom killer is no longer needed.  This is
      detected by PF_EXITING since threads that have already detached their ->mm
      are no longer considered at all.
      
      The problem with always deferring when a thread is PF_EXITING, however, is
      that it may never actually exit when being traced, specifically if another
      task is tracing it with PTRACE_O_TRACEEXIT.  The oom killer does not want
      to defer in this case since there is no guarantee that thread will ever
      exit without intervention.
      
      This patch will now only defer the oom killer when a thread is PF_EXITING
      and no ptracer has stopped its progress in the exit path.  It also ensures
      that a child is sacrificed for the chosen parent only if it has a
      different ->mm as the comment implies: this ensures that the thread group
      leader is always targeted appropriately.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: <stable@kernel.org>		[2.6.38.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      edd45544
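      A sketch of the deferral test after this change, assuming the era's
      task_ptrace()/PT_TRACE_EXIT helpers (a simplified shape, not the exact
      select_bad_process() code):

          /* Only defer for an exiting task if no ptracer can hold it up
           * at its trace-exit stop. */
          if ((task->flags & PF_EXITING) && task->mm &&
              !(task_ptrace(task) & PT_TRACE_EXIT))
                  return ERR_PTR(-1UL);   /* defer: let it finish exiting */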
    • oom: skip zombies when iterating tasklist · 30e2b41f
      Andrey Vagin authored
      We shouldn't defer oom killing if a thread has already detached its ->mm
      and still has TIF_MEMDIE set.  Memory needs to be freed, so instead kill
      other threads that pin the same ->mm, or find another task to kill.
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.38.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      30e2b41f
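      A simplified sketch of the tasklist-scan change described above (the
      loop body is condensed; the real select_bad_process() does more):

          for_each_process(p) {
                  /* A task that has already released its ->mm cannot free
                   * any more memory; skip it rather than deferring on its
                   * TIF_MEMDIE flag. */
                  if (!p->mm)
                          continue;

                  if (test_tsk_thread_flag(p, TIF_MEMDIE))
                          return ERR_PTR(-1UL);   /* a kill is in progress */

                  /* ... scoring of remaining candidates continues ... */
          }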
    • oom: prevent unnecessary oom kills or kernel panics · 3a5dda7a
      David Rientjes authored
      This patch prevents unnecessary oom kills or kernel panics by reverting
      two commits:
      
      	495789a5 (oom: make oom_score to per-process value)
      	cef1d352 (oom: multi threaded process coredump don't make deadlock)
      
      First, 495789a5 (oom: make oom_score to per-process value) ignores the
      fact that all threads in a thread group do not necessarily exit at the
      same time.
      
      It is imperative that select_bad_process() detect threads that are in the
      exit path, specifically those with PF_EXITING set, to prevent needlessly
      killing additional tasks.  If a process is oom killed and the thread group
      leader exits, select_bad_process() cannot detect the other threads that
      are PF_EXITING by iterating over only processes.  Thus, it currently
      chooses another task unnecessarily for oom kill or panics the machine when
      nothing else is eligible.
      
      By iterating over threads instead, it is possible to detect threads that
      are exiting and nominate them for oom kill so they get access to memory
      reserves.
      
      Second, cef1d352 (oom: multi threaded process coredump don't make
      deadlock) erroneously avoids making the oom killer a no-op when an
      eligible thread other than current is found to be exiting.  We want to
      detect this situation so that we may allow that exiting thread time to
      exit and free its memory; if it is able to exit on its own, that should
      free memory so current is no longer oom.  If it is not able to exit on its
      own, the oom killer will nominate it for oom kill which, in this case,
      only means it will get access to memory reserves.
      
      Without this change, it is easy for the oom killer to unnecessarily target
      tasks when all threads of a victim don't exit before the thread group
      leader or, in the worst case, panic the machine.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: <stable@kernel.org>		[2.6.38.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3a5dda7a
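      A sketch of the first change (iterating threads rather than processes),
      using the do_each_thread() iterator available at the time; the body is
      illustrative:

          struct task_struct *g, *p;

          /* Walk every thread so a PF_EXITING thread is noticed even when
           * its thread group leader has already exited. */
          do_each_thread(g, p) {
                  if ((p->flags & PF_EXITING) && p->mm) {
                          /* Nominate p: give it access to memory reserves
                           * instead of killing another task. */
                  }
          } while_each_thread(g, p);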
    • mm: swap: unlock swapfile inode mutex before closing file on bad swapfiles · 52c50567
      Mel Gorman authored
      If an administrator tries to swapon a file backed by NFS, the inode mutex
      is taken (as it is for any swapfile), but the file is later identified as
      a bad swapfile due to the lack of bmap, and cleanup is attempted.  During
      cleanup, an attempt is made to close the file, but with inode->i_mutex
      still held.  Closing an NFS file syncs it, which tries to acquire the
      inode mutex again, leading to deadlock.  If lockdep is enabled, the
      following appears on the console:
      
          =============================================
          [ INFO: possible recursive locking detected ]
          2.6.38-rc8-autobuild #1
          ---------------------------------------------
          swapon/2192 is trying to acquire lock:
           (&sb->s_type->i_mutex_key#13){+.+.+.}, at: vfs_fsync_range+0x47/0x7c
      
          but task is already holding lock:
           (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7
      
          other info that might help us debug this:
          1 lock held by swapon/2192:
           #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7
      
          stack backtrace:
          Pid: 2192, comm: swapon Not tainted 2.6.38-rc8-autobuild #1
          Call Trace:
              __lock_acquire+0x2eb/0x1623
              find_get_pages_tag+0x14a/0x174
              pagevec_lookup_tag+0x25/0x2e
              vfs_fsync_range+0x47/0x7c
              lock_acquire+0xd3/0x100
              vfs_fsync_range+0x47/0x7c
              nfs_flush_one+0x0/0xdf [nfs]
              mutex_lock_nested+0x40/0x2b1
              vfs_fsync_range+0x47/0x7c
              vfs_fsync_range+0x47/0x7c
              vfs_fsync+0x1c/0x1e
              nfs_file_flush+0x64/0x69 [nfs]
              filp_close+0x43/0x72
              sys_swapon+0xa39/0xae7
              sysret_check+0x2e/0x69
              system_call_fastpath+0x16/0x1b
      
      This patch releases the mutex, if it is held, before calling filp_close(),
      so swapon fails as expected without deadlocking when the swapfile is
      backed by NFS.  If accepted for 2.6.39, it should also be considered a
      -stable candidate for 2.6.38 and 2.6.37.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: <stable@kernel.org>		[2.6.37+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52c50567
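      A sketch of the error-path fix, simplified from sys_swapon()'s bad-swap
      cleanup (exact labels and surrounding teardown omitted):

          bad_swap:
                  /* Drop the inode mutex before filp_close(): closing an NFS
                   * file fsyncs it, which takes i_mutex again and deadlocks. */
                  if (inode && S_ISREG(inode->i_mode)) {
                          mutex_unlock(&inode->i_mutex);
                          inode = NULL;   /* avoid unlocking a second time below */
                  }
                  if (swap_file)
                          filp_close(swap_file, NULL);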
    • slub: Add statistics for this_cmpxchg_double failures · 4fdccdfb
      Christoph Lameter authored
      Add some statistics for debugging.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      4fdccdfb
    • slub: Add missing irq restore for the OOM path · 2fd66c51
      Christoph Lameter authored
      The OOM path is missing the irq restore in the CONFIG_CMPXCHG_LOCAL case.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      2fd66c51
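      A sketch of the fix's shape only: the slowpath disables interrupts in the
      CONFIG_CMPXCHG_LOCAL case, so the allocation-failure exit must restore
      them before returning (exact placement in __slab_alloc() not shown):

          if (!object) {
          #ifdef CONFIG_CMPXCHG_LOCAL
                  local_irq_restore(flags);   /* was missing on this path */
          #endif
                  return NULL;
          }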
  2. 21 Mar 2011 (1 commit)
  3. 18 Mar 2011 (4 commits)
  4. 15 Mar 2011 (2 commits)
  5. 14 Mar 2011 (3 commits)
  6. 12 Mar 2011 (3 commits)
  7. 11 Mar 2011 (2 commits)
    • Lockless (and preemptless) fastpaths for slub · 8a5ec0ba
      Christoph Lameter authored
      Use the this_cpu_cmpxchg_double functionality to implement a lockless
      allocation algorithm on arches that support fast this_cpu_ops.
      
      Each of the per cpu pointers is paired with a transaction id that ensures
      that updates of the per cpu information can only occur in sequence on
      a certain cpu.
      
      A transaction id is a "long" integer composed of an event number and the
      cpu number.  The event number is incremented for every change to the per
      cpu state.  This lets the cmpxchg instruction verify, for an update, that
      nothing interfered, that we are updating the per cpu structure of the
      processor the information was read from, and that we are still on that
      processor when we update it.
      
      This results in a significant decrease of the overhead in the fastpaths. It
      also makes it easy to adopt the fast path for realtime kernels since this
      is lockless and does not require the use of the current per cpu area
      over the critical section. It is only important that the per cpu area is
      current at the beginning of the critical section and at the end.
      
      So there is no need even to disable preemption.
      
      Test results show that the fastpath cycle count is reduced by up to ~ 40%
      (alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree
      adds a few cycles.
      
      Sadly this does nothing for the slowpath, which is where the main
      performance issues in slub are, but the best-case performance rises
      significantly.  (For that, see the more complex slub patches that require
      cmpxchg_double.)
      
      Kmalloc: alloc/free test
      
      Before:
      
      10000 times kmalloc(8)/kfree -> 134 cycles
      10000 times kmalloc(16)/kfree -> 152 cycles
      10000 times kmalloc(32)/kfree -> 144 cycles
      10000 times kmalloc(64)/kfree -> 142 cycles
      10000 times kmalloc(128)/kfree -> 142 cycles
      10000 times kmalloc(256)/kfree -> 132 cycles
      10000 times kmalloc(512)/kfree -> 132 cycles
      10000 times kmalloc(1024)/kfree -> 135 cycles
      10000 times kmalloc(2048)/kfree -> 135 cycles
      10000 times kmalloc(4096)/kfree -> 135 cycles
      10000 times kmalloc(8192)/kfree -> 144 cycles
      10000 times kmalloc(16384)/kfree -> 754 cycles
      
      After:
      
      10000 times kmalloc(8)/kfree -> 78 cycles
      10000 times kmalloc(16)/kfree -> 78 cycles
      10000 times kmalloc(32)/kfree -> 82 cycles
      10000 times kmalloc(64)/kfree -> 88 cycles
      10000 times kmalloc(128)/kfree -> 79 cycles
      10000 times kmalloc(256)/kfree -> 79 cycles
      10000 times kmalloc(512)/kfree -> 85 cycles
      10000 times kmalloc(1024)/kfree -> 82 cycles
      10000 times kmalloc(2048)/kfree -> 82 cycles
      10000 times kmalloc(4096)/kfree -> 85 cycles
      10000 times kmalloc(8192)/kfree -> 82 cycles
      10000 times kmalloc(16384)/kfree -> 706 cycles
      
      Kmalloc: Repeatedly allocate then free test
      
      Before:
      
      10000 times kmalloc(8) -> 211 cycles kfree -> 113 cycles
      10000 times kmalloc(16) -> 174 cycles kfree -> 115 cycles
      10000 times kmalloc(32) -> 235 cycles kfree -> 129 cycles
      10000 times kmalloc(64) -> 222 cycles kfree -> 120 cycles
      10000 times kmalloc(128) -> 343 cycles kfree -> 139 cycles
      10000 times kmalloc(256) -> 827 cycles kfree -> 147 cycles
      10000 times kmalloc(512) -> 1048 cycles kfree -> 272 cycles
      10000 times kmalloc(1024) -> 2043 cycles kfree -> 528 cycles
      10000 times kmalloc(2048) -> 4002 cycles kfree -> 571 cycles
      10000 times kmalloc(4096) -> 7740 cycles kfree -> 628 cycles
      10000 times kmalloc(8192) -> 8062 cycles kfree -> 850 cycles
      10000 times kmalloc(16384) -> 8895 cycles kfree -> 1249 cycles
      
      After:
      
      10000 times kmalloc(8) -> 190 cycles kfree -> 129 cycles
      10000 times kmalloc(16) -> 76 cycles kfree -> 123 cycles
      10000 times kmalloc(32) -> 126 cycles kfree -> 124 cycles
      10000 times kmalloc(64) -> 181 cycles kfree -> 128 cycles
      10000 times kmalloc(128) -> 310 cycles kfree -> 140 cycles
      10000 times kmalloc(256) -> 809 cycles kfree -> 165 cycles
      10000 times kmalloc(512) -> 1005 cycles kfree -> 269 cycles
      10000 times kmalloc(1024) -> 1999 cycles kfree -> 527 cycles
      10000 times kmalloc(2048) -> 3967 cycles kfree -> 570 cycles
      10000 times kmalloc(4096) -> 7658 cycles kfree -> 637 cycles
      10000 times kmalloc(8192) -> 8111 cycles kfree -> 859 cycles
      10000 times kmalloc(16384) -> 8791 cycles kfree -> 1173 cycles
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      8a5ec0ba
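      A conceptual sketch of the tid-protected allocation fastpath described
      above (simplified; the exact name of the double-width cmpxchg primitive
      varied between this_cpu_cmpxchg_double() and an irqsafe_ variant, and
      the slowpath call is abbreviated):

          redo:
                  tid = c->tid;           /* per-cpu transaction id          */
                  object = c->freelist;   /* per-cpu lockless freelist head  */
                  if (unlikely(!object))
                          return __slab_alloc(s, gfpflags, node, addr, c);

                  /* Commit only if neither freelist nor tid changed, i.e. no
                   * other alloc/free happened on this cpu in the meantime and
                   * we are still on the cpu the data was read from. */
                  if (!this_cpu_cmpxchg_double(s->cpu_slab->freelist,
                                               s->cpu_slab->tid,
                                               object, tid,
                                               get_freepointer(s, object),
                                               next_tid(tid)))
                          goto redo;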
    • slub: Get rid of slab_free_hook_irq() · d3f661d6
      Christoph Lameter authored
      The following patch will make the fastpaths lockless and will no longer
      require interrupts to be disabled.  Calling the free hook with interrupts
      disabled will no longer be possible.
      
      Move the slab_free_hook_irq() logic into slab_free_hook().  Only disable
      interrupts if features are selected that require callbacks with interrupts
      off, and re-enable them after the calls have been made.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      d3f661d6
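      A sketch of the merged hook, close to the shape described above;
      interrupts are disabled locally only when debugging options that need it
      are configured (field and helper names follow the slub code of that era
      and are assumptions here):

          static inline void slab_free_hook(struct kmem_cache *s, void *x)
          {
                  kmemleak_free_recursive(x, s->flags);

          #if defined(CONFIG_KMEMCHECK) || defined(CONFIG_LOCKDEP)
                  {
                          unsigned long flags;

                          /* These debug calls expect irqs off; the lockless
                           * fastpath no longer guarantees that, so disable
                           * interrupts just around the calls. */
                          local_irq_save(flags);
                          kmemcheck_slab_free(s, x, s->objsize);
                          debug_check_no_locks_freed(x, s->objsize);
                          local_irq_restore(flags);
                  }
          #endif
          }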
  8. 05 Mar 2011 (3 commits)
  9. 01 Mar 2011 (1 commit)
  10. 27 Feb 2011 (1 commit)
  11. 26 Feb 2011 (2 commits)