1. 15 May 2012 · 2 commits
• mm: frontswap: core frontswap functionality · 29f233cf
  Committed by Dan Magenheimer
      This patch, 3of4, provides the core frontswap code that interfaces between
      the hooks in the swap subsystem and a frontswap backend via frontswap_ops.
      
      ---
      New file added: mm/frontswap.c
      
      [v14: add support for writethrough, per suggestion by aarcange@redhat.com]
      [v11: sjenning@linux.vnet.ibm.com: s/puts/failed_puts/]
      [v10: sjenning@linux.vnet.ibm.com: fix debugfs calls on 32-bit]
      [v9: akpm@linux-foundation.org: change "flush" to "invalidate", part 1]
      [v9: akpm@linux-foundation.org: mark some statics __read_mostly]
      [v9: akpm@linux-foundation.org: add clarifying comments]
      [v9: akpm@linux-foundation.org: no need to loop repeating try_to_unuse]
      [v9: error27@gmail.com: remove superfluous check for NULL]
      [v8: rebase to 3.0-rc4]
      [v8: kamezawa.hiroyu@jp.fujitsu.com: add comment to clarify find_next_to_unuse]
      [v7: rebase to 3.0-rc3]
      [v7: JBeulich@novell.com: use new static inlines, no-ops if not config'd]
      [v6: rebase to 3.1-rc1]
      [v6: lliubbo@gmail.com: use vzalloc]
      [v6: lliubbo@gmail.com: fix null pointer deref if vzalloc fails]
      [v6: konrad.wilk@oracl.com: various checks and code clarifications/comments]
      [v4: rebase to 2.6.39]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Acked-by: Jan Beulich <JBeulich@novell.com>
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Rik Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      [v12: Squashed s/flush/invalidate/ in]
[v15: A bit of cleanup and separate DEBUGFS]
Signed-off-by: Konrad Wilk <konrad.wilk@oracle.com>
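A frontswap backend plugs in through the frontswap_ops table that this file consumes. A minimal registration sketch, with callback names and signatures assumed from the description above rather than copied from the patch:

    #include <linux/frontswap.h>    /* header introduced by this series */

    /* Hypothetical backend callbacks: a swap "type" plus a page offset
     * identifies one swap page; put/get move a single page in or out. */
    static void my_init(unsigned type) { }
    static int  my_put_page(unsigned type, pgoff_t offset, struct page *page) { return -1; }
    static int  my_get_page(unsigned type, pgoff_t offset, struct page *page) { return -1; }
    static void my_invalidate_page(unsigned type, pgoff_t offset) { }
    static void my_invalidate_area(unsigned type) { }

    static struct frontswap_ops my_backend_ops = {
            .init            = my_init,
            .put_page        = my_put_page,
            .get_page        = my_get_page,
            .invalidate_page = my_invalidate_page,
            .invalidate_area = my_invalidate_area,
    };

    static int __init my_backend_init(void)
    {
            frontswap_register_ops(&my_backend_ops);
            return 0;
    }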
• mm: frontswap: core swap subsystem hooks and headers · 38b5faf4
  Committed by Dan Magenheimer
      This patch, 2of4, contains the changes to the core swap subsystem.
      This includes:
      
(1) makes available the core swap data structures (swap_lock, swap_list and
swap_info) that are needed by frontswap.c.  Since we don't need to expose them
to the dozens of files that include swap.h, a new swapfile.h is created just
to extern-ify these and modify their declarations to non-static.
      
      (2) adds frontswap-related elements to swap_info_struct.  Frontswap_map
      points to vzalloc'ed one-bit-per-swap-page metadata that indicates
      whether the swap page is in frontswap or in the device and frontswap_pages
      counts how many pages are in frontswap.
      
      (3) adds hooks in the swap subsystem and extends try_to_unuse so that
      frontswap_shrink can do a "partial swapoff".
      
      Note that a failed frontswap_map allocation is safe... failure is noted
      by lack of "FS" in the subsequent printk.
      
      ---
      
      [v14: rebase to 3.4-rc2]
      [v10: no change]
      [v9: akpm@linux-foundation.org: mark some statics __read_mostly]
      [v9: akpm@linux-foundation.org: add clarifying comments]
      [v9: akpm@linux-foundation.org: no need to loop repeating try_to_unuse]
      [v9: error27@gmail.com: remove superfluous check for NULL]
      [v8: rebase to 3.0-rc4]
      [v8: kamezawa.hiroyu@jp.fujitsu.com: change counter to atomic_t to avoid races]
      [v8: kamezawa.hiroyu@jp.fujitsu.com: comment to clarify informational counters]
      [v7: rebase to 3.0-rc3]
      [v7: JBeulich@novell.com: add new swap struct elements only if config'd]
      [v6: rebase to 3.0-rc1]
      [v6: lliubbo@gmail.com: fix null pointer deref if vzalloc fails]
      [v6: konrad.wilk@oracl.com: various checks and code clarifications/comments]
      [v5: no change from v4]
      [v4: rebase to 2.6.39]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Jan Beulich <JBeulich@novell.com>
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Rik Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      [v11: Rebased, fixed mm/swapfile.c context change]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
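A hedged sketch of the per-device metadata described in (2): one bit per swap page, vzalloc'ed so a large swap device does not need a contiguous allocation (names follow the changelog; treat the details as assumptions):

    #include <linux/vmalloc.h>
    #include <linux/bitops.h>

    /* One bit per swap page; a NULL return is tolerated: the device then
     * simply runs without frontswap, hence no "FS" in the printk. */
    static unsigned long *alloc_frontswap_map(unsigned long maxpages)
    {
            return vzalloc(BITS_TO_LONGS(maxpages) * sizeof(long));
    }

    /* When the backend accepts the page at 'offset' (sketch):
     *     set_bit(offset, sis->frontswap_map);
     *     atomic_inc(&sis->frontswap_pages);
     */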
2. 29 March 2012 · 7 commits
• radix-tree: use iterators in find_get_pages* functions · 0fc9d104
  Committed by Konstantin Khlebnikov
      Replace radix_tree_gang_lookup_slot() and
      radix_tree_gang_lookup_tag_slot() in page-cache lookup functions with
      brand-new radix-tree direct iterating.  This avoids the double-scanning
      and pointer copying.
      
The iterator does not stop after nr_pages page-get failures in a row; it
continues the lookup until the end of the radix tree.  Thus we can safely
remove these restart conditions.
      
Unfortunately, the old implementation didn't forbid nr_pages == 0.  This corner
case does not fit into the new code, so the patch adds an extra check at the
beginning.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Tested-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
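The shape of the new lookup loop, as a hedged sketch (RCU deref retries and exceptional-entry handling omitted):

    #include <linux/radix-tree.h>

    struct radix_tree_iter iter;
    void **slot;
    unsigned ret = 0;

    rcu_read_lock();
    radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
            struct page *page = radix_tree_deref_slot(slot);

            if (!page)
                    continue;
            /* the real code takes a speculative page reference and may retry here */
            pages[ret++] = page;
            if (ret == nr_pages)
                    break;
    }
    rcu_read_unlock();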
• mm: only IPI CPUs to drain local pages if they exist · 74046494
  Committed by Gilad Ben-Yossef
      Calculate a cpumask of CPUs with per-cpu pages in any zone and only send
      an IPI requesting CPUs to drain these pages to the buddy allocator if they
      actually have pages when asked to flush.
      
This patch saves 85%+ of the IPIs asking to drain per-cpu pages in case of
severe memory pressure that leads to OOM, since in these cases multiple,
possibly concurrent, allocation requests end up in the direct reclaim code
path.  The per-cpu pages are then reclaimed on the first allocation failure,
so for most of the following allocation attempts, until the memory pressure
is off (possibly via the OOM killer), there are no per-cpu pages on most
CPUs (and there can easily be hundreds of them).
      
This also has the side effect of shortening the average latency of direct
reclaim by one or more orders of magnitude, since waiting for all the CPUs to
ACK the IPI takes a long time.
      
Tested by running "hackbench 400" on an 8-CPU x86 VM and observing the
difference between the number of direct reclaim attempts that end up in
drain_all_pages() and those where more than half of the online CPUs had any
per-cpu page in them, using the vmstat counters introduced in the next patch
in the series and /proc/interrupts.
      
In the test scenario, this was seen to save around 3600 global
IPIs after triggering an OOM on a concurrent workload:
      
      $ cat /proc/vmstat | tail -n 2
      pcp_global_drain 0
      pcp_global_ipi_saved 0
      
      $ cat /proc/interrupts | grep CAL
      CAL:          1          2          1          2
                2          2          2          2   Function call interrupts
      
      $ hackbench 400
      [OOM messages snipped]
      
      $ cat /proc/vmstat | tail -n 2
      pcp_global_drain 3647
      pcp_global_ipi_saved 3642
      
      $ cat /proc/interrupts | grep CAL
      CAL:          6         13          6          3
                3          3         1 2          7   Function call interrupts
      
Please note that if the global drain is removed from the direct reclaim
path, as a patch from Mel Gorman currently suggests, this should be replaced
with an on_each_cpu_cond invocation.
Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Michal Nazarewicz <mina86@mina86.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
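A hedged sketch of the selective drain: build a cpumask of CPUs that actually hold per-cpu pages and IPI only those via on_each_cpu_mask(); cpu_has_pcp_pages() and drain_local_cb() are stand-ins, not names from the patch:

    #include <linux/cpumask.h>
    #include <linux/smp.h>

    static bool cpu_has_pcp_pages(int cpu);   /* hypothetical per-cpu check */

    static void drain_local_cb(void *unused)
    {
            /* stand-in: drain this CPU's per-cpu pages back to the buddy allocator */
    }

    static void drain_all_pages_sketch(void)
    {
            static cpumask_t cpus_with_pcps;
            int cpu;

            cpumask_clear(&cpus_with_pcps);
            for_each_online_cpu(cpu) {
                    /* stand-in for walking each zone's per-cpu pageset and
                     * checking pcp->count, as described above */
                    if (cpu_has_pcp_pages(cpu))
                            cpumask_set_cpu(cpu, &cpus_with_pcps);
            }
            /* IPI only the CPUs that have something to drain */
            on_each_cpu_mask(&cpus_with_pcps, drain_local_cb, NULL, 1);
    }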
• slub: only IPI CPUs that have per cpu obj to flush · a8364d55
  Committed by Gilad Ben-Yossef
flush_all() is called for each kmem_cache_destroy().  So every cache being
destroyed dynamically ends up sending an IPI to each CPU in the system,
regardless of whether the cache has ever been used there.
      
For example, if you close the InfiniBand ipath driver char device file, the
close file ops calls kmem_cache_destroy().  So running some InfiniBand
config tool on a single CPU dedicated to system tasks might interrupt
the rest of the 127 CPUs dedicated to some CPU-intensive or latency-sensitive
task.
      
      I suspect there is a good chance that every line in the output of "git
      grep kmem_cache_destroy linux/ | grep '\->'" has a similar scenario.
      
This patch attempts to rectify this issue by sending the IPI that flushes the
per-cpu objects back to the free lists only to CPUs that seem to have such
objects.
      
The check of which CPUs to IPI is racy, but we don't care, since asking a CPU
without per-cpu objects to flush does no damage and, as far as I can tell,
flush_all by itself is racy against allocs on remote CPUs anyway; so if you
required flush_all to be deterministic, you had to arrange for locking
regardless.
      
      Without this patch the following artificial test case:
      
      $ cd /sys/kernel/slab
      $ for DIR in *; do cat $DIR/alloc_calls > /dev/null; done
      
produces 166 IPIs on a cpuset-isolated CPU. With it, it produces none.
      
The memory allocation failure code path for the CPUMASK_OFFSTACK=y
config was tested using the fault injection framework.
Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Sasha Levin <levinsasha928@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Michal Nazarewicz <mina86@mina86.org>
      Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com>
      Cc: Milton Miller <miltonm@bga.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
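The same series introduces on_each_cpu_cond(), which wraps exactly this "ask first, IPI second" pattern; a hedged sketch of flush_all() using it (cpu_has_cached_objects() is a hypothetical stand-in for the per-cpu check):

    #include <linux/slab.h>
    #include <linux/smp.h>
    #include <linux/gfp.h>

    static bool cpu_has_cached_objects(struct kmem_cache *s, int cpu); /* hypothetical */
    static void flush_cpu_slab(void *d);   /* existing slub per-cpu flush callback */

    /* Racy by design, as the changelog notes: a stale answer only costs,
     * or saves, one harmless IPI. */
    static bool has_cpu_slab(int cpu, void *info)
    {
            struct kmem_cache *s = info;

            return cpu_has_cached_objects(s, cpu);
    }

    static void flush_all(struct kmem_cache *s)
    {
            /* flush_cpu_slab() runs only on CPUs for which has_cpu_slab() is true */
            on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1, GFP_ATOMIC);
    }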
• swapon: check validity of swap_flags · d15cab97
  Committed by Hugh Dickins
      Most system calls taking flags first check that the flags passed in are
      valid, and that helps userspace to detect when new flags are supported.
      
      But swapon never did so: start checking now, to help if we ever want to
      support more swap_flags in future.
      
      It's difficult to get stray bits set in an int, and swapon is not widely
      used, so this is most unlikely to break any userspace; but we can just
      revert if it turns out to do so.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
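The check boils down to rejecting any bit outside the currently defined set; a hedged sketch of the pattern in sys_swapon() (the mask macro name is an assumption):

    /* SWAP_FLAG_* come from include/linux/swap.h */
    #define SWAP_FLAGS_VALID  (SWAP_FLAG_DISCARD | SWAP_FLAG_PREFER | SWAP_FLAG_PRIO_MASK)

            if (swap_flags & ~SWAP_FLAGS_VALID)
                    return -EINVAL;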
• mm, coredump: fail allocations when coredumping instead of oom killing · 29fd66d2
  Committed by David Rientjes
The size of coredump files is limited by RLIMIT_CORE; however, allocating
large amounts of memory results in three negative consequences:
      
       - the coredumping process may be chosen for oom kill and quickly deplete
         all memory reserves in oom conditions preventing further progress from
         being made or tasks from exiting,
      
       - the coredumping process may cause other processes to be oom killed
         without fault of their own as the result of a SIGSEGV, for example, in
         the coredumping process, or
      
- the coredumping process may result in a livelock while writing to the
  dump file if it needs to allocate memory while other threads are in
  the exit path waiting on the coredumper to complete.
      
      This is fixed by implying __GFP_NORETRY in the page allocator for
      coredumping processes when reclaim has failed so the allocations fail and
      the process continues to exit.
Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
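A hedged sketch of the idea, not the exact hunk: once direct reclaim has failed, an allocation from a coredumping task gives up instead of retrying (assuming the PF_DUMPCORE task flag is how such a task is recognized):

    /* in the allocation slow path, after reclaim made no progress */
    if (current->flags & PF_DUMPCORE) {
            /* behave as if __GFP_NORETRY were set: fail the allocation so
             * the dumping process keeps making progress toward exit */
            goto nopage;
    }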
• mm: thp: fix up pmd_trans_unstable() locations · 45f83cef
  Committed by Andrea Arcangeli
      pmd_trans_unstable() should be called before pmd_offset_map() in the
      locations where the mmap_sem is held for reading.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
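The required ordering, as a generic hedged sketch of a page-table walker that holds mmap_sem only for reading (not a specific call site from the patch):

    pmd_t *pmd = pmd_offset(pud, addr);
    pte_t *pte;
    spinlock_t *ptl;

    /* with mmap_sem held for read, a parallel THP split/collapse can change
     * the pmd under us, so check it before touching the pte level */
    if (pmd_trans_unstable(pmd))
            return 0;       /* treat as empty, or retry, as the caller requires */

    pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
    /* ... walk the ptes ... */
    pte_unmap_unlock(pte, ptl);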
• mm for fs: add truncate_pagecache_range() · 623e3db9
  Committed by Hugh Dickins
      Holepunching filesystems ext4 and xfs are using truncate_inode_pages_range
      but forgetting to unmap pages first (ocfs2 remembers).  This is not really
      a bug, since races already require truncate_inode_page() to handle that
      case once the page is locked; but it can be very inefficient if the file
      being punched happens to be mapped into many vmas.
      
      Provide a drop-in replacement truncate_pagecache_range() which does the
      unmapping pass first, handling the awkward mismatch between arguments to
      truncate_inode_pages_range() and arguments to unmap_mapping_range().
      
      Note that holepunching does not unmap privately COWed pages in the range:
      POSIX requires that we do so when truncating, but it's hard to justify,
      difficult to implement without an i_size cutoff, and no filesystem is
      attempting to implement it.
Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
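For a hole-punching filesystem the two-step unmap-then-truncate becomes one helper; a hedged usage sketch (byte offsets, lend inclusive, matching truncate_inode_pages_range()):

    #include <linux/mm.h>

    /* punch 'len' bytes at byte 'offset': unmap any mappings of the range
     * first, then drop the page cache for it */
    truncate_pagecache_range(inode, offset, offset + len - 1);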
3. 25 March 2012 · 1 commit
4. 24 March 2012 · 4 commits
• coredump: add VM_NODUMP, MADV_NODUMP, MADV_CLEAR_NODUMP · accb61fe
  Committed by Jason Baron
Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit
for a 'VM_NODUMP' flag.  The idea is to add a new madvise() flag:
MADV_DONTDUMP, which can be set by applications to specifically request
memory regions which should not dump core.
      
      The specific application I have in mind is qemu: we can add a flag there
      that wouldn't dump all of guest memory when qemu dumps core.  This flag
      might also be useful for security sensitive apps that want to absolutely
      make sure that parts of memory are not dumped.  To clear the flag use:
      MADV_DODUMP.
      
      [akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland]
      [akpm@linux-foundation.org: fix up the architectures which broke]
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Roland McGrath <roland@hack.frob.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
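From userspace the new flags are ordinary madvise() advice values; a minimal usage sketch (buf and len describe an existing mapping):

    #include <stdio.h>
    #include <sys/mman.h>

    /* exclude a region (e.g. guest RAM in qemu) from core dumps ... */
    if (madvise(buf, len, MADV_DONTDUMP) != 0)
            perror("madvise(MADV_DONTDUMP)");

    /* ... and later opt it back in */
    if (madvise(buf, len, MADV_DODUMP) != 0)
            perror("madvise(MADV_DODUMP)");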
• coredump: remove VM_ALWAYSDUMP flag · 909af768
  Committed by Jason Baron
      The motivation for this patchset was that I was looking at a way for a
qemu-kvm process to exclude the guest memory from its core dump, which
      can be quite large.  There are already a number of filter flags in
      /proc/<pid>/coredump_filter, however, these allow one to specify 'types'
      of kernel memory, not specific address ranges (which is needed in this
      case).
      
      Since there are no more vma flags available, the first patch eliminates
      the need for the 'VM_ALWAYSDUMP' flag.  The flag is used internally by
      the kernel to mark vdso and vsyscall pages.  However, it is simple
      enough to check if a vma covers a vdso or vsyscall page without the need
      for this flag.
      
      The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new
      'VM_NODUMP' flag, which can be set by userspace using new madvise flags:
      'MADV_DONTDUMP', and unset via 'MADV_DODUMP'.  The core dump filters
      continue to work the same as before unless 'MADV_DONTDUMP' is set on the
      region.
      
The qemu code which implements this feature is at:
      
        http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch
      
      In my testing the qemu core dump shrunk from 383MB -> 13MB with this
      patch.
      
      I also believe that the 'MADV_DONTDUMP' flag might be useful for
      security sensitive apps, which might want to select which areas are
      dumped.
      
      This patch:
      
      The VM_ALWAYSDUMP flag is currently used by the coredump code to
      indicate that a vma is part of a vsyscall or vdso section.  However, we
can determine if a vma is in one of these sections by checking it against
      the gate_vma and checking for a non-NULL return value from
      arch_vma_name().  Thus, freeing a valuable vma bit.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Roland McGrath <roland@hack.frob.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
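The replacement test from the last paragraph can be sketched as follows (the helper name is an assumption, not taken from the patch):

    #include <linux/mm.h>

    /* a vma is always dumped if it is the gate vma or carries an
     * arch-specific name (vdso, vsyscall, ...), so no vma flag is needed */
    static bool vma_always_dump(struct vm_area_struct *vma)
    {
            if (vma == get_gate_vma(vma->vm_mm))
                    return true;
            return arch_vma_name(vma) != NULL;
    }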
• signal: oom_kill_task: use SEND_SIG_FORCED instead of force_sig() · d2d39309
  Committed by Oleg Nesterov
      Change oom_kill_task() to use do_send_sig_info(SEND_SIG_FORCED) instead
      of force_sig(SIGKILL).  With the recent changes we do not need force_ to
      kill the CLONE_NEWPID tasks.
      
And this is more correct.  force_sig() can race with the exiting thread
even if oom_kill_task() checks p->mm != NULL, while
do_send_sig_info(group => true) kills the whole process.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
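A sketch of the substitution described above (one line in oom_kill_task(); treat the exact arguments as an assumption):

    /* was: force_sig(SIGKILL, p); now a forced, whole-thread-group SIGKILL */
    do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);    /* group => true */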
• mm: hugetlb: cleanup duplicated code in unmapping vm range · 6629326b
  Committed by Hillf Danton
      Fix code duplication in __unmap_hugepage_range(), such as pte_page() and
      huge_pte_none().
Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5. 23 March 2012 · 1 commit
6. 22 March 2012 · 25 commits