1. 20 5月, 2012 1 次提交
    • H
      memcg,thp: fix res_counter:96 regression · 62ade86a
      Hugh Dickins 提交于
      Occasionally, testing memcg's move_charge_at_immigrate on rc7 shows
      a flurry of hundreds of warnings at kernel/res_counter.c:96, where
      res_counter_uncharge_locked() does WARN_ON(counter->usage < val).
      
      The first trace of each flurry implicates __mem_cgroup_cancel_charge()
      of mc.precharge, and an audit of mc.precharge handling points to
      mem_cgroup_move_charge_pte_range()'s THP handling in commit 12724850
      ("memcg: avoid THP split in task migration").
      
      Checking !mc.precharge is good everywhere else, when a single page is to
      be charged; but here the "mc.precharge -= HPAGE_PMD_NR" likely to
      follow, is liable to result in underflow (a lot can change since the
      precharge was estimated).
      
      Simply check against HPAGE_PMD_NR: there's probably a better
      alternative, trying precharge for more, splitting if unsuccessful; but
      this one-liner is safer for now - no kernel/res_counter.c:96 warnings
      seen in 26 hours.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      62ade86a
  2. 18 5月, 2012 1 次提交
    • M
      slub: missing test for partial pages flush work in flush_all() · 02e1a9cd
      majianpeng 提交于
      I found some kernel messages such as:
      
          SLUB raid5-md127: kmem_cache_destroy called for cache that still has objects.
          Pid: 6143, comm: mdadm Tainted: G           O 3.4.0-rc6+        #75
          Call Trace:
          kmem_cache_destroy+0x328/0x400
          free_conf+0x2d/0xf0 [raid456]
          stop+0x41/0x60 [raid456]
          md_stop+0x1a/0x60 [md_mod]
          do_md_stop+0x74/0x470 [md_mod]
          md_ioctl+0xff/0x11f0 [md_mod]
          blkdev_ioctl+0xd8/0x7a0
          block_ioctl+0x3b/0x40
          do_vfs_ioctl+0x96/0x560
          sys_ioctl+0x91/0xa0
          system_call_fastpath+0x16/0x1b
      
      Then using kmemleak I found these messages:
      
          unreferenced object 0xffff8800b6db7380 (size 112):
            comm "mdadm", pid 5783, jiffies 4294810749 (age 90.589s)
            hex dump (first 32 bytes):
              01 01 db b6 ad 4e ad de ff ff ff ff ff ff ff ff  .....N..........
              ff ff ff ff ff ff ff ff 98 40 4a 82 ff ff ff ff  .........@J.....
            backtrace:
              kmemleak_alloc+0x21/0x50
              kmem_cache_alloc+0xeb/0x1b0
              kmem_cache_open+0x2f1/0x430
              kmem_cache_create+0x158/0x320
              setup_conf+0x649/0x770 [raid456]
              run+0x68b/0x840 [raid456]
              md_run+0x529/0x940 [md_mod]
              do_md_run+0x18/0xc0 [md_mod]
              md_ioctl+0xba8/0x11f0 [md_mod]
              blkdev_ioctl+0xd8/0x7a0
              block_ioctl+0x3b/0x40
              do_vfs_ioctl+0x96/0x560
              sys_ioctl+0x91/0xa0
              system_call_fastpath+0x16/0x1b
      
      This bug was introduced by commit a8364d55 ("slub: only IPI CPUs that
      have per cpu obj to flush"), which did not include checks for per cpu
      partial pages being present on a cpu.
      Signed-off-by: Nmajianpeng <majianpeng@gmail.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Tested-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02e1a9cd
  3. 12 5月, 2012 1 次提交
    • H
      mm: raise MemFree by reverting percpu_pagelist_fraction to 0 · 1b76b02f
      Hugh Dickins 提交于
      Why is there less MemFree than there used to be?  It perturbed a test,
      so I've just been bisecting linux-next, and now find the offender went
      upstream yesterday.
      
      Commit 93278814 "mm: fix division by 0 in percpu_pagelist_fraction()"
      mistakenly initialized percpu_pagelist_fraction to the sysctl's minimum 8,
      which leaves 1/8th of memory on percpu lists (on each cpu??); but most of
      us expect it to be left unset at 0 (and it's not then used as a divisor).
      
        MemTotal: 8061476kB  8061476kB  8061476kB  8061476kB  8061476kB  8061476kB
        Repetitive test with percpu_pagelist_fraction 8:
        MemFree:  6948420kB  6237172kB  6949696kB  6840692kB  6949048kB  6862984kB
        Same test with percpu_pagelist_fraction back to 0:
        MemFree:  7945000kB  7944908kB  7948568kB  7949060kB  7948796kB  7948812kB
      Signed-off-by: NHugh Dickins <hughd@google.com>
      [ We really should fix the crazy sysctl interface too, but that's a
        separate thing - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1b76b02f
  4. 11 5月, 2012 4 次提交
  5. 10 5月, 2012 2 次提交
  6. 26 4月, 2012 6 次提交
    • S
      mm: fix NULL ptr dereference in move_pages · 6e8b09ea
      Sasha Levin 提交于
      Commit 3268c63e ("mm: fix move/migrate_pages() race on task struct") has
      added an odd construct where 'mm' is checked for being NULL, and if it is,
      it would get dereferenced anyways by mput()ing it.
      Signed-off-by: NSasha Levin <levinsasha928@gmail.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6e8b09ea
    • S
      mm: fix NULL ptr dereference in migrate_pages · f2a9ef88
      Sasha Levin 提交于
      Commit 3268c63e ("mm: fix move/migrate_pages() race on task struct") has
      added an odd construct where 'mm' is checked for being NULL, and if it is,
      it would get dereferenced anyways by mput()ing it.
      
      This would lead to the following NULL ptr deref and BUG() when calling
      migrate_pages() with a pid that has no mm struct:
      
      [25904.193704] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
      [25904.194235] IP: [<ffffffff810b0de7>] mmput+0x27/0xf0
      [25904.194235] PGD 773e6067 PUD 77da0067 PMD 0
      [25904.194235] Oops: 0002 [#1] PREEMPT SMP
      [25904.194235] CPU 2
      [25904.194235] Pid: 31608, comm: trinity Tainted: G        W    3.4.0-rc2-next-20120412-sasha #69
      [25904.194235] RIP: 0010:[<ffffffff810b0de7>]  [<ffffffff810b0de7>] mmput+0x27/0xf0
      [25904.194235] RSP: 0018:ffff880077d49e08  EFLAGS: 00010202
      [25904.194235] RAX: 0000000000000286 RBX: 0000000000000000 RCX: 0000000000000000
      [25904.194235] RDX: ffff880075ef8000 RSI: 000000000000023d RDI: 0000000000000286
      [25904.194235] RBP: ffff880077d49e18 R08: 0000000000000001 R09: 0000000000000001
      [25904.194235] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      [25904.194235] R13: 00000000ffffffea R14: ffff880034287740 R15: ffff8800218d3010
      [25904.194235] FS:  00007fc8b244c700(0000) GS:ffff880029800000(0000) knlGS:0000000000000000
      [25904.194235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [25904.194235] CR2: 0000000000000050 CR3: 00000000767c6000 CR4: 00000000000406e0
      [25904.194235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [25904.194235] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [25904.194235] Process trinity (pid: 31608, threadinfo ffff880077d48000, task ffff880075ef8000)
      [25904.194235] Stack:
      [25904.194235]  ffff8800342876c0 0000000000000000 ffff880077d49f78 ffffffff811b8020
      [25904.194235]  ffffffff811b7d91 ffff880075ef8000 ffff88002256d200 0000000000000000
      [25904.194235]  00000000000003ff 0000000000000000 0000000000000000 0000000000000000
      [25904.194235] Call Trace:
      [25904.194235]  [<ffffffff811b8020>] sys_migrate_pages+0x340/0x3a0
      [25904.194235]  [<ffffffff811b7d91>] ? sys_migrate_pages+0xb1/0x3a0
      [25904.194235]  [<ffffffff8266cbb9>] system_call_fastpath+0x16/0x1b
      [25904.194235] Code: c9 c3 66 90 55 31 d2 48 89 e5 be 3d 02 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 48 89 fb 48 c7 c7 cf 0e e1 82 e8 69 18 03 00 <f0> ff 4b 50 0f 94 c0 84 c0 0f 84 aa 00 00 00 48 89 df e8 72 f1
      [25904.194235] RIP  [<ffffffff810b0de7>] mmput+0x27/0xf0
      [25904.194235]  RSP <ffff880077d49e08>
      [25904.194235] CR2: 0000000000000050
      [25904.348999] ---[ end trace a307b3ed40206b4b ]---
      Signed-off-by: NSasha Levin <levinsasha928@gmail.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f2a9ef88
    • Y
      mm: fix up the vmscan stat in vmstat · 904249aa
      Ying Han 提交于
      The "pgsteal" stat is confusing because it counts both direct reclaim as
      well as background reclaim.  However, we have "kswapd_steal" which also
      counts background reclaim value.
      
      This patch fixes it and also makes it match the existng "pgscan_" stats.
      
      Test:
      pgsteal_kswapd_dma32 447623
      pgsteal_kswapd_normal 42272677
      pgsteal_kswapd_movable 0
      pgsteal_direct_dma32 2801
      pgsteal_direct_normal 44353270
      pgsteal_direct_movable 0
      Signed-off-by: NYing Han <yinghan@google.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      904249aa
    • K
      mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma · b1c12cbc
      Konstantin Khlebnikov 提交于
      Fix a gcc warning (and bug?) introduced in cc9a6c87 ("cpuset: mm: reduce
      large amounts of memory barrier related damage v3")
      
      Local variable "page" can be uninitialized if the nodemask from vma policy
      does not intersects with nodemask from cpuset.  Even if it doesn't happens
      it is better to initialize this variable explicitly than to introduce
      a kernel oops in a weird corner case.
      
      mm/hugetlb.c: In function `alloc_huge_page':
      mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1c12cbc
    • J
      mm: memcg: move pc lookup point to commit_charge() · ce587e65
      Johannes Weiner 提交于
      None of the callsites actually need the page_cgroup descriptor
      themselves, so just pass the page and do the look up in there.
      
      We already had two bugs (6568d4a9 'mm: memcg: update the correct soft
      limit tree during migration' and 'memcg: fix Bad page state after
      replace_page_cache') where the passed page and pc were not referring
      to the same page frame.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce587e65
    • D
      mm: nobootmem: Correct alloc_bootmem semantics. · 4e1c2b28
      David Miller 提交于
      The comments above __alloc_bootmem_node() claim that the code will
      first try the allocation using 'goal' and if that fails it will
      try again but with the 'goal' requirement dropped.
      
      Unfortunately, this is not what the code does, so fix it to do so.
      
      This is important for nobootmem conversions to architectures such
      as sparc where MAX_DMA_ADDRESS is infinity.
      
      On such architectures all of the allocations done by generic spots,
      such as the sparse-vmemmap implementation, will pass in:
      
      	__pa(MAX_DMA_ADDRESS)
      
      as the goal, and with the limit given as "-1" this will always fail
      unless we add the appropriate fallback logic here.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NYinghai Lu <yinghai@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e1c2b28
  7. 24 4月, 2012 1 次提交
    • H
      mm: fix s390 BUG by __set_page_dirty_no_writeback on swap · aca50bd3
      Hugh Dickins 提交于
      Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390
      3.0.13: called from __set_page_dirty_nobuffers() when page_remove_rmap()
      tries to transfer dirty flag from s390 storage key to struct page and
      radix_tree.
      
      That would be because of reclaim's shrink_page_list() calling
      add_to_swap() on this page at the same time: first PageSwapCache is set
      (causing page_mapping(page) to appear as &swapper_space), then
      page->private set, then tree_lock taken, then page inserted into
      radix_tree - so there's an interval before taking the lock when the
      radix_tree slot is empty.
      
      We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
      before the SetPageSwapCache.  But a better fix is simply to do what's
      five years overdue: Ken Chen introduced __set_page_dirty_no_writeback()
      (if !PageDirty TestSetPageDirty) for tmpfs to skip all the radix_tree
      overhead, and swap is just the same - it ignores the radix_tree tag, and
      does not participate in dirty page accounting, so should be using
      __set_page_dirty_no_writeback() too.
      
      s390 testing now confirms that this does indeed fix the problem.
      Reported-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aca50bd3
  8. 21 4月, 2012 5 次提交
  9. 19 4月, 2012 1 次提交
    • H
      memcg: fix Bad page state after replace_page_cache · 9b7f43af
      Hugh Dickins 提交于
      My 9ce70c02 "memcg: fix deadlock by inverting lrucare nesting" put a
      nasty little bug into v3.3's version of mem_cgroup_replace_page_cache(),
      sometimes used for FUSE.  Replacing __mem_cgroup_commit_charge_lrucare()
      by __mem_cgroup_commit_charge(), I used the "pc" pointer set up earlier:
      but it's for oldpage, and needs now to be for newpage.  Once oldpage was
      freed, its PageCgroupUsed bit (cleared above but set again here) caused
      "Bad page state" messages - and perhaps worse, being missed from newpage.
      (I didn't find this by using FUSE, but in reusing the function for tmpfs.)
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org [v3.3 only]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9b7f43af
  10. 13 4月, 2012 4 次提交
  11. 30 3月, 2012 1 次提交
  12. 29 3月, 2012 7 次提交
    • K
      radix-tree: use iterators in find_get_pages* functions · 0fc9d104
      Konstantin Khlebnikov 提交于
      Replace radix_tree_gang_lookup_slot() and
      radix_tree_gang_lookup_tag_slot() in page-cache lookup functions with
      brand-new radix-tree direct iterating.  This avoids the double-scanning
      and pointer copying.
      
      Iterator don't stop after nr_pages page-get fails in a row, it continue
      lookup till the radix-tree end.  Thus we can safely remove these restart
      conditions.
      
      Unfortunately, old implementation didn't forbid nr_pages == 0, this corner
      case does not fit into new code, so the patch adds an extra check at the
      beginning.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Tested-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0fc9d104
    • G
      mm: only IPI CPUs to drain local pages if they exist · 74046494
      Gilad Ben-Yossef 提交于
      Calculate a cpumask of CPUs with per-cpu pages in any zone and only send
      an IPI requesting CPUs to drain these pages to the buddy allocator if they
      actually have pages when asked to flush.
      
      This patch saves 85%+ of IPIs asking to drain per-cpu pages in case of
      severe memory pressure that leads to OOM since in these cases multiple,
      possibly concurrent, allocation requests end up in the direct reclaim code
      path so when the per-cpu pages end up reclaimed on first allocation
      failure for most of the proceeding allocation attempts until the memory
      pressure is off (possibly via the OOM killer) there are no per-cpu pages
      on most CPUs (and there can easily be hundreds of them).
      
      This also has the side effect of shortening the average latency of direct
      reclaim by 1 or more order of magnitude since waiting for all the CPUs to
      ACK the IPI takes a long time.
      
      Tested by running "hackbench 400" on a 8 CPU x86 VM and observing the
      difference between the number of direct reclaim attempts that end up in
      drain_all_pages() and those were more then 1/2 of the online CPU had any
      per-cpu page in them, using the vmstat counters introduced in the next
      patch in the series and using proc/interrupts.
      
      In the test sceanrio, this was seen to save around 3600 global
      IPIs after trigerring an OOM on a concurrent workload:
      
      $ cat /proc/vmstat | tail -n 2
      pcp_global_drain 0
      pcp_global_ipi_saved 0
      
      $ cat /proc/interrupts | grep CAL
      CAL:          1          2          1          2
                2          2          2          2   Function call interrupts
      
      $ hackbench 400
      [OOM messages snipped]
      
      $ cat /proc/vmstat | tail -n 2
      pcp_global_drain 3647
      pcp_global_ipi_saved 3642
      
      $ cat /proc/interrupts | grep CAL
      CAL:          6         13          6          3
                3          3         1 2          7   Function call interrupts
      
      Please note that if the global drain is removed from the direct reclaim
      path as a patch from Mel Gorman currently suggests this should be replaced
      with an on_each_cpu_cond invocation.
      Signed-off-by: NGilad Ben-Yossef <gilad@benyossef.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Acked-by: NMichal Nazarewicz <mina86@mina86.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      74046494
    • G
      slub: only IPI CPUs that have per cpu obj to flush · a8364d55
      Gilad Ben-Yossef 提交于
      flush_all() is called for each kmem_cache_destroy().  So every cache being
      destroyed dynamically ends up sending an IPI to each CPU in the system,
      regardless if the cache has ever been used there.
      
      For example, if you close the Infinband ipath driver char device file, the
      close file ops calls kmem_cache_destroy().  So running some infiniband
      config tool on one a single CPU dedicated to system tasks might interrupt
      the rest of the 127 CPUs dedicated to some CPU intensive or latency
      sensitive task.
      
      I suspect there is a good chance that every line in the output of "git
      grep kmem_cache_destroy linux/ | grep '\->'" has a similar scenario.
      
      This patch attempts to rectify this issue by sending an IPI to flush the
      per cpu objects back to the free lists only to CPUs that seem to have such
      objects.
      
      The check which CPU to IPI is racy but we don't care since asking a CPU
      without per cpu objects to flush does no damage and as far as I can tell
      the flush_all by itself is racy against allocs on remote CPUs anyway, so
      if you required the flush_all to be determinstic, you had to arrange for
      locking regardless.
      
      Without this patch the following artificial test case:
      
      $ cd /sys/kernel/slab
      $ for DIR in *; do cat $DIR/alloc_calls > /dev/null; done
      
      produces 166 IPIs on an cpuset isolated CPU. With it it produces none.
      
      The code path of memory allocation failure for CPUMASK_OFFSTACK=y
      config was tested using fault injection framework.
      Signed-off-by: NGilad Ben-Yossef <gilad@benyossef.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Sasha Levin <levinsasha928@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Michal Nazarewicz <mina86@mina86.org>
      Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com>
      Cc: Milton Miller <miltonm@bga.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a8364d55
    • H
      swapon: check validity of swap_flags · d15cab97
      Hugh Dickins 提交于
      Most system calls taking flags first check that the flags passed in are
      valid, and that helps userspace to detect when new flags are supported.
      
      But swapon never did so: start checking now, to help if we ever want to
      support more swap_flags in future.
      
      It's difficult to get stray bits set in an int, and swapon is not widely
      used, so this is most unlikely to break any userspace; but we can just
      revert if it turns out to do so.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d15cab97
    • D
      mm, coredump: fail allocations when coredumping instead of oom killing · 29fd66d2
      David Rientjes 提交于
      The size of coredump files is limited by RLIMIT_CORE, however, allocating
      large amounts of memory results in three negative consequences:
      
       - the coredumping process may be chosen for oom kill and quickly deplete
         all memory reserves in oom conditions preventing further progress from
         being made or tasks from exiting,
      
       - the coredumping process may cause other processes to be oom killed
         without fault of their own as the result of a SIGSEGV, for example, in
         the coredumping process, or
      
       - the coredumping process may result in a livelock while writing to the
         dump file if it needs memory to allocate while other threads are in
         the exit path waiting on the coredumper to complete.
      
      This is fixed by implying __GFP_NORETRY in the page allocator for
      coredumping processes when reclaim has failed so the allocations fail and
      the process continues to exit.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29fd66d2
    • A
      mm: thp: fix up pmd_trans_unstable() locations · 45f83cef
      Andrea Arcangeli 提交于
      pmd_trans_unstable() should be called before pmd_offset_map() in the
      locations where the mmap_sem is held for reading.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45f83cef
    • H
      mm for fs: add truncate_pagecache_range() · 623e3db9
      Hugh Dickins 提交于
      Holepunching filesystems ext4 and xfs are using truncate_inode_pages_range
      but forgetting to unmap pages first (ocfs2 remembers).  This is not really
      a bug, since races already require truncate_inode_page() to handle that
      case once the page is locked; but it can be very inefficient if the file
      being punched happens to be mapped into many vmas.
      
      Provide a drop-in replacement truncate_pagecache_range() which does the
      unmapping pass first, handling the awkward mismatch between arguments to
      truncate_inode_pages_range() and arguments to unmap_mapping_range().
      
      Note that holepunching does not unmap privately COWed pages in the range:
      POSIX requires that we do so when truncating, but it's hard to justify,
      difficult to implement without an i_size cutoff, and no filesystem is
      attempting to implement it.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      623e3db9
  13. 25 3月, 2012 1 次提交
  14. 24 3月, 2012 4 次提交
    • J
      coredump: add VM_NODUMP, MADV_NODUMP, MADV_CLEAR_NODUMP · accb61fe
      Jason Baron 提交于
      Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit
      for 'VM_NODUMP' flag.  The idea is is to add a new madvise() flag:
      MADV_DONTDUMP, which can be set by applications to specifically request
      memory regions which should not dump core.
      
      The specific application I have in mind is qemu: we can add a flag there
      that wouldn't dump all of guest memory when qemu dumps core.  This flag
      might also be useful for security sensitive apps that want to absolutely
      make sure that parts of memory are not dumped.  To clear the flag use:
      MADV_DODUMP.
      
      [akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland]
      [akpm@linux-foundation.org: fix up the architectures which broke]
      Signed-off-by: NJason Baron <jbaron@redhat.com>
      Acked-by: NRoland McGrath <roland@hack.frob.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      accb61fe
    • J
      coredump: remove VM_ALWAYSDUMP flag · 909af768
      Jason Baron 提交于
      The motivation for this patchset was that I was looking at a way for a
      qemu-kvm process, to exclude the guest memory from its core dump, which
      can be quite large.  There are already a number of filter flags in
      /proc/<pid>/coredump_filter, however, these allow one to specify 'types'
      of kernel memory, not specific address ranges (which is needed in this
      case).
      
      Since there are no more vma flags available, the first patch eliminates
      the need for the 'VM_ALWAYSDUMP' flag.  The flag is used internally by
      the kernel to mark vdso and vsyscall pages.  However, it is simple
      enough to check if a vma covers a vdso or vsyscall page without the need
      for this flag.
      
      The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new
      'VM_NODUMP' flag, which can be set by userspace using new madvise flags:
      'MADV_DONTDUMP', and unset via 'MADV_DODUMP'.  The core dump filters
      continue to work the same as before unless 'MADV_DONTDUMP' is set on the
      region.
      
      The qemu code which implements this features is at:
      
        http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch
      
      In my testing the qemu core dump shrunk from 383MB -> 13MB with this
      patch.
      
      I also believe that the 'MADV_DONTDUMP' flag might be useful for
      security sensitive apps, which might want to select which areas are
      dumped.
      
      This patch:
      
      The VM_ALWAYSDUMP flag is currently used by the coredump code to
      indicate that a vma is part of a vsyscall or vdso section.  However, we
      can determine if a vma is in one these sections by checking it against
      the gate_vma and checking for a non-NULL return value from
      arch_vma_name().  Thus, freeing a valuable vma bit.
      Signed-off-by: NJason Baron <jbaron@redhat.com>
      Acked-by: NRoland McGrath <roland@hack.frob.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Avi Kivity <avi@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      909af768
    • O
      signal: oom_kill_task: use SEND_SIG_FORCED instead of force_sig() · d2d39309
      Oleg Nesterov 提交于
      Change oom_kill_task() to use do_send_sig_info(SEND_SIG_FORCED) instead
      of force_sig(SIGKILL).  With the recent changes we do not need force_ to
      kill the CLONE_NEWPID tasks.
      
      And this is more correct.  force_sig() can race with the exiting thread
      even if oom_kill_task() checks p->mm != NULL, while
      do_send_sig_info(group => true) kille the whole process.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2d39309
    • H
      mm: hugetlb: cleanup duplicated code in unmapping vm range · 6629326b
      Hillf Danton 提交于
      Fix code duplication in __unmap_hugepage_range(), such as pte_page() and
      huge_pte_none().
      Signed-off-by: NHillf Danton <dhillf@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6629326b
  15. 23 3月, 2012 1 次提交