1. 25 Oct, 2013 (14 commits)
  2. 29 Aug, 2013 (1 commit)
• memcg: check that kmem_cache has memcg_params before accessing it · 6f6b8951
Authored by Andrey Vagin
If the system has had a few memory cgroups and all of them have been
destroyed, memcg_limited_groups_array_size still has a non-zero value, but
all new caches are created without memcg_params, because
memcg_kmem_enabled() returns false.
      
We try to enumerate child caches in a few places, and all of them are
potentially dangerous.
      
For example, my kernel is compiled with CONFIG_SLAB, and it crashed when I
tried to mount an NFS share after a few experiments with kmemcg.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff8118166a>] do_tune_cpucache+0x8a/0xd0
        PGD b942a067 PUD b999f067 PMD 0
        Oops: 0000 [#1] SMP
        Modules linked in: fscache(+) ip6table_filter ip6_tables iptable_filter ip_tables i2c_piix4 pcspkr virtio_net virtio_balloon i2c_core floppy
        CPU: 0 PID: 357 Comm: modprobe Not tainted 3.11.0-rc7+ #59
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff8800b9f98240 ti: ffff8800ba32e000 task.ti: ffff8800ba32e000
        RIP: 0010:[<ffffffff8118166a>]  [<ffffffff8118166a>] do_tune_cpucache+0x8a/0xd0
        RSP: 0018:ffff8800ba32fb70  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
        RDX: 0000000000000000 RSI: ffff8800b9f98910 RDI: 0000000000000246
        RBP: ffff8800ba32fba0 R08: 0000000000000002 R09: 0000000000000004
        R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000010
        R13: 0000000000000008 R14: 00000000000000d0 R15: ffff8800375d0200
        FS:  00007f55f1378740(0000) GS:ffff8800bfa00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00007f24feba57a0 CR3: 0000000037b51000 CR4: 00000000000006f0
        Call Trace:
          enable_cpucache+0x49/0x100
          setup_cpu_cache+0x215/0x280
          __kmem_cache_create+0x2fa/0x450
          kmem_cache_create_memcg+0x214/0x350
          kmem_cache_create+0x2b/0x30
          fscache_init+0x19b/0x230 [fscache]
          do_one_initcall+0xfa/0x1b0
          load_module+0x1c41/0x26d0
          SyS_finit_module+0x86/0xb0
          system_call_fastpath+0x16/0x1b
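
A minimal sketch of the kind of guard the fix adds, assuming a per-memcg
cache lookup helper in the style of the 3.11-era mm/ code (the helper name
and placement here are illustrative, not the verbatim patch):

  /* Sketch: bail out instead of dereferencing a missing memcg_params. */
  static inline struct kmem_cache *
  cache_from_memcg(struct kmem_cache *s, int idx)
  {
          if (!s->memcg_params)   /* cache created while kmemcg was off */
                  return NULL;
          return s->memcg_params->memcg_caches[idx];
  }

Callers that enumerate child caches then treat NULL as "no per-memcg
cache" instead of oopsing as in the trace above.
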
Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3. 28 Aug, 2013 (1 commit)
  4. 25 Aug, 2013 (1 commit)
  5. 24 Aug, 2013 (1 commit)
  6. 16 Aug, 2013 (1 commit)
• Fix TLB gather virtual address range invalidation corner cases · 2b047252
Authored by Linus Torvalds
      Ben Tebulin reported:
      
       "Since v3.7.2 on two independent machines a very specific Git
repository fails in 9/10 cases on git-fsck due to SHA1/memory
        failures.  This only occurs on a very specific repository and can be
        reproduced stably on two independent laptops.  Git mailing list ran
        out of ideas and for me this looks like some very exotic kernel issue"
      
      and bisected the failure to the backport of commit 53a59fc6 ("mm:
      limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").
      
      That commit itself is not actually buggy, but what it does is to make it
      much more likely to hit the partial TLB invalidation case, since it
      introduces a new case in tlb_next_batch() that previously only ever
      happened when running out of memory.
      
      The real bug is that the TLB gather virtual memory range setup is subtly
      buggered.  It was introduced in commit 597e1c35 ("mm/mmu_gather:
      enable tlb flush range in generic mmu_gather"), and the range handling
      was already fixed at least once in commit e6c495a9 ("mm: fix the TLB
      range flushed when __tlb_remove_page() runs out of slots"), but that fix
      was not complete.
      
      The problem with the TLB gather virtual address range is that it isn't
      set up by the initial tlb_gather_mmu() initialization (which didn't get
      the TLB range information), but it is set up ad-hoc later by the
      functions that actually flush the TLB.  And so any such case that forgot
      to update the TLB range entries would potentially miss TLB invalidates.
      
      Rather than try to figure out exactly which particular ad-hoc range
      setup was missing (I personally suspect it's the hugetlb case in
      zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
      did), this patch just gets rid of the problem at the source: make the
      TLB range information available to tlb_gather_mmu(), and initialize it
      when initializing all the other tlb gather fields.
      
      This makes the patch larger, but conceptually much simpler.  And the end
      result is much more understandable; even if you want to play games with
      partial ranges when invalidating the TLB contents in chunks, now the
      range information is always there, and anybody who doesn't want to
      bother with it won't introduce subtle bugs.
      
      Ben verified that this fixes his problem.
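
In outline, the change moves the range into the initializer so that no
flush path can forget it. A sketch of the shape of the fix, assuming the
mmu_gather fields of the time (details are illustrative, not the verbatim
patch):

  /* Sketch: take the VA range at setup instead of patching it in later. */
  void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
                      unsigned long start, unsigned long end)
  {
          tlb->mm     = mm;
          tlb->start  = start;  /* the range is now always initialized, */
          tlb->end    = end;    /* so partial flushes can rely on it    */
          tlb->fullmm = !(start | (end + 1));  /* 0..-1 means whole mm  */
          /* ... remaining mmu_gather fields set up as before ... */
  }
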
Reported-bisected-and-tested-by: Ben Tebulin <tebulin@googlemail.com>
Build-testing-by: Stephen Rothwell <sfr@canb.auug.org.au>
Build-testing-by: Richard Weinberger <richard.weinberger@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7. 14 Aug, 2013 (3 commits)
  8. 09 Aug, 2013 (1 commit)
• Revert "slub: do not put a slab to cpu partial list when cpu_partial is 0" · 37090506
Authored by Linus Torvalds
      This reverts commit 318df36e.
      
      This commit caused Steven Rostedt's hackbench runs to run out of memory
      due to a leak.  As noted by Joonsoo Kim, it is buggy in the following
      scenario:
      
       "I guess, you may set 0 to all kmem caches's cpu_partial via sysfs,
        doesn't it?
      
        In this case, memory leak is possible in following case.  Code flow of
        possible leak is follwing case.
      
         * in __slab_free()
         1. (!new.inuse || !prior) && !was_frozen
         2. !kmem_cache_debug && !prior
         3. new.frozen = 1
         4. after cmpxchg_double_slab, run the (!n) case with new.frozen=1
         5. with this patch, put_cpu_partial() doesn't do anything,
            because this cache's cpu_partial is 0
         6. return
      
In step 5, the leak occurs"
      
      And Steven does indeed have cpu_partial set to 0 due to RT testing.
      
      Joonsoo is cooking up a patch, but everybody agrees that reverting this
      for now is the right thing to do.
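
The leak is easiest to see at the put_cpu_partial() end: the reverted
commit made it return early when cpu_partial is 0, so a page that
__slab_free() had just frozen (step 3 above) was never queued anywhere.
Roughly, as a sketch rather than the verbatim reverted hunk:

  /* Sketch of the reverted early return in put_cpu_partial(). */
  static void put_cpu_partial(struct kmem_cache *s, struct page *page,
                              int drain)
  {
          if (!s->cpu_partial)
                  return;  /* reverted: drops a page just frozen by
                            * __slab_free(), leaking it */
          /* ... normal path: link page onto the per-cpu partial list ... */
  }
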
Reported-and-bisected-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9. 05 Aug, 2013 (1 commit)
  10. 01 Aug, 2013 (7 commits)
  11. 17 Jul, 2013 (1 commit)
  12. 15 Jul, 2013 (2 commits)
• kernel: delete __cpuinit usage from all core kernel files · 0db0628d
Authored by Paul Gortmaker
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications.  The fix in
commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bug that can be created
with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the uses of the __cpuinit macros from C files in
      the core kernel directories (kernel, init, lib, mm, and include)
      that don't really have a specific maintainer.
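
Each conversion is mechanical: drop the annotation so the function stays
in normal .text instead of a discardable section. An illustrative
before/after (the callback name here is hypothetical, not taken from this
commit):

  /* before: placed in a throwaway .cpuinit section */
  static int __cpuinit my_cpu_callback(struct notifier_block *nb,
                                       unsigned long action, void *hcpu);

  /* after: a plain function, always resident */
  static int my_cpu_callback(struct notifier_block *nb,
                             unsigned long action, void *hcpu);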
      
[1] https://lkml.org/lkml/2013/5/20/589
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
• slub: Check for page NULL before doing the node_match check · c25f195e
Authored by Steven Rostedt
      In the -rt kernel (mrg), we hit the following dump:
      
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      IP: [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
      PGD a2d39067 PUD b1641067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP
      Modules linked in: sunrpc cpufreq_ondemand ipv6 tg3 joydev sg serio_raw pcspkr k8temp amd64_edac_mod edac_core i2c_piix4 e100 mii shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom sata_svw ata_generic pata_acpi pata_serverworks radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
      CPU 3
      Pid: 20878, comm: hackbench Not tainted 3.6.11-rt25.14.el6rt.x86_64 #1 empty empty/Tyan Transport GT24-B3992
      RIP: 0010:[<ffffffff811573f1>]  [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
      RSP: 0018:ffff8800a9b17d70  EFLAGS: 00010213
      RAX: 0000000000000000 RBX: 0000000001200011 RCX: ffff8800a06d8000
      RDX: 0000000004d92a03 RSI: 00000000000000d0 RDI: ffff88013b805500
      RBP: ffff8800a9b17dc0 R08: ffff88023fd14d10 R09: ffffffff81041cbd
      R10: 00007f4e3f06e9d0 R11: 0000000000000246 R12: ffff88013b805500
      R13: ffff8801ff46af40 R14: 0000000000000001 R15: 0000000000000000
      FS:  00007f4e3f06e700(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000000 CR3: 00000000a2d3a000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process hackbench (pid: 20878, threadinfo ffff8800a9b16000, task ffff8800a06d8000)
      Stack:
       ffff8800a9b17da0 ffffffff81202e08 ffff8800a9b17de0 000000d001200011
       0000000001200011 0000000001200011 0000000000000000 0000000000000000
       00007f4e3f06e9d0 0000000000000000 ffff8800a9b17e60 ffffffff81041cbd
      Call Trace:
       [<ffffffff81202e08>] ? current_has_perm+0x68/0x80
       [<ffffffff81041cbd>] copy_process+0xdd/0x15b0
       [<ffffffff810a2125>] ? rt_up_read+0x25/0x30
       [<ffffffff8104369a>] do_fork+0x5a/0x360
       [<ffffffff8107c66b>] ? migrate_enable+0xeb/0x220
       [<ffffffff8100b068>] sys_clone+0x28/0x30
       [<ffffffff81527423>] stub_clone+0x13/0x20
       [<ffffffff81527152>] ? system_call_fastpath+0x16/0x1b
      Code: 89 fc 89 75 cc 41 89 d6 4d 8b 04 24 65 4c 03 04 25 48 ae 00 00 49 8b 50 08 4d 8b 28 49 8b 40 10 4d 85 ed 74 12 41 83 fe ff 74 27 <48> 8b 00 48 c1 e8 3a 41 39 c6 74 1b 8b 75 cc 4c 89 c9 44 89 f2
      RIP  [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
       RSP <ffff8800a9b17d70>
      CR2: 0000000000000000
      ---[ end trace 0000000000000002 ]---
      
      Now, this uses SLUB pretty much unmodified, but as it is the -rt kernel
      with CONFIG_PREEMPT_RT set, spinlocks are mutexes, although they do
      disable migration. But the SLUB code is relatively lockless, and the
      spin_locks there are raw_spin_locks (not converted to mutexes), thus I
      believe this bug can happen in mainline without -rt features. The -rt
      patch is just good at triggering mainline bugs ;-)
      
      Anyway, looking at where this crashed, it seems that the page variable
      can be NULL when passed to the node_match() function (which does not
check if it is NULL).  When this happens, we get the above panic.
      
      As page is only used in slab_alloc() to check if the node matches, if
      it's NULL I'm assuming that we can say it doesn't and call the
      __slab_alloc() code. Is this a correct assumption?
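
The resulting fix is a one-condition change in the allocation fast path:
treat a NULL page like a node mismatch and fall back to the slow path.
Approximately (a sketch of the guard with the surrounding fast-path code
elided; names follow the SLUB code of the time):

  /* Sketch: never let a NULL page reach node_match(), which would
   * dereference it; take the slow path instead. */
  if (unlikely(!object || !page || !node_match(page, node)))
          object = __slab_alloc(s, gfpflags, node, addr, c);
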
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13. 11 Jul, 2013 (3 commits)
• mm: remove free_area_cache · 98d1e64f
Authored by Michel Lespinasse
      Since all architectures have been converted to use vm_unmapped_area(),
      there is no remaining use for the free_area_cache.
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• zswap: add to mm/ · 2b281117
Authored by Seth Jennings
      zswap is a thin backend for frontswap that takes pages that are in the
      process of being swapped out and attempts to compress them and store
      them in a RAM-based memory pool.  This can result in a significant I/O
      reduction on the swap device and, in the case where decompressing from
      RAM is faster than reading from the swap device, can also improve
      workload performance.
      
      It also has support for evicting swap pages that are currently
      compressed in zswap to the swap device on an LRU(ish) basis.  This
      functionality makes zswap a true cache in that, once the cache is full,
      the oldest pages can be moved out of zswap to the swap device so newer
      pages can be compressed and stored in zswap.
      
This patch adds the zswap driver to mm/.
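
Structurally, zswap is a set of frontswap callbacks registered at init
time; the table below shows the shape of the hook-up (a sketch following
the zswap code added here; treat the exact member set as illustrative of
the 3.11-era frontswap API):

  /* Sketch: zswap plugs into the swap path via frontswap callbacks. */
  static struct frontswap_ops zswap_frontswap_ops = {
          .store           = zswap_frontswap_store,  /* compress and stash    */
          .load            = zswap_frontswap_load,   /* decompress on swap-in */
          .invalidate_page = zswap_frontswap_invalidate_page,
          .invalidate_area = zswap_frontswap_invalidate_area,
          .init            = zswap_frontswap_init,
  };
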
Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Jenifer Hopper <jhopper@us.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• zbud: add to mm/ · 4e2e2770
Authored by Seth Jennings
zbud is a special-purpose allocator for storing compressed pages.  It
is designed to store up to two compressed pages per physical page.
While this design limits storage density, it has simple and
deterministic reclaim properties that make it preferable to a
higher-density approach when reclaim will be used.
      
      zbud works by storing compressed pages, or "zpages", together in pairs
      in a single memory page called a "zbud page".  The first buddy is "left
justified" at the beginning of the zbud page, and the last buddy is
      "right justified" at the end of the zbud page.  The benefit is that if
      either buddy is freed, the freed buddy space, coalesced with whatever
      slack space that existed between the buddies, results in the largest
      possible free region within the zbud page.
      
      zbud also provides an attractive lower bound on density.  The ratio of
zpages to zbud pages cannot be less than 1.  This ensures that zbud can
      never "do harm" by using more pages to store zpages than the
      uncompressed zpages would have used on their own.
      
      This implementation is a rewrite of the zbud allocator internally used
by zcache in the drivers/staging tree.  The rewrite was necessary to
      remove some of the zcache specific elements that were ingrained
      throughout and provide a generic allocation interface that can later be
      used by zsmalloc and others.
      
      This patch adds zbud to mm/ for later use by zswap.
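
The generic interface is deliberately small and handle-based, so the
allocator keeps control of the underlying pages. A sketch of the API this
patch introduces (signatures are a best-effort reconstruction, not quoted
from the diff):

  /* Sketch of the zbud allocation API: opaque handles, not pointers. */
  struct zbud_pool *zbud_create_pool(gfp_t gfp, struct zbud_ops *ops);
  void zbud_destroy_pool(struct zbud_pool *pool);
  int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
                 unsigned long *handle);
  void zbud_free(struct zbud_pool *pool, unsigned long handle);
  int zbud_reclaim_page(struct zbud_pool *pool, int retries);
  void *zbud_map(struct zbud_pool *pool, unsigned long handle);   /* pin   */
  void zbud_unmap(struct zbud_pool *pool, unsigned long handle);  /* unpin */
  u64 zbud_get_pool_size(struct zbud_pool *pool);
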
Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Jenifer Hopper <jhopper@us.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14. 10 Jul, 2013 (3 commits)