1. 12 September 2013, 18 commits
    • mm/page_alloc.c: fix coding style and spelling · b8af2941
      Pintu Kumar committed
      Fix all errors reported by checkpatch and some small spelling mistakes.
      Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b8af2941
    • swap: make cluster allocation per-cpu · ebc2a1a6
      Shaohua Li committed
      Swap cluster allocation exists to get better request merging and thus better
      performance.  But the cluster is shared globally: if multiple tasks are
      swapping at the same time, the result is interleaved disk access.  Multiple
      tasks swapping concurrently is quite common; for example, each NUMA node has
      a kswapd thread doing swap, plus multiple threads/processes doing direct
      page reclaim.
      
      The I/O scheduler can't help much here, because tasks don't send swapout I/O
      down to the block layer at the same time.  The block layer does merge some
      I/Os, but many are not merged, depending on how many tasks are doing swapout
      concurrently.  In practice, I've seen a lot of small-size I/O in swapout
      workloads.
      
      We make the cluster allocation per-cpu here, so the interleaved disk access
      issue goes away.  Each task swaps out to its own cluster, so swapout becomes
      sequential and can easily be merged into large I/Os.  If a CPU can't get its
      per-cpu cluster (for example, there are no free clusters left in the swap
      area), it falls back to scanning swap_map, so the CPU can still continue to
      swap; we don't need to recycle the free swap entries of other CPUs.
      
      In my test (swap to a 2-disk raid0 partition), this improves swapout
      throughput by around 10%, and the request size is increased significantly.
      
      How this impacts swap readahead is uncertain, though.  On one hand, page
      reclaim always isolates and swaps several adjacent pages, which makes page
      reclaim write the pages sequentially and benefits readahead.  On the other
      hand, several CPUs writing pages in an interleaved fashion means the pages
      no longer live _sequentially_ but only relatively _near_ each other.  In the
      per-cpu allocation case, if adjacent pages are written by different CPUs,
      they end up relatively _far_ apart.  So the impact on swap readahead depends
      on how many pages page reclaim isolates and swaps at a time.  If the number
      is big, this patch will benefit swap readahead.  Of course, this is only
      about the sequential access pattern; the patch has no impact on random
      access, because the new cluster allocation algorithm is only used for SSD.
      
      An alternative solution is organizing the swap layout per-mm instead of
      per-cpu.  In the per-mm layout we allocate a disk range for each mm, so the
      pages of one mm live adjacently on the swap disk.  The per-mm layout has
      potential lock contention issues if multiple reclaimers are swapping pages
      from the same mm.  For a sequential workload, the per-mm layout is better
      for implementing swap readahead, because pages from the mm are adjacent on
      disk.  But the per-cpu layout isn't very bad for this workload either: page
      reclaim always isolates and swaps several pages at a time, so such pages
      still live sequentially on disk and readahead can take advantage of that.
      For a random workload, the per-mm layout doesn't help request merging,
      because pages from different mms are quite likely to be swapped out at the
      same time and their I/O can't be merged, while with the per-cpu layout we
      can merge requests from any mm.  Considering that random workloads are more
      common among workloads that swap (and the per-cpu approach isn't too bad for
      sequential workloads either), I'm choosing the per-cpu layout.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebc2a1a6
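
      A minimal sketch of the per-CPU cluster idea described above (this is not
      the actual mm/swapfile.c code; CLUSTER_PAGES, the struct and the callback
      parameters are illustrative assumptions):

        /* Hedged sketch: each CPU owns a current cluster of swap slots, so
         * concurrent swapout streams stay sequential on disk. */
        #include <linux/percpu.h>

        #define CLUSTER_PAGES 256   /* slots per cluster, as in the description */

        struct percpu_cluster_sketch {      /* illustrative, not the kernel struct */
            unsigned long idx;              /* cluster owned by this CPU; 0 = none */
            unsigned int next;              /* next free slot inside that cluster */
        };

        static DEFINE_PER_CPU(struct percpu_cluster_sketch, swap_cluster_sketch);

        /* Allocate a swap slot from this CPU's cluster; fall back to the global
         * scan when no free cluster is available, so swapping never stops. */
        static unsigned long alloc_swap_slot_sketch(unsigned long (*grab_free_cluster)(void),
                                                    unsigned long (*scan_swap_map_fallback)(void))
        {
            struct percpu_cluster_sketch *c = get_cpu_ptr(&swap_cluster_sketch);
            unsigned long slot;

            if (!c->idx || c->next >= CLUSTER_PAGES) {
                c->idx = grab_free_cluster();       /* returns 0 when none is free */
                c->next = 0;
            }
            if (c->idx)
                slot = c->idx * CLUSTER_PAGES + c->next++;
            else
                slot = scan_swap_map_fallback();    /* keep swapping without a cluster */

            put_cpu_ptr(&swap_cluster_sketch);
            return slot;
        }
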
    • swap: fix races exposed by swap discard · edfe23da
      Shaohua Li committed
      The previous patch can expose races, according to Hugh:
      
      swapoff was sometimes failing with "Cannot allocate memory", coming from
      try_to_unuse()'s -ENOMEM: it needs to allow for swap_duplicate() failing
      on a free entry temporarily SWAP_MAP_BAD while being discarded.
      
      We should use ACCESS_ONCE() there, and whenever accessing swap_map
      locklessly; but rather than peppering it throughout try_to_unuse(), just
      declare *swap_map with volatile.
      
      try_to_unuse() is accustomed to *swap_map going down racily, but not
      necessarily to it jumping up from 0 to SWAP_MAP_BAD: we'll be safer to
      prevent that transition once SWP_WRITEOK is switched off, when it's a
      waste of time to issue discards anyway (swapon can do a whole discard).
      
      Another issue is:
      
      In swapin_readahead(), read_swap_cache_async() can read a bad swap entry,
      because we don't check whether the readahead swap entry is bad.  This
      doesn't break anything, but such a swapped-in page is wasteful and can only
      be freed at page reclaim time, so we should avoid reading such swap entries.
      During discard we mark the swap entry SWAP_MAP_BAD and then switch it back
      to normal when the discard is finished.  If readahead reads such a swap
      entry, we have the same issue, so we must check whether the swap entry is
      bad there too.
      
      Thanks to Hugh for pointing out that swapin_readahead() could read a bad
      swap entry.
      
      [include Hugh's patch 'swap: fix swapoff ENOMEMs from discard']
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      edfe23da
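
      A hedged sketch of the two points above: read swap_map locklessly with
      ACCESS_ONCE() (the primitive of that era) and skip entries that are
      temporarily SWAP_MAP_BAD, e.g. while their cluster is being discarded.  The
      helper and its shape are illustrative; only ACCESS_ONCE() and SWAP_MAP_BAD
      come from the description:

        #include <linux/compiler.h>
        #include <linux/types.h>
        #include <linux/swap.h>

        /* Hedged sketch: decide whether a readahead candidate entry is worth
         * reading, based on a single lockless snapshot of its swap_map count. */
        static bool readahead_entry_usable_sketch(volatile unsigned char *swap_map,
                                                  unsigned long offset)
        {
            unsigned char count = ACCESS_ONCE(swap_map[offset]);

            if (count == SWAP_MAP_BAD)      /* temporarily bad, e.g. being discarded */
                return false;
            return count != 0;              /* only read ahead entries that are in use */
        }
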
    • swap: make swap discard async · 815c2c54
      Shaohua Li committed
      Swap can do cluster discard for SSD, which is good, but there are some
      problems here:
      
      1. swap does the discard just before page reclaim gets a swap entry and
         writes the disk sectors.  This is useless for a high-end SSD, because an
         overwrite of a sector implies a discard of the original sector anyway:
         discard + overwrite == overwrite.
      
      2. the purpose of doing discard is to improve SSD firmware garbage
         collection.  Ideally we should send the discard as early as possible, so
         the firmware can do something smart.  Sending the discard just after a
         swap entry is freed is early compared to sending it just before the
         write.  Of course, if the workload is already bound by gc speed, sending
         the discard earlier or later doesn't make much difference.
      
      3. block discard is a synchronous API, which delays scan_swap_map()
         significantly.
      
      4. write and discard commands can be executed in parallel on a PCIe SSD.
         Making swap discard async lets them execute more efficiently.
      
      This patch makes swap discard async and moves the discard to where the swap
      entry is freed.  Discard and write have no dependency now, so the above
      issues are avoided.  Ideally we should discard any freed sectors, but
      discard is very slow on some SSDs, so this patch still only discards whole
      clusters.
      
      My test does several rounds of 'mmap, write, unmap', which triggers a lot
      of swap discard.  On a fusionio card, with this patch, the test runtime is
      reduced to 18% of the time without it, so around 5.5x faster.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      815c2c54
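
      A hedged sketch of deferring the discard to a workqueue so the path that
      frees swap entries never waits on the synchronous block discard.  The
      struct, the function names and issue_discard_range() are illustrative
      stand-ins, not the real swap_info_struct machinery:

        #include <linux/workqueue.h>
        #include <linux/spinlock.h>

        struct discard_state_sketch {
            struct work_struct work;
            spinlock_t lock;
            unsigned long pending_cluster;  /* 0 means nothing pending (sketch only) */
        };

        /* Stand-in for the real synchronous block-layer discard of one cluster. */
        static void issue_discard_range(unsigned long cluster_idx)
        {
        }

        static void swap_discard_worker_sketch(struct work_struct *work)
        {
            struct discard_state_sketch *ds =
                container_of(work, struct discard_state_sketch, work);
            unsigned long idx;

            spin_lock(&ds->lock);
            idx = ds->pending_cluster;
            ds->pending_cluster = 0;
            spin_unlock(&ds->lock);

            if (idx)
                issue_discard_range(idx);   /* the slow part now runs off the hot path */
        }

        /* Called where a swap cluster becomes free: record it and return at once. */
        static void queue_cluster_discard_sketch(struct discard_state_sketch *ds,
                                                 unsigned long cluster_idx)
        {
            spin_lock(&ds->lock);
            ds->pending_cluster = cluster_idx;
            spin_unlock(&ds->lock);
            schedule_work(&ds->work);       /* async: scan_swap_map() no longer waits */
        }

        static void init_discard_state_sketch(struct discard_state_sketch *ds)
        {
            spin_lock_init(&ds->lock);
            ds->pending_cluster = 0;
            INIT_WORK(&ds->work, swap_discard_worker_sketch);
        }
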
    • swap: change block allocation algorithm for SSD · 2a8f9449
      Shaohua Li committed
      I'm using a fast SSD for swap.  scan_swap_map() sometimes uses up to 20~30%
      CPU time (when a cluster is hard to find, the CPU time can reach 80%), which
      becomes a bottleneck.  scan_swap_map() scans a byte array to search for a
      256-page cluster, which is very slow.
      
      Here I introduce a simple algorithm to search for clusters.  Since we only
      care about 256-page clusters, we can just use a counter to track whether a
      cluster is free: every 256 pages use one int to store the counter, and if
      the counter of a cluster is 0, the cluster is free.  All free clusters are
      kept on a list, so searching for a cluster is very efficient.  With this,
      the scan_swap_map() overhead disappears.
      
      This might help low-end SD card swap too, because if the cluster is aligned,
      the SD firmware can do flash erase more efficiently.
      
      We only enable the algorithm for SSD.  Hard disk swap isn't fast enough, and
      the algorithm has a downside for it that might introduce a regression (see
      below).
      
      The patch slightly changes which cluster is chosen.  It always adds a free
      cluster to the list tail, which can also help wear leveling on low-end SSDs.
      And if no free cluster is found, scan_swap_map() searches from the end of
      the last free cluster, which is effectively random.  For SSD this isn't a
      problem at all.
      
      Another downside is that the cluster must be aligned to 256 pages, which
      reduces the chance of finding a cluster.  I would expect this isn't a big
      problem for SSD because there is no seek penalty (and this is the reason I
      only enable the algorithm for SSD).
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a8f9449
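
      A hedged sketch of the counter-per-cluster idea above: one usage count per
      256-page cluster plus a free list, so allocation never has to scan the
      swap_map byte array (the names are illustrative):

        #include <linux/list.h>

        #define CLUSTER_PAGES 256

        struct swap_cluster_sketch {            /* one per 256-page cluster */
            unsigned int count;                 /* used slots; 0 means the cluster is free */
            struct list_head list;              /* linked into free_clusters when free */
        };

        static LIST_HEAD(free_clusters_sketch);

        /* O(1) grab of a free cluster instead of scanning 256 swap_map bytes. */
        static struct swap_cluster_sketch *grab_free_cluster_sketch(void)
        {
            struct swap_cluster_sketch *ci;

            if (list_empty(&free_clusters_sketch))
                return NULL;                    /* caller falls back to scan_swap_map() */

            ci = list_first_entry(&free_clusters_sketch,
                                  struct swap_cluster_sketch, list);
            list_del(&ci->list);
            return ci;
        }

        /* When the last used slot of a cluster is freed, append it at the *tail*
         * of the free list, which also helps wear leveling. */
        static void put_cluster_slot_sketch(struct swap_cluster_sketch *ci)
        {
            if (--ci->count == 0)
                list_add_tail(&ci->list, &free_clusters_sketch);
        }
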
    • mm/page_alloc.c: use '__paginginit' instead of '__init' · 15ca220e
      Chen Gang committed
      set_pageblock_order() may be called during memory hotplug, so it needs to
      use '__paginginit' instead of '__init'.
      
      The related warning:
      
        The function __meminit .free_area_init_node() references
        a function __init .set_pageblock_order().
        If .set_pageblock_order is only used by .free_area_init_node then
        annotate .set_pageblock_order with a matching annotation.
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15ca220e
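
      A minimal, hedged illustration of the annotation change; the function body
      is elided and the exact signature in mm/page_alloc.c may differ:

        /* free_area_init_node() is __meminit, so it may run again at memory
         * hotplug time; anything it calls must not be discarded after boot. */

        /* Before: discarded after boot, leaving a dangling reference:
         *   static void __init set_pageblock_order(void) { ... }
         * After: __paginginit, which (by assumption here) resolves to __meminit
         * on configurations where this init code must stay resident. */
        static void __paginginit set_pageblock_order(void)
        {
            /* ... choose pageblock_order from MAX_ORDER / hugepage config ... */
        }
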
    • mm: fix negative left shift count when PAGE_SHIFT > 20 · a7e83318
      Jerry Zhou committed
      When PAGE_SHIFT > 20, the result of "20 - PAGE_SHIFT" is negative, so the
      previous calculation here generates an unexpected result.  In addition, if
      PAGE_SIZE >= 1MB, the memory size covered by "numentries" is already an
      integral multiple of 1MB.
      Signed-off-by: Jerry Zhou <uulinux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a7e83318
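
      A hedged sketch of the guard this implies for the alloc_large_system_hash()
      style rounding: only round numentries up to a 1MB multiple when a page is
      smaller than 1MB, so the shift count "20 - PAGE_SHIFT" can never go negative
      (the helper is illustrative; the in-tree code may differ):

        #include <linux/kernel.h>
        #include <linux/mm.h>

        static unsigned long round_hash_entries_sketch(unsigned long numentries)
        {
            /* With PAGE_SIZE >= 1MB the count already covers whole megabytes, and
             * (1UL << (20 - PAGE_SHIFT)) would be an undefined negative shift. */
            if (PAGE_SHIFT < 20)
                numentries = round_up(numentries, (1UL << 20) / PAGE_SIZE);
            return numentries;
        }
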
    • mm: replace strict_strtoul() with kstrtoul() · 3dbb95f7
      Jingoo Han committed
      The use of strict_strtoul() is not preferred, because strict_strtoul() is
      obsolete.  Thus, kstrtoul() should be used.
      Signed-off-by: Jingoo Han <jg1.han@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3dbb95f7
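
      For reference, a small hedged example of the replacement API; the
      store-handler shape around it is illustrative:

        #include <linux/kernel.h>

        /* Parse a decimal value the modern way; kstrtoul() returns 0 on success
         * or a negative errno, so errors propagate cleanly to the caller. */
        static ssize_t example_store_sketch(const char *buf, size_t count,
                                            unsigned long *out)
        {
            unsigned long val;
            int err;

            err = kstrtoul(buf, 10, &val);      /* base 10 */
            if (err)
                return err;

            *out = val;
            return count;
        }
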
    • mm: vmstats: track TLB flush stats on UP too · 6df46865
      Dave Hansen committed
      The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
      counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
      for SMP.
      
      UP systems do not do remote TLB flushes, so compile those counters out on
      UP.
      
      arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly.  This is
      probably an optimization since both the mtrr code and __flush_tlb() write
      cr4.  It would probably be safe to make that a flush_tlb_all() (and then
      get these statistics), but the mtrr code is ancient and I'm hesitant to
      touch it other than to just stick in the counters.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6df46865
    • mm: vmstats: tlb flush counters · 9824cf97
      Dave Hansen committed
      I was investigating some TLB flush scaling issues and realized that we do
      not have any good methods for figuring out how many TLB flushes we are
      doing.
      
      It would be nice to be able to do these in generic code, but the
      arch-independent calls don't explicitly specify whether we actually need
      to do remote flushes or not.  In the end, we really need to know if we
      actually _did_ global vs.  local invalidations, so that leaves us with few
      options other than to muck with the counters from arch-specific code.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9824cf97
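
      A hedged sketch of what the accounting looks like: a vmstat event bumped
      from the arch-specific flush path.  The patch adds NR_TLB_* style items to
      enum vm_event_item; the exact item name used below is assumed from the
      description, and the wrapper is illustrative:

        #include <linux/vmstat.h>

        /* Count a full local TLB flush, then do it; the counter shows up in
         * /proc/vmstat alongside the other vm events. */
        static void flush_tlb_all_counted_sketch(void (*arch_flush_tlb_all)(void))
        {
            count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);     /* assumed item name */
            arch_flush_tlb_all();
        }
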
    • mm/zswap.c: get swapper address_space by using macro · 822518dc
      Sunghan Suh committed
      There is a proper macro to get the corresponding swapper address space from
      a swap entry.  Instead of directly accessing the "swapper_spaces" array, use
      the "swap_address_space" macro.
      Signed-off-by: Sunghan Suh <sunghan.suh@samsung.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      822518dc
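
      A hedged before/after sketch of the change; the surrounding lookup helper is
      illustrative, only swap_address_space() and swapper_spaces come from the
      description:

        #include <linux/swap.h>
        #include <linux/pagemap.h>

        /* Look up the swap cache page backing a swap entry. */
        static struct page *find_swapcache_page_sketch(swp_entry_t entry)
        {
            /* Before: the code indexed &swapper_spaces[swp_type(entry)] directly. */
            /* After: the macro hides how the address space is picked. */
            return find_get_page(swap_address_space(entry), swp_offset(entry));
        }
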
    • mm: mmap_region: kill correct_wcount/inode, use allow_write_access() · e8686772
      Oleg Nesterov committed
      correct_wcount and inode in mmap_region() just complicate the code.  This
      boolean was needed previously, when deny_write_access() was called before
      vma_merge(); now we can simply check VM_DENYWRITE and call
      allow_write_access() if it is set.
      
      allow_write_access() checks file != NULL, so this is safe even if it were
      possible to have VM_DENYWRITE && !file.  We just need to ensure we use the
      same file that was deny_write_access()'ed, so the patch also moves "file =
      vma->vm_file" down after allow_write_access().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Colin Cross <ccross@android.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8686772
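
      A hedged, heavily condensed sketch of the resulting shape: the VM_DENYWRITE
      flag alone decides whether the deny/allow pair is needed, with no
      correct_wcount/inode bookkeeping.  The wrapper and its do_map callback are
      illustrative:

        #include <linux/fs.h>
        #include <linux/mm.h>

        static int map_with_denywrite_sketch(struct file *file, unsigned long vm_flags,
                                             int (*do_map)(struct file *file))
        {
            int error = 0;

            if (file && (vm_flags & VM_DENYWRITE)) {
                error = deny_write_access(file);
                if (error)
                    return error;
            }

            error = do_map(file);               /* stands in for the real mapping work */

            if (vm_flags & VM_DENYWRITE)
                allow_write_access(file);       /* tolerates file == NULL itself */

            return error;
        }
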
    • mm: do_mmap_pgoff: cleanup the usage of file_inode() · 077bf22b
      Oleg Nesterov committed
      Simple cleanup.  Move the "struct inode *inode" variable into the "if (file)"
      block to simplify the code and avoid the unnecessary check.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Colin Cross <ccross@android.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      077bf22b
    • mm: shift VM_GROWS* check from mmap_region() to do_mmap_pgoff() · b2c56e4f
      Oleg Nesterov committed
      mmap() doesn't allow non-anonymous mappings with a VM_GROWS* bit set.  In
      particular this means that mmap_region()->vma_merge(file, vm_flags) must
      always fail if "vm_flags & VM_GROWS*" is set incorrectly.
      
      So it does not make sense to check VM_GROWS* after we have already allocated
      the new vma; the only caller that can pass this flag, do_mmap_pgoff(), can
      do the check itself.
      
      This also looks a bit more correct: mmap_region() has already unmapped the
      old mapping at this stage, but if mmap() is going to fail, it should avoid
      do_munmap() if possible.
      
      Note: we check VM_GROWS* at the end to ensure that do_mmap_pgoff() won't
      return EINVAL in cases where it currently returns another error code.
      
      Many thanks to Hugh who nacked the buggy v1.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2c56e4f
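
      A hedged, condensed sketch of where the check ends up: in do_mmap_pgoff(),
      after the other per-file validation (so existing error codes are preserved)
      and before mmap_region() starts undoing old mappings.  The wrapper function
      is illustrative:

        #include <linux/mm.h>
        #include <linux/errno.h>

        static unsigned long do_mmap_pgoff_tail_sketch(struct file *file,
                                                       unsigned long addr,
                                                       unsigned long len,
                                                       vm_flags_t vm_flags,
                                                       unsigned long pgoff)
        {
            if (file) {
                /* ... per-file-type vm_flags validation, may return other errors ... */

                /* Checked last so the other failure cases keep their original
                 * error codes; a file-backed mapping must never have VM_GROWS*. */
                if (vm_flags & (VM_GROWSDOWN | VM_GROWSUP))
                    return -EINVAL;
            }

            return mmap_region(file, addr, len, vm_flags, pgoff);
        }
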
    • mm/swapfile.c: convert to pr_foo() · 465c47fd
      Andrew Morton committed
      A few 80-col gymnastics were cleaned up as a result.
      
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      465c47fd
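
      A hedged before/after example of the conversion (the message text is
      illustrative); pr_err() and friends also pick up a file-local pr_fmt()
      prefix, which is part of why the converted lines get shorter:

        #include <linux/printk.h>

        static void report_swapfile_error_sketch(const char *name, int err)
        {
            /* Before: printk(KERN_ERR "swapfile: %s: error %d\n", name, err); */
            pr_err("swapfile: %s: error %d\n", name, err);
        }
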
    • swap: warn when a swap area overflows the maximum size · d6bbbd29
      Raymond Jennings committed
      It is possible to swapon a swap area that is too big for the pte width
      to handle.
      
      Presently this failure happens silently.
      
      Instead, emit a diagnostic to warn the user.
      
      Testing results, root prompt commands and kernel log messages:
      
      # lvresize /dev/system/swap --size 16G
      # mkswap /dev/system/swap
      # swapon /dev/system/swap
      
      Jul  7 04:27:22 warfang kernel: Adding 16777212k swap
      on /dev/mapper/system-swap.  Priority:-1 extents:1 across:16777212k
      
      # lvresize /dev/system/swap --size 64G
      # mkswap /dev/system/swap
      # swapon /dev/system/swap
      
      Jul  7 04:27:22 warfang kernel: Truncating oversized swap area, only
      using 33554432k out of 67108860k
      Jul  7 04:27:22 warfang kernel: Adding 33554428k swap
      on /dev/mapper/system-swap.  Priority:-1 extents:1 across:33554428k
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: Raymond Jennings <shentino@gmail.com>
      Acked-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6bbbd29
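
      A hedged sketch of the diagnostic: compare the size from the swap header
      against the maximum number of pages the pte swap encoding can address, warn
      with the message seen in the log above, and truncate.  The helper and its
      parameters are illustrative:

        #include <linux/kernel.h>
        #include <linux/printk.h>
        #include <linux/mm.h>

        /* last_page comes from the swap header; maxpages is derived from the
         * pte swap-offset width.  Returns the page count actually used. */
        static unsigned long clamp_swap_size_sketch(unsigned long last_page,
                                                    unsigned long maxpages)
        {
            if (last_page > maxpages)
                pr_warn("Truncating oversized swap area, only using %luk out of %luk\n",
                        maxpages << (PAGE_SHIFT - 10),
                        last_page << (PAGE_SHIFT - 10));

            return min(last_page, maxpages);
        }
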
    • mm/madvise.c: fix coding-style errors · ec9bed9d
      Vladimir Cernov committed
      This fixes the following errors:
      	- ERROR: "(foo*)" should be "(foo *)"
      	- ERROR: "foo ** bar" should be "foo **bar"
      Signed-off-by: Vladimir Cernov <gg.kaspersky@gmail.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec9bed9d
    • mm: mempolicy: turn vma_set_policy() into vma_dup_policy() · ef0855d3
      Oleg Nesterov committed
      Simple cleanup.  Every user of vma_set_policy() does the same work, which
      looks a bit annoying imho.  Add a new trivial helper that does mpol_dup() +
      vma_set_policy() to simplify the callers.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef0855d3
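
      A hedged sketch of what such a helper looks like; the real vma_dup_policy()
      may differ in details (for example the !CONFIG_NUMA stub):

        #include <linux/mempolicy.h>
        #include <linux/mm.h>
        #include <linux/err.h>

        /* Duplicate the mempolicy of @src and install it on @dst, replacing the
         * open-coded mpol_dup() + vma_set_policy() pattern in every caller. */
        static int vma_dup_policy_sketch(struct vm_area_struct *src,
                                         struct vm_area_struct *dst)
        {
            struct mempolicy *pol = mpol_dup(vma_policy(src));

            if (IS_ERR(pol))
                return PTR_ERR(pol);
            dst->vm_policy = pol;
            return 0;
        }
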
  2. 04 September 2013, 2 commits
  3. 29 August 2013, 2 commits
    • s390/mm: implement software referenced bits · 0944fe3f
      Martin Schwidefsky committed
      The last remaining use for the storage key of the s390 architecture is
      reference counting.  The alternative is to make page table entries invalid
      while they are old.  On access, the fault handler marks the pte/pmd as
      young, which makes the pte/pmd valid if the access rights allow read access.
      The pte/pmd invalidations required for software-managed referenced bits cost
      a bit of performance; on the other hand, the RRBE/RRBM instructions to read
      and reset the referenced bits are quite expensive as well.
      Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      0944fe3f
    • memcg: check that kmem_cache has memcg_params before accessing it · 6f6b8951
      Andrey Vagin committed
      If the system had a few memory cgroups and all of them were destroyed,
      memcg_limited_groups_array_size has a non-zero value, but all new caches are
      created without memcg_params, because memcg_kmem_enabled() returns false.
      
      We try to enumerate child caches in a few places, and all of them are
      potentially dangerous.
      
      For example, my kernel is compiled with CONFIG_SLAB and it crashed when I
      tried to mount an NFS share after a few experiments with kmemcg.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff8118166a>] do_tune_cpucache+0x8a/0xd0
        PGD b942a067 PUD b999f067 PMD 0
        Oops: 0000 [#1] SMP
        Modules linked in: fscache(+) ip6table_filter ip6_tables iptable_filter ip_tables i2c_piix4 pcspkr virtio_net virtio_balloon i2c_core floppy
        CPU: 0 PID: 357 Comm: modprobe Not tainted 3.11.0-rc7+ #59
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff8800b9f98240 ti: ffff8800ba32e000 task.ti: ffff8800ba32e000
        RIP: 0010:[<ffffffff8118166a>]  [<ffffffff8118166a>] do_tune_cpucache+0x8a/0xd0
        RSP: 0018:ffff8800ba32fb70  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
        RDX: 0000000000000000 RSI: ffff8800b9f98910 RDI: 0000000000000246
        RBP: ffff8800ba32fba0 R08: 0000000000000002 R09: 0000000000000004
        R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000010
        R13: 0000000000000008 R14: 00000000000000d0 R15: ffff8800375d0200
        FS:  00007f55f1378740(0000) GS:ffff8800bfa00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00007f24feba57a0 CR3: 0000000037b51000 CR4: 00000000000006f0
        Call Trace:
          enable_cpucache+0x49/0x100
          setup_cpu_cache+0x215/0x280
          __kmem_cache_create+0x2fa/0x450
          kmem_cache_create_memcg+0x214/0x350
          kmem_cache_create+0x2b/0x30
          fscache_init+0x19b/0x230 [fscache]
          do_one_initcall+0xfa/0x1b0
          load_module+0x1c41/0x26d0
          SyS_finit_module+0x86/0xb0
          system_call_fastpath+0x16/0x1b
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f6b8951
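
      A hedged sketch of the class of fix: any walk over per-memcg child caches
      must first check that the root cache actually has memcg bookkeeping
      attached, because caches created while kmemcg is disabled never allocate
      memcg_params.  The iteration helper below is illustrative (and assumes a
      kernel configured with kmem accounting so the field exists):

        #include <linux/slab.h>

        static void for_each_child_cache_sketch(struct kmem_cache *s, int nr_ids,
                                                void (*visit)(struct kmem_cache *s, int idx))
        {
            int i;

            /* Created while memcg_kmem_enabled() was false: nothing to walk. */
            if (!s->memcg_params)
                return;

            for (i = 0; i < nr_ids; i++)
                visit(s, i);        /* visit() would look up the i-th child cache */
        }
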
  4. 28 August 2013, 1 commit
  5. 27 August 2013, 2 commits
  6. 25 August 2013, 1 commit
  7. 24 August 2013, 1 commit
  8. 20 August 2013, 2 commits
  9. 16 August 2013, 1 commit
    • Fix TLB gather virtual address range invalidation corner cases · 2b047252
      Linus Torvalds committed
      Ben Tebulin reported:
      
       "Since v3.7.2 on two independent machines a very specific Git
        repository fails in 9/10 cases on git-fsck due to an SHA1/memory
        failures.  This only occurs on a very specific repository and can be
        reproduced stably on two independent laptops.  Git mailing list ran
        out of ideas and for me this looks like some very exotic kernel issue"
      
      and bisected the failure to the backport of commit 53a59fc6 ("mm:
      limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").
      
      That commit itself is not actually buggy, but what it does is to make it
      much more likely to hit the partial TLB invalidation case, since it
      introduces a new case in tlb_next_batch() that previously only ever
      happened when running out of memory.
      
      The real bug is that the TLB gather virtual memory range setup is subtly
      buggered.  It was introduced in commit 597e1c35 ("mm/mmu_gather:
      enable tlb flush range in generic mmu_gather"), and the range handling
      was already fixed at least once in commit e6c495a9 ("mm: fix the TLB
      range flushed when __tlb_remove_page() runs out of slots"), but that fix
      was not complete.
      
      The problem with the TLB gather virtual address range is that it isn't
      set up by the initial tlb_gather_mmu() initialization (which didn't get
      the TLB range information), but it is set up ad-hoc later by the
      functions that actually flush the TLB.  And so any such case that forgot
      to update the TLB range entries would potentially miss TLB invalidates.
      
      Rather than try to figure out exactly which particular ad-hoc range
      setup was missing (I personally suspect it's the hugetlb case in
      zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
      did), this patch just gets rid of the problem at the source: make the
      TLB range information available to tlb_gather_mmu(), and initialize it
      when initializing all the other tlb gather fields.
      
      This makes the patch larger, but conceptually much simpler.  And the end
      result is much more understandable; even if you want to play games with
      partial ranges when invalidating the TLB contents in chunks, now the
      range information is always there, and anybody who doesn't want to
      bother with it won't introduce subtle bugs.
      
      Ben verified that this fixes his problem.
      Reported-bisected-and-tested-by: Ben Tebulin <tebulin@googlemail.com>
      Build-testing-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Build-testing-by: Richard Weinberger <richard.weinberger@gmail.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2b047252
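
      A hedged sketch of the interface change described above: the gather
      initializer now receives the full range and fills in the flush range fields
      itself, so no later flush path can forget them.  The struct is condensed and
      not the real mmu_gather; the "0..-1 means the whole address space"
      convention is an assumption carried over from the old full-mm teardown case:

        #include <linux/mm_types.h>

        struct mmu_gather_sketch {              /* condensed; not the real mmu_gather */
            struct mm_struct *mm;
            unsigned long start;                /* now set once, at gather time ... */
            unsigned long end;                  /* ... so no flush path can miss it */
            unsigned int fullmm : 1;
        };

        static void tlb_gather_mmu_sketch(struct mmu_gather_sketch *tlb,
                                          struct mm_struct *mm,
                                          unsigned long start, unsigned long end)
        {
            tlb->mm = mm;
            /* start == 0 && end == -1 is the "tear down the whole mm" case. */
            tlb->fullmm = !(start | (end + 1));
            tlb->start = start;
            tlb->end = end;
        }
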
  10. 14 August 2013, 3 commits
  11. 13 August 2013, 1 commit
  12. 09 August 2013, 6 commits
    • cgroup: make css_for_each_descendant() and friends include the origin css in the iteration · bd8815a6
      Tejun Heo committed
      Previously, all css descendant iterators didn't include the origin
      (root of subtree) css in the iteration.  The reasons were maintaining
      consistency with css_for_each_child() and that at the time of
      introduction more use cases needed skipping the origin anyway;
      however, given that css_is_descendant() considers self to be a
      descendant, omitting the origin css has become more confusing and
      looking at the accumulated use cases rather clearly indicates that
      including origin would result in simpler code overall.
      
      While this is a change which can easily lead to subtle bugs, cgroup
      API including the iterators has recently gone through major
      restructuring and no out-of-tree changes will be applicable without
      adjustments making this a relatively acceptable opportunity for this
      type of change.
      
      The conversions are mostly straight-forward.  If the iteration block
      had explicit origin handling before or after, it's moved inside the
      iteration.  If not, if (pos == origin) continue; is added.  Some
      conversions add extra reference get/put around origin handling by
      consolidating origin handling and the rest.  While the extra ref
      operations aren't strictly necessary, this shouldn't cause any
      noticeable difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Aristeu Rozanski <aris@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      bd8815a6
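
      A hedged example of the conversion pattern for a caller that still wants the
      old behaviour of skipping the origin; css_for_each_descendant_pre() is the
      iterator this series ends up with, while the walk body is illustrative:

        #include <linux/cgroup.h>
        #include <linux/rcupdate.h>

        static void visit_descendants_sketch(struct cgroup_subsys_state *root,
                                             void (*visit)(struct cgroup_subsys_state *css))
        {
            struct cgroup_subsys_state *pos;

            rcu_read_lock();
            css_for_each_descendant_pre(pos, root) {
                if (pos == root)        /* the origin is now part of the iteration */
                    continue;
                visit(pos);
            }
            rcu_read_unlock();
        }
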
    • cgroup: make cftype->[un]register_event() deal with cgroup_subsys_state instead of cgroup · 81eeaf04
      Tejun Heo committed
      cgroup is in the process of converting to css (cgroup_subsys_state)
      from cgroup as the principal subsystem interface handle.  This is
      mostly to prepare for the unified hierarchy support where css's will
      be created and destroyed dynamically but also helps cleaning up
      subsystem implementations as css is usually what they are interested
      in anyway.
      
      cftype->[un]register_event() is among the remaining couple of interfaces
      which still use struct cgroup.  Convert it to cgroup_subsys_state.  The
      conversion is mostly mechanical and removes the last users of
      mem_cgroup_from_cont() and cg_to_vmpressure(), so those helpers are removed
      as well.
      
      v2: indentation update as suggested by Li Zefan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      81eeaf04
    • cgroup: make task iterators deal with cgroup_subsys_state instead of cgroup · 72ec7029
      Tejun Heo committed
      cgroup is in the process of converting to css (cgroup_subsys_state)
      from cgroup as the principal subsystem interface handle.  This is
      mostly to prepare for the unified hierarchy support where css's will
      be created and destroyed dynamically but also helps cleaning up
      subsystem implementations as css is usually what they are interested
      in anyway.
      
      This patch converts task iterators to deal with css instead of cgroup.
      Note that under unified hierarchy, different sets of tasks will be
      considered belonging to a given cgroup depending on the subsystem in
      question, and making the iterators deal with css instead of cgroup
      provides them with enough information about the iteration.
      
      While at it, fix several function comment formats in cpuset.c.
      
      This patch doesn't introduce any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      72ec7029
    • cgroup: make cgroup_task_iter remember the cgroup being iterated · c59cd3d8
      Tejun Heo committed
      Currently all cgroup_task_iter functions require @cgrp to be passed in,
      which is superfluous and increases the chance of usage errors.  Make
      cgroup_task_iter remember the cgroup being iterated and drop the @cgrp
      argument from the next and end functions.
      
      This patch doesn't introduce any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      c59cd3d8
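
      A hedged sketch of the resulting API shape: the iterator caches the cgroup
      when iteration starts, so the next and end functions no longer take it.  The
      field and function names are condensed from the description, not the exact
      kernel declarations:

        #include <linux/cgroup.h>

        struct cgroup_task_iter_sketch {
            struct cgroup *cgrp;                /* remembered at start time */
            struct list_head *cset_link;        /* internal cursors (illustrative) */
            struct list_head *task;
        };

        /* Only the start function still needs @cgrp ... */
        void cgroup_task_iter_start_sketch(struct cgroup *cgrp,
                                           struct cgroup_task_iter_sketch *it);

        /* ... next and end read it back from the iterator instead. */
        struct task_struct *cgroup_task_iter_next_sketch(struct cgroup_task_iter_sketch *it);
        void cgroup_task_iter_end_sketch(struct cgroup_task_iter_sketch *it);
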
    • cgroup: rename cgroup_iter to cgroup_task_iter · 0942eeee
      Tejun Heo committed
      cgroup now has multiple iterators and it's quite confusing to have
      something which walks over tasks of a single cgroup named cgroup_iter.
      Let's rename it to cgroup_task_iter.
      
      While at it, reformat / update comments and replace the overview
      comment above the interface function decls with proper function
      comments.  Such overview can be useful but function comments should be
      more than enough here.
      
      This is pure rename and doesn't introduce any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      0942eeee
    • cgroup: make hierarchy iterators deal with cgroup_subsys_state instead of cgroup · 492eb21b
      Tejun Heo committed
      cgroup is currently in the process of transitioning to using css
      (cgroup_subsys_state) as the primary handle instead of cgroup in
      subsystem API.  For hierarchy iterators, this is beneficial because
      
      * In most cases, css is the only thing subsystems care about anyway.
      
      * On the planned unified hierarchy, iterations for different
        subsystems will need to skip over different subtrees of the
        hierarchy depending on which subsystems are enabled on each cgroup.
        Passing around css makes it unnecessary to explicitly specify the
        subsystem in question, as a css is the intersection between a cgroup and
        a subsystem.
      
      * For the planned unified hierarchy, css's would need to be created
        and destroyed dynamically independent from cgroup hierarchy.  Having
        cgroup core manage css iteration makes enforcing deref rules a lot
        easier.
      
      Most subsystem conversions are straight-forward.  Noteworthy changes
      are
      
      * blkio: cgroup_to_blkcg() is no longer used.  Removed.
      
      * freezer: cgroup_freezer() is no longer used.  Removed.
      
      * devices: cgroup_to_devcgroup() is no longer used.  Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Aristeu Rozanski <aris@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      492eb21b