1. 07 Jan, 2011 (3 commits)
    • fs: icache RCU free inodes · fa0d7e3d
      Committed by Nick Piggin
      RCU free the struct inode. This will allow:
      
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      
      The downside of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      
      In cases where inode lifetimes are longer (ie. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
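      
      For illustration only (not from the patch): a minimal userspace sketch of the same
      publish/read/defer-free pattern, written against liburcu.  The struct, the names and
      the build step (gcc rcu_demo.c -lurcu) are assumptions, not kernel code.
      
        /* rcu_demo.c: readers traverse a shared pointer lock-free; the updater
         * defers freeing the old object until pre-existing readers are done. */
        #include <urcu.h>      /* liburcu, default flavor; link with -lurcu */
        #include <stdio.h>
        #include <stdlib.h>
      
        struct obj { int value; };
      
        static struct obj *shared;
      
        int main(void)
        {
            rcu_register_thread();              /* register this thread with RCU */
      
            struct obj *a = malloc(sizeof(*a));
            a->value = 1;
            rcu_assign_pointer(shared, a);      /* publish the first object */
      
            rcu_read_lock();                    /* lock-free read-side section */
            printf("reader sees %d\n", rcu_dereference(shared)->value);
            rcu_read_unlock();
      
            struct obj *b = malloc(sizeof(*b));
            b->value = 2;
            rcu_assign_pointer(shared, b);      /* replace the shared object */
      
            synchronize_rcu();                  /* wait for pre-existing readers */
            free(a);                            /* only now is the old object freed */
      
            free(b);
            rcu_unregister_thread();
            return 0;
        }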
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      fa0d7e3d
    • fs: dcache remove dcache_lock · b5c84bf6
      Committed by Nick Piggin
      dcache_lock no longer protects anything. Remove it.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      b5c84bf6
    • kernel: kmem_ptr_validate considered harmful · ccd35fb9
      Committed by Nick Piggin
      This is a nasty and error-prone API. It is no longer used; remove it.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      ccd35fb9
  2. 04 Jan, 2011 (1 commit)
  3. 31 Dec, 2010 (1 commit)
  4. 27 Dec, 2010 (1 commit)
  5. 24 Dec, 2010 (2 commits)
  6. 23 Dec, 2010 (3 commits)
  7. 22 Dec, 2010 (1 commit)
  8. 18 Dec, 2010 (1 commit)
    • vmstat: Use per cpu atomics to avoid interrupt disable / enable · 7c839120
      Committed by Christoph Lameter
      Currently the operations to increment vm counters must disable interrupts
      in order to not mess up their housekeeping of counters.
      
      So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
      count on preemption being disabled, we still have some minor issues.
      The fetching of the counter thresholds is racy.
      A threshold from another cpu may be applied if we happen to be
      rescheduled on another cpu.  However, the following vmstat operation
      will then bring the counter back under the threshold limit.
      
      The operations for __xxx_zone_state are not changed since the caller
      has taken care of the synchronization needs (and therefore the cycle
      count is even less than the optimized version for the irq disable case
      provided here).
      
      The optimization using this_cpu_cmpxchg will only be used if the arch
      supports efficient this_cpu_ops (must have CONFIG_CMPXCHG_LOCAL set!)
      
      The use of this_cpu_cmpxchg reduces the cycle count for the counter
      operations by 80% (inc_zone_page_state goes from 170 cycles to 32).
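      
      As a rough userspace illustration of the retry pattern (not the kernel code; the
      threshold value and all names below are made up), the counter is updated with a
      compare-and-exchange loop instead of disabling interrupts:
      
        /* cmpxchg_demo.c: lock-free counter update with an overflow threshold. */
        #include <stdatomic.h>
        #include <stdio.h>
      
        #define THRESHOLD 32            /* made-up per-counter threshold */
      
        static _Atomic int local_diff;  /* stands in for the per-cpu diff counter */
        static long global_count;       /* stands in for the zone-wide counter */
      
        static void inc_counter(void)
        {
            int old, new;
      
            do {
                old = atomic_load_explicit(&local_diff, memory_order_relaxed);
                new = (old + 1 > THRESHOLD) ? 0 : old + 1;
            } while (!atomic_compare_exchange_weak(&local_diff, &old, new));
      
            if (new == 0)                   /* the local diff overflowed: fold it */
                global_count += old + 1;    /* single-threaded demo, so this is safe */
        }
      
        int main(void)
        {
            for (int i = 0; i < 100; i++)
                inc_counter();
            printf("global=%ld local=%d\n", global_count, atomic_load(&local_diff));
            return 0;
        }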
      Signed-off-by: Christoph Lameter <cl@linux.com>
      7c839120
  9. 17 Dec, 2010 (3 commits)
  10. 16 Dec, 2010 (1 commit)
    • install_special_mapping skips security_file_mmap check. · 462e635e
      Committed by Tavis Ormandy
      The install_special_mapping routine (used, for example, to setup the
      vdso) skips the security check before insert_vm_struct, allowing a local
      attacker to bypass the mmap_min_addr security restriction by limiting
      the available pages for special mappings.
      
      bprm_mm_init() also skips the check, and although I don't think this can
      be used to bypass any restrictions, I don't see any reason not to have
      the security check.
      
        $ uname -m
        x86_64
        $ cat /proc/sys/vm/mmap_min_addr
        65536
        $ cat install_special_mapping.s
        section .bss
            resb BSS_SIZE
        section .text
            global _start
            _start:
                mov     eax, __NR_pause
                int     0x80
        $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
        $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
        $ ./install_special_mapping &
        [1] 14303
        $ cat /proc/14303/maps
        0000f000-00010000 r-xp 00000000 00:00 0                                  [vdso]
        00010000-00011000 r-xp 00001000 00:19 2453665                            /home/taviso/install_special_mapping
        00011000-ffffe000 rwxp 00000000 00:00 0                                  [stack]
      
      It's worth noting that Red Hat are shipping with mmap_min_addr set to
      4096.
      Signed-off-by: Tavis Ormandy <taviso@google.com>
      Acked-by: Kees Cook <kees@ubuntu.com>
      Acked-by: Robert Swiecki <swiecki@google.com>
      [ Changed to not drop the error code - akpm ]
      Reviewed-by: James Morris <jmorris@namei.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      462e635e
  11. 15 Dec, 2010 (1 commit)
    • workqueue: convert cancel_rearming_delayed_work[queue]() users to cancel_delayed_work_sync() · afe2c511
      Committed by Tejun Heo
      cancel_rearming_delayed_work[queue]() has been superseded by
      cancel_delayed_work_sync() quite some time ago.  Convert all the
      in-kernel users.  The conversions are completely equivalent and
      trivial.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
      Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Jeff Garzik <jgarzik@pobox.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: netdev@vger.kernel.org
      Cc: Anton Vorontsov <cbou@mail.ru>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: xfs-masters@oss.sgi.com
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: netfilter-devel@vger.kernel.org
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: linux-nfs@vger.kernel.org
      afe2c511
  12. 07 Dec, 2010 (2 commits)
  13. 04 Dec, 2010 (2 commits)
    • slub: Fix a crash during slabinfo -v · 37d57443
      Committed by Tero Roponen
      Commit f7cb1933 ("SLUB: Pass active
      and inactive redzone flags instead of boolean to debug functions")
      missed two instances of check_object(). This caused a lot of warnings
      during 'slabinfo -v' finally leading to a crash:
      
        BUG ext4_xattr: Freepointer corrupt
        ...
        BUG buffer_head: Freepointer corrupt
        ...
        BUG ext4_alloc_context: Freepointer corrupt
        ...
        ...
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810a291f>] file_sb_list_del+0x1c/0x35
        PGD 79d78067 PUD 79e67067 PMD 0
        Oops: 0002 [#1] SMP
        last sysfs file: /sys/kernel/slab/:t-0000192/validate
      
      This patch fixes the problem by converting the two missed instances.
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Tero Roponen <tero.roponen@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      37d57443
    • slub: Fix a crash during slabinfo -v · 8165984a
      Committed by Tero Roponen
      Commit f7cb1933 ("SLUB: Pass active
      and inactive redzone flags instead of boolean to debug functions")
      missed two instances of check_object(). This caused a lot of warnings
      during 'slabinfo -v' finally leading to a crash:
      
        BUG ext4_xattr: Freepointer corrupt
        ...
        BUG buffer_head: Freepointer corrupt
        ...
        BUG ext4_alloc_context: Freepointer corrupt
        ...
        ...
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810a291f>] file_sb_list_del+0x1c/0x35
        PGD 79d78067 PUD 79e67067 PMD 0
        Oops: 0002 [#1] SMP
        last sysfs file: /sys/kernel/slab/:t-0000192/validate
      
      This patch fixes the problem by converting the two missed instances.
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Tero Roponen <tero.roponen@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      8165984a
  14. 03 Dec, 2010 (6 commits)
    • ksm: annotate ksm_thread_mutex is no deadlock source · a0b0f58c
      Committed by KOSAKI Motohiro
      commit 62b61f61 ("ksm: memory hotremove migration only") caused the
      following new lockdep warning.
      
        =======================================================
        [ INFO: possible circular locking dependency detected ]
        -------------------------------------------------------
        bash/1621 is trying to acquire lock:
         ((memory_chain).rwsem){.+.+.+}, at: [<ffffffff81079339>]
        __blocking_notifier_call_chain+0x69/0xc0
      
        but task is already holding lock:
         (ksm_thread_mutex){+.+.+.}, at: [<ffffffff8113a3aa>]
        ksm_memory_callback+0x3a/0xc0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (ksm_thread_mutex){+.+.+.}:
             [<ffffffff8108b70a>] lock_acquire+0xaa/0x140
             [<ffffffff81505d74>] __mutex_lock_common+0x44/0x3f0
             [<ffffffff81506228>] mutex_lock_nested+0x48/0x60
             [<ffffffff8113a3aa>] ksm_memory_callback+0x3a/0xc0
             [<ffffffff8150c21c>] notifier_call_chain+0x8c/0xe0
             [<ffffffff8107934e>] __blocking_notifier_call_chain+0x7e/0xc0
             [<ffffffff810793a6>] blocking_notifier_call_chain+0x16/0x20
             [<ffffffff813afbfb>] memory_notify+0x1b/0x20
             [<ffffffff81141b7c>] remove_memory+0x1cc/0x5f0
             [<ffffffff813af53d>] memory_block_change_state+0xfd/0x1a0
             [<ffffffff813afd62>] store_mem_state+0xe2/0xf0
             [<ffffffff813a0bb0>] sysdev_store+0x20/0x30
             [<ffffffff811bc116>] sysfs_write_file+0xe6/0x170
             [<ffffffff8114f398>] vfs_write+0xc8/0x190
             [<ffffffff8114fc14>] sys_write+0x54/0x90
             [<ffffffff810028b2>] system_call_fastpath+0x16/0x1b
      
        -> #0 ((memory_chain).rwsem){.+.+.+}:
             [<ffffffff8108b5ba>] __lock_acquire+0x155a/0x1600
             [<ffffffff8108b70a>] lock_acquire+0xaa/0x140
             [<ffffffff81506601>] down_read+0x51/0xa0
             [<ffffffff81079339>] __blocking_notifier_call_chain+0x69/0xc0
             [<ffffffff810793a6>] blocking_notifier_call_chain+0x16/0x20
             [<ffffffff813afbfb>] memory_notify+0x1b/0x20
             [<ffffffff81141f1e>] remove_memory+0x56e/0x5f0
             [<ffffffff813af53d>] memory_block_change_state+0xfd/0x1a0
             [<ffffffff813afd62>] store_mem_state+0xe2/0xf0
             [<ffffffff813a0bb0>] sysdev_store+0x20/0x30
             [<ffffffff811bc116>] sysfs_write_file+0xe6/0x170
             [<ffffffff8114f398>] vfs_write+0xc8/0x190
             [<ffffffff8114fc14>] sys_write+0x54/0x90
             [<ffffffff810028b2>] system_call_fastpath+0x16/0x1b
      
      But it's a false positive.  Both memory_chain.rwsem and ksm_thread_mutex
      have an outer lock (mem_hotplug_mutex).  So they cannot deadlock.
      
      Thus, this patch annotates that ksm_thread_mutex is not a deadlock source.
      
      [akpm@linux-foundation.org: update comment, from Hugh]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a0b0f58c
    • mem-hotplug: introduce {un}lock_memory_hotplug() · 20d6c96b
      Committed by KOSAKI Motohiro
      Presently hwpoison is using lock_system_sleep() to prevent a race with
      memory hotplug.  However lock_system_sleep() is a no-op if
      CONFIG_HIBERNATION=n.  Therefore we need a new lock.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20d6c96b
    • vmalloc: eagerly clear ptes on vunmap · 64141da5
      Committed by Jeremy Fitzhardinge
      On stock 2.6.37-rc4, running:
      
        # mount lilith:/export /mnt/lilith
        # find  /mnt/lilith/ -type f -print0 | xargs -0 file
      
      crashes the machine fairly quickly under Xen.  Often it results in oops
      messages, but the couple of times I tried just now, it just hung quietly
      and made Xen print some rude messages:
      
          (XEN) mm.c:2389:d80 Bad type (saw 7400000000000001 != exp
          3000000000000000) for mfn 1d7058 (pfn 18fa7)
          (XEN) mm.c:964:d80 Attempt to create linear p.t. with write perms
          (XEN) mm.c:2389:d80 Bad type (saw 7400000000000010 != exp
          1000000000000000) for mfn 1d2e04 (pfn 1d1fb)
          (XEN) mm.c:2965:d80 Error while pinning mfn 1d2e04
      
      Which means the domain tried to map a pagetable page RW, which would
      allow it to map arbitrary memory, so Xen stopped it.  This is because
      vm_unmap_ram() left some pages mapped in the vmalloc area after NFS had
      finished with them, and those pages got recycled as pagetable pages
      while still having these RW aliases.
      
      Removing those mappings immediately removes the Xen-visible aliases, and
      so it has no problem with those pages being reused as pagetable pages.
      Deferring the TLB flush doesn't upset Xen because it can flush the TLB
      itself as needed to maintain its invariants.
      
      When unmapping a region in the vmalloc space, clear the ptes
      immediately.  There's no point in deferring this because there's no
      amortization benefit.
      
      The TLBs are left dirty, and they are flushed lazily to amortize the
      cost of the IPIs.
      
      The specific motivation for this patch is an oops-causing regression
      since 2.6.36 when using NFS under Xen, triggered by the NFS client's use
      of vm_map_ram() introduced in 56e4ebf8 ("NFS: readdir with vmapped
      pages").  XFS also uses vm_map_ram() and could cause similar problems.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Bryan Schumaker <bjschuma@netapp.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64141da5
    • vmstat: fix dirty threshold ordering · e172662d
      Committed by Wu Fengguang
      The nr_dirty_[background_]threshold fields are misplaced before the
      numa_* fields, and users will read strange values.
      
      This is the right order (shown below).  Before the patch, nr_dirty_background_threshold
      would read as 0 (the value from numa_miss).
      
      	numa_hit 128501
      	numa_miss 0
      	numa_foreign 0
      	numa_interleave 7388
      	numa_local 128501
      	numa_other 0
      	nr_dirty_threshold 144291
      	nr_dirty_background_threshold 72145
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Michael Rubin <mrubin@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e172662d
    • mm/mempolicy.c: add rcu read lock to protect pid structure · 55cfaa3c
      Committed by Zeng Zhaoming
      find_task_by_vpid() should be protected by rcu_read_lock(), to prevent
      free_pid() from reclaiming the pid.
      Signed-off-by: Zeng Zhaoming <zengzm.kernel@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55cfaa3c
    • mm/hugetlb.c: avoid double unlock_page() in hugetlb_fault() · 1f64d69c
      Committed by Dean Nelson
      Have hugetlb_fault() call unlock_page(page) only if it had previously
      called lock_page(page).
      
      Setting CONFIG_DEBUG_VM=y and then running the libhugetlbfs test suite
      tripped VM_BUG_ON(!PageLocked(page)) in unlock_page(), which had been
      called by hugetlb_fault() when page == pagecache_page.  This patch
      remedies the problem.
      Signed-off-by: Dean Nelson <dnelson@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f64d69c
  15. 02 Dec, 2010 (1 commit)
    • Call the filesystem back whenever a page is removed from the page cache · 6072d13c
      Committed by Linus Torvalds
      NFS needs to be able to release objects that are stored in the page
      cache once the page itself is no longer visible from the page cache.
      
      This patch adds a callback to the address space operations that allows
      filesystems to perform page cleanups once the page has been removed
      from the page cache.
      
      Original patch by: Linus Torvalds <torvalds@linux-foundation.org>
      [trondmy: cover the cases of invalidate_inode_pages2() and
                truncate_inode_pages()]
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      6072d13c
  16. 29 Nov, 2010 (2 commits)
    • Kill off a bunch of warning: ‘inline’ is not at beginning of declaration · fa9f90be
      Committed by Jesper Juhl
      These warnings are spewed during a build of an 'allnoconfig' kernel
      (especially the ones from u64_stats_sync.h show up a lot) when building
      with -Wextra (which I often do).
      They are
        a) annoying
        b) easy to get rid of.
      This patch kills them off; a minimal reproduction follows the warning list below.
      
      include/linux/u64_stats_sync.h:70:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:77:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:84:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:96:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:115:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:127:1: warning: ‘inline’ is not at beginning of declaration
      kernel/time.c:241:1: warning: ‘inline’ is not at beginning of declaration
      kernel/time.c:257:1: warning: ‘inline’ is not at beginning of declaration
      kernel/perf_event.c:4513:1: warning: ‘inline’ is not at beginning of declaration
      mm/page_alloc.c:4012:1: warning: ‘inline’ is not at beginning of declaration
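      
      A minimal reproduction of the warning and of the fix (illustrative, not taken from
      the patch; the file name and flags are assumptions):
      
        /* warn_demo.c: build with  gcc -Wextra warn_demo.c  to see the warning. */
        #include <stdio.h>
      
        void inline bad_order(void)         /* 'inline' after the type: warns */
        {
            puts("bad order");
        }
      
        static inline void good_order(void) /* 'inline' at the beginning: no warning */
        {
            puts("good order");
        }
      
        int main(void)
        {
            bad_order();
            good_order();
            return 0;
        }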
      Signed-off-by: Jesper Juhl <jj@chaosbits.net>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      fa9f90be
    • tracing/slab: Move kmalloc tracepoint out of inline code · 85beb586
      Committed by Steven Rostedt
      The tracepoint for kmalloc is in the slab inlined code which causes
      every instance of kmalloc to have the tracepoint.
      
      This patch moves the tracepoint out of the inline code to the
      slab C file, which removes a large number of inlined trace
      points.
      
        objdump -dr vmlinux.slab| grep 'jmpq.*<trace_kmalloc' |wc -l
      213
        objdump -dr vmlinux.slab.patched| grep 'jmpq.*<trace_kmalloc' |wc -l
      1
      
      This also has a nice impact on size.
      
         text	   data	    bss	    dec	    hex	filename
      7023060	2121564	2482432	11627056	 b16a30	vmlinux.slab
      6970579	2109772	2482432	11562783	 b06f1f	vmlinux.slab.patched
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      85beb586
  17. 25 Nov, 2010 (6 commits)
    • mm: remove call to find_vma in pagewalk for non-hugetlbfs · 5f0af70a
      Committed by David Sterba
      Commit d33b9f45 ("mm: hugetlb: fix hugepage memory leak in
      walk_page_range()") introduces a check if a vma is a hugetlbfs one and
      later in 5dc37642 ("mm hugetlb: add hugepage support to pagemap") it is
      moved under #ifdef CONFIG_HUGETLB_PAGE but a needless find_vma call is
      left behind and its result is not used anywhere else in the function.
      
      The side-effect of caching vma for @addr inside walk->mm is neither
      utilized in walk_page_range() nor in called functions.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f0af70a
    • mm/page_alloc.c: fix build_all_zonelists() where percpu_alloc() is wrongly called under stop_machine_run() · e9959f0f
      Committed by KAMEZAWA Hiroyuki
      
      During memory hotplug, build_all_zonelists() may be called under
      stop_machine_run().  In this function, setup_zone_pageset() is called.
      But this is a bug because it will do page allocation under stop_machine_run().
      
      Here is a report from Alok Kataria.
      
        BUG: sleeping function called from invalid context at kernel/mutex.c:94
        in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
        Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
        Call Trace:
         [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
         [<ffffffff81468245>] mutex_lock+0x24/0x50
         [<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
         [<ffffffff81048888>] ? load_balance+0xbe/0x60e
         [<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
         [<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
         [<ffffffff8110f237>] __alloc_percpu+0x10/0x12
         [<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
         [<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
         [<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
         [<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
         [<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
         [<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
         [<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
         [<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
         [<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
         [<ffffffff81065f29>] kthread+0x7f/0x87
         [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
         [<ffffffff81065eaa>] ? kthread+0x0/0x87
         [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
        Built 5 zonelists in Node order, mobility grouping on.  Total pages: 289456
        Policy zone: Normal
      
      This patch tries to fix the issue by moving setup_zone_pageset() out of
      stop_machine_run().  It obviously does not need to be called under
      stop_machine_run().
      
      [akpm@linux-foundation.org: remove unneeded local]
      Reported-by: Alok Kataria <akataria@vmware.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Petr Vandrovec <petr@vmware.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9959f0f
    • cgroups: make swap accounting default behavior configurable · a42c390c
      Committed by Michal Hocko
      Swap accounting can be configured by the CONFIG_CGROUP_MEM_RES_CTLR_SWAP
      configuration option, and it is then turned on by default.  There is a boot
      option (noswapaccount) which can disable this feature.

      This makes it hard for distributors to enable the configuration option, as
      this feature leads to higher memory consumption, and that is a no-go for a
      general purpose distribution kernel.  On the other hand, swap accounting
      may be very useful for some workloads.
      
      This patch adds a new configuration option which controls the default
      behavior (CGROUP_MEM_RES_CTLR_SWAP_ENABLED).  If the option is selected
      then the feature is turned on by default.
      
      It also adds a new boot parameter swapaccount[=1|0] which enhances the
      original noswapaccount parameter semantic by means of enable/disable logic
      (defaults to 1 if no value is provided to be still consistent with
      noswapaccount).
      
      The default behavior is unchanged (if CONFIG_CGROUP_MEM_RES_CTLR_SWAP is
      enabled then CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED is enabled as well).
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a42c390c
    • memcg: avoid deadlock between move charge and try_charge() · b1dd693e
      Committed by Daisuke Nishimura
      __mem_cgroup_try_charge() can be called under down_write(&mmap_sem) (e.g.
      mlock does it). This means it can cause a deadlock if it races with move charge:
      
      Ex.1)
                      move charge             |        try charge
        --------------------------------------+------------------------------
          mem_cgroup_can_attach()             |  down_write(&mmap_sem)
            mc.moving_task = current          |    ..
            mem_cgroup_precharge_mc()         |  __mem_cgroup_try_charge()
              mem_cgroup_count_precharge()    |    prepare_to_wait()
                down_read(&mmap_sem)          |    if (mc.moving_task)
                -> cannot aquire the lock     |    -> true
                                              |      schedule()
      
      Ex.2)
                      move charge             |        try charge
        --------------------------------------+------------------------------
          mem_cgroup_can_attach()             |
            mc.moving_task = current          |
            mem_cgroup_precharge_mc()         |
              mem_cgroup_count_precharge()    |
                down_read(&mmap_sem)          |
                ..                            |
                up_read(&mmap_sem)            |
                                              |  down_write(&mmap_sem)
          mem_cgroup_move_task()              |    ..
            mem_cgroup_move_charge()          |  __mem_cgroup_try_charge()
              down_read(&mmap_sem)            |    prepare_to_wait()
              -> cannot aquire the lock       |    if (mc.moving_task)
                                              |    -> true
                                              |      schedule()
      
      To avoid this deadlock, we do all the move charge work (both can_attach() and
      attach()) under one mmap_sem section.
      And after this patch, we set/clear mc.moving_task outside mc.lock, because we
      use the lock only to check mc.from/to.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1dd693e
    • memcg: fix false positive VM_BUG on non-SMP · 112bc2e1
      Committed by Kirill A. Shutemov
      Fix this:
      
        kernel BUG at mm/memcontrol.c:2155!
        invalid opcode: 0000 [#1]
        last sysfs file:
      
        Pid: 18, comm: sh Not tainted 2.6.37-rc3 #3 /Bochs
        EIP: 0060:[<c10731b2>] EFLAGS: 00000246 CPU: 0
        EIP is at mem_cgroup_move_account+0xe2/0xf0
        EAX: 00000004 EBX: c6f931d4 ECX: c681c300 EDX: c681c000
        ESI: c681c300 EDI: ffffffea EBP: c681c000 ESP: c46f3e30
         DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
        Process sh (pid: 18, ti=c46f2000 task=c6826e60 task.ti=c46f2000)
        Stack:
         00000155 c681c000 0805f000 c46ee180 c46f3e5c c7058820 c1074d37 00000000
         08060000 c46db9a0 c46ec080 c7058820 0805f000 08060000 c46f3e98 c1074c50
         c106c75e c46f3e98 c46ec080 08060000 0805ffff c46db9a0 c46f3e98 c46e0340
        Call Trace:
         [<c1074d37>] ? mem_cgroup_move_charge_pte_range+0xe7/0x130
         [<c1074c50>] ? mem_cgroup_move_charge_pte_range+0x0/0x130
         [<c106c75e>] ? walk_page_range+0xee/0x1d0
         [<c10725d6>] ? mem_cgroup_move_task+0x66/0x90
         [<c1074c50>] ? mem_cgroup_move_charge_pte_range+0x0/0x130
         [<c1072570>] ? mem_cgroup_move_task+0x0/0x90
         [<c1042616>] ? cgroup_attach_task+0x136/0x200
         [<c1042878>] ? cgroup_tasks_write+0x48/0xc0
         [<c1041e9e>] ? cgroup_file_write+0xde/0x220
         [<c101398d>] ? do_page_fault+0x17d/0x3f0
         [<c108a79d>] ? alloc_fd+0x2d/0xd0
         [<c1041dc0>] ? cgroup_file_write+0x0/0x220
         [<c1077ba2>] ? vfs_write+0x92/0xc0
         [<c1077c81>] ? sys_write+0x41/0x70
         [<c1140e3d>] ? syscall_call+0x7/0xb
        Code: 03 00 74 09 8b 44 24 04 e8 1c f1 ff ff 89 73 04 8d 86 b0 00 00 00 b9 01 00 00 00 89 da 31 ff e8 65 f5 ff ff e9 4d ff ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 90 8d b4 26 00 00 00 00 83 ec 10 8b 0d f4 e3
        EIP: [<c10731b2>] mem_cgroup_move_account+0xe2/0xf0 SS:ESP 0068:c46f3e30
        ---[ end trace 7daa1582159b6532 ]---
      
      lock_page_cgroup and unlock_page_cgroup are implemented using
      bit_spinlock.  bit_spinlock doesn't touch the bit if we are on a non-SMP
      machine, so we can't use the bit to check whether the lock was taken.
      
      Let's introduce is_page_cgroup_locked based on bit_spin_is_locked instead
      of PageCgroupLocked to fix it.
      
      [akpm@linux-foundation.org: s/is_page_cgroup_locked/page_is_cgroup_locked/]
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      112bc2e1
    • nommu: yield CPU while disposing VM · 04c34961
      Committed by Steven J. Magnani
      Depending on processor speed, page size, and the amount of memory a
      process is allowed to amass, cleanup of a large VM may freeze the system
      for many seconds.  This can result in a watchdog timeout.
      
      Make sure other tasks receive some service when cleaning up large VMs.
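      
      A userspace analogue of the idea (illustrative only; the kernel patch relies on
      cond_resched(), and the loop below is invented): yield the CPU periodically while
      tearing down a large amount of state, so other tasks still get service.
      
        /* yield_demo.c: free a lot of memory, yielding the CPU every so often. */
        #include <sched.h>
        #include <stdio.h>
        #include <stdlib.h>
      
        #define NR_CHUNKS (1 << 16)
      
        int main(void)
        {
            char **chunks = malloc(NR_CHUNKS * sizeof(*chunks));
            if (!chunks)
                return 1;
      
            for (int i = 0; i < NR_CHUNKS; i++)
                chunks[i] = malloc(4096);
      
            for (int i = 0; i < NR_CHUNKS; i++) {
                free(chunks[i]);            /* the long "disposal" work */
                if ((i & 1023) == 0)
                    sched_yield();          /* analogous to cond_resched() */
            }
      
            free(chunks);
            puts("done");
            return 0;
        }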
      Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04c34961
  18. 14 Nov, 2010 (1 commit)
  19. 13 Nov, 2010 (1 commit)
    • block: make blkdev_get/put() handle exclusive access · e525fd89
      Committed by Tejun Heo
      Over time, block layer has accumulated a set of APIs dealing with bdev
      open, close, claim and release.
      
      * blkdev_get/put() are the primary open and close functions.
      
      * bd_claim/release() deal with exclusive open.
      
      * open/close_bdev_exclusive() are combination of open and claim and
        the other way around, respectively.
      
      * bd_link/unlink_disk_holder() to create and remove holder/slave
        symlinks.
      
      * open_by_devnum() wraps bdget() + blkdev_get().
      
      The interface is a bit confusing and the decoupling of open and claim
      makes it impossible to properly guarantee exclusive access, as an
      in-kernel open + claim sequence can disturb an existing exclusive
      open even before the block layer knows whether the current open is for
      another exclusive access.  Reorganize the interface such that:
      
      * blkdev_get() is extended to include exclusive access management.
        @holder argument is added and, if @FMODE_EXCL is specified, it will
        gain exclusive access atomically w.r.t. other exclusive accesses.
      
      * blkdev_put() is similarly extended.  It now takes @mode argument and
        if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
        the last exclusive claim is released, the holder/slave symlinks are
        removed automatically.
      
      * bd_claim/release() and close_bdev_exclusive() are no longer
        necessary and either made static or removed.
      
      * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
        is no longer necessary and removed.
      
      * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
        and blkdev_get().  It also has an unexpected extra bdev_read_only()
        test which probably should be moved into blkdev_get().
      
      * open_by_devnum() is modified to take @holder argument and pass it to
        blkdev_get().
      
      Most of bdev open/close operations are unified into blkdev_get/put()
      and most exclusive accesses are tested atomically at the open time (as
      it should).  This cleans up code and removes some, both valid and
      invalid, but unnecessary all the same, corner cases.
      
      open_bdev_exclusive() and open_by_devnum() can use further cleanup -
      rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
      special features.  Well, let's leave them for another day.
      
      Most conversions are straightforward.  The drbd conversion is a bit more
      involved as there was some reordering, but the logic should stay the
      same.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Brown <neilb@suse.de>
      Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: dm-devel@redhat.com
      Cc: drbd-dev@lists.linbit.com
      Cc: Leo Chen <leochen@broadcom.com>
      Cc: Scott Branden <sbranden@broadcom.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      e525fd89
  20. 12 Nov, 2010 (1 commit)
    • radix-tree: fix RCU bug · 27d20fdd
      Committed by Nick Piggin
      Salman Qazi describes the following radix-tree bug:
      
      In the following case, we can get a deadlock:
      
      0.  The radix tree contains two items, one has the index 0.
      1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
      2.  The reader acquires slot(s) for item(s) including the index 0 item.
      3.  The non-zero index item is deleted, and as a consequence the other item is
          moved to the root of the tree. The place where it used to be is queued for
          deletion after the readers finish.
      3b. The zero item is deleted, removing it from the direct slot, it remains in
          the rcu-delayed indirect node.
      4.  The reader looks at the index 0 slot, and finds that the page has 0 ref
          count
      5.  The reader looks at it again, hoping that the item will either be freed or
          the ref count will increase. This never happens, as the slot it is looking
          at will never be updated. Also, this slot can never be reclaimed because
          the reader is holding rcu_read_lock and is in an infinite loop.
      
      The fix is to re-use the same "indirect" pointer case that requires a slot
      lookup retry into a general "retry the lookup" bit.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Reported-by: Salman Qazi <sqazi@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27d20fdd