1. 14 1月, 2011 8 次提交
  2. 07 1月, 2011 3 次提交
    • N
      fs: icache RCU free inodes · fa0d7e3d
      Nick Piggin 提交于
      RCU free the struct inode. This will allow:
      
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      
      The downsides of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      
      In cases where inode lifetimes are longer (ie. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fa0d7e3d
    • N
      fs: dcache remove dcache_lock · b5c84bf6
      Nick Piggin 提交于
      dcache_lock no longer protects anything. remove it.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b5c84bf6
    • N
      kernel: kmem_ptr_validate considered harmful · ccd35fb9
      Nick Piggin 提交于
      This is a nasty and error prone API. It is no longer used, remove it.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      ccd35fb9
  3. 04 1月, 2011 1 次提交
  4. 31 12月, 2010 1 次提交
  5. 27 12月, 2010 1 次提交
  6. 24 12月, 2010 2 次提交
  7. 23 12月, 2010 3 次提交
  8. 22 12月, 2010 1 次提交
  9. 18 12月, 2010 1 次提交
    • C
      vmstat: User per cpu atomics to avoid interrupt disable / enable · 7c839120
      Christoph Lameter 提交于
      Currently the operations to increment vm counters must disable interrupts
      in order to not mess up their housekeeping of counters.
      
      So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
      count on preremption being disabled we still have some minor issues.
      The fetching of the counter thresholds is racy.
      A threshold from another cpu may be applied if we happen to be
      rescheduled on another cpu.  However, the following vmstat operation
      will then bring the counter again under the threshold limit.
      
      The operations for __xxx_zone_state are not changed since the caller
      has taken care of the synchronization needs (and therefore the cycle
      count is even less than the optimized version for the irq disable case
      provided here).
      
      The optimization using this_cpu_cmpxchg will only be used if the arch
      supports efficient this_cpu_ops (must have CONFIG_CMPXCHG_LOCAL set!)
      
      The use of this_cpu_cmpxchg reduces the cycle count for the counter
      operations by %80 (inc_zone_page_state goes from 170 cycles to 32).
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      7c839120
  10. 17 12月, 2010 3 次提交
  11. 16 12月, 2010 1 次提交
    • T
      install_special_mapping skips security_file_mmap check. · 462e635e
      Tavis Ormandy 提交于
      The install_special_mapping routine (used, for example, to setup the
      vdso) skips the security check before insert_vm_struct, allowing a local
      attacker to bypass the mmap_min_addr security restriction by limiting
      the available pages for special mappings.
      
      bprm_mm_init() also skips the check, and although I don't think this can
      be used to bypass any restrictions, I don't see any reason not to have
      the security check.
      
        $ uname -m
        x86_64
        $ cat /proc/sys/vm/mmap_min_addr
        65536
        $ cat install_special_mapping.s
        section .bss
            resb BSS_SIZE
        section .text
            global _start
            _start:
                mov     eax, __NR_pause
                int     0x80
        $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
        $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
        $ ./install_special_mapping &
        [1] 14303
        $ cat /proc/14303/maps
        0000f000-00010000 r-xp 00000000 00:00 0                                  [vdso]
        00010000-00011000 r-xp 00001000 00:19 2453665                            /home/taviso/install_special_mapping
        00011000-ffffe000 rwxp 00000000 00:00 0                                  [stack]
      
      It's worth noting that Red Hat are shipping with mmap_min_addr set to
      4096.
      Signed-off-by: NTavis Ormandy <taviso@google.com>
      Acked-by: NKees Cook <kees@ubuntu.com>
      Acked-by: NRobert Swiecki <swiecki@google.com>
      [ Changed to not drop the error code - akpm ]
      Reviewed-by: NJames Morris <jmorris@namei.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      462e635e
  12. 15 12月, 2010 1 次提交
    • T
      workqueue: convert cancel_rearming_delayed_work[queue]() users to cancel_delayed_work_sync() · afe2c511
      Tejun Heo 提交于
      cancel_rearming_delayed_work[queue]() has been superceded by
      cancel_delayed_work_sync() quite some time ago.  Convert all the
      in-kernel users.  The conversions are completely equivalent and
      trivial.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Acked-by: NGreg Kroah-Hartman <gregkh@suse.de>
      Acked-by: NEvgeniy Polyakov <zbr@ioremap.net>
      Cc: Jeff Garzik <jgarzik@pobox.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: netdev@vger.kernel.org
      Cc: Anton Vorontsov <cbou@mail.ru>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: xfs-masters@oss.sgi.com
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: netfilter-devel@vger.kernel.org
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: linux-nfs@vger.kernel.org
      afe2c511
  13. 07 12月, 2010 2 次提交
  14. 04 12月, 2010 2 次提交
    • T
      slub: Fix a crash during slabinfo -v · 37d57443
      Tero Roponen 提交于
      Commit f7cb1933 ("SLUB: Pass active
      and inactive redzone flags instead of boolean to debug functions")
      missed two instances of check_object(). This caused a lot of warnings
      during 'slabinfo -v' finally leading to a crash:
      
        BUG ext4_xattr: Freepointer corrupt
        ...
        BUG buffer_head: Freepointer corrupt
        ...
        BUG ext4_alloc_context: Freepointer corrupt
        ...
        ...
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810a291f>] file_sb_list_del+0x1c/0x35
        PGD 79d78067 PUD 79e67067 PMD 0
        Oops: 0002 [#1] SMP
        last sysfs file: /sys/kernel/slab/:t-0000192/validate
      
      This patch fixes the problem by converting the two missed instances.
      Acked-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NTero Roponen <tero.roponen@gmail.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      37d57443
    • T
      slub: Fix a crash during slabinfo -v · 8165984a
      Tero Roponen 提交于
      Commit f7cb1933 ("SLUB: Pass active
      and inactive redzone flags instead of boolean to debug functions")
      missed two instances of check_object(). This caused a lot of warnings
      during 'slabinfo -v' finally leading to a crash:
      
        BUG ext4_xattr: Freepointer corrupt
        ...
        BUG buffer_head: Freepointer corrupt
        ...
        BUG ext4_alloc_context: Freepointer corrupt
        ...
        ...
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810a291f>] file_sb_list_del+0x1c/0x35
        PGD 79d78067 PUD 79e67067 PMD 0
        Oops: 0002 [#1] SMP
        last sysfs file: /sys/kernel/slab/:t-0000192/validate
      
      This patch fixes the problem by converting the two missed instances.
      Acked-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NTero Roponen <tero.roponen@gmail.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      8165984a
  15. 03 12月, 2010 6 次提交
    • K
      ksm: annotate ksm_thread_mutex is no deadlock source · a0b0f58c
      KOSAKI Motohiro 提交于
      commit 62b61f61 ("ksm: memory hotremove migration only") caused the
      following new lockdep warning.
      
        =======================================================
        [ INFO: possible circular locking dependency detected ]
        -------------------------------------------------------
        bash/1621 is trying to acquire lock:
         ((memory_chain).rwsem){.+.+.+}, at: [<ffffffff81079339>]
        __blocking_notifier_call_chain+0x69/0xc0
      
        but task is already holding lock:
         (ksm_thread_mutex){+.+.+.}, at: [<ffffffff8113a3aa>]
        ksm_memory_callback+0x3a/0xc0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (ksm_thread_mutex){+.+.+.}:
             [<ffffffff8108b70a>] lock_acquire+0xaa/0x140
             [<ffffffff81505d74>] __mutex_lock_common+0x44/0x3f0
             [<ffffffff81506228>] mutex_lock_nested+0x48/0x60
             [<ffffffff8113a3aa>] ksm_memory_callback+0x3a/0xc0
             [<ffffffff8150c21c>] notifier_call_chain+0x8c/0xe0
             [<ffffffff8107934e>] __blocking_notifier_call_chain+0x7e/0xc0
             [<ffffffff810793a6>] blocking_notifier_call_chain+0x16/0x20
             [<ffffffff813afbfb>] memory_notify+0x1b/0x20
             [<ffffffff81141b7c>] remove_memory+0x1cc/0x5f0
             [<ffffffff813af53d>] memory_block_change_state+0xfd/0x1a0
             [<ffffffff813afd62>] store_mem_state+0xe2/0xf0
             [<ffffffff813a0bb0>] sysdev_store+0x20/0x30
             [<ffffffff811bc116>] sysfs_write_file+0xe6/0x170
             [<ffffffff8114f398>] vfs_write+0xc8/0x190
             [<ffffffff8114fc14>] sys_write+0x54/0x90
             [<ffffffff810028b2>] system_call_fastpath+0x16/0x1b
      
        -> #0 ((memory_chain).rwsem){.+.+.+}:
             [<ffffffff8108b5ba>] __lock_acquire+0x155a/0x1600
             [<ffffffff8108b70a>] lock_acquire+0xaa/0x140
             [<ffffffff81506601>] down_read+0x51/0xa0
             [<ffffffff81079339>] __blocking_notifier_call_chain+0x69/0xc0
             [<ffffffff810793a6>] blocking_notifier_call_chain+0x16/0x20
             [<ffffffff813afbfb>] memory_notify+0x1b/0x20
             [<ffffffff81141f1e>] remove_memory+0x56e/0x5f0
             [<ffffffff813af53d>] memory_block_change_state+0xfd/0x1a0
             [<ffffffff813afd62>] store_mem_state+0xe2/0xf0
             [<ffffffff813a0bb0>] sysdev_store+0x20/0x30
             [<ffffffff811bc116>] sysfs_write_file+0xe6/0x170
             [<ffffffff8114f398>] vfs_write+0xc8/0x190
             [<ffffffff8114fc14>] sys_write+0x54/0x90
             [<ffffffff810028b2>] system_call_fastpath+0x16/0x1b
      
      But it's a false positive.  Both memory_chain.rwsem and ksm_thread_mutex
      have an outer lock (mem_hotplug_mutex).  So they cannot deadlock.
      
      Thus, This patch annotate ksm_thread_mutex is not deadlock source.
      
      [akpm@linux-foundation.org: update comment, from Hugh]
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0b0f58c
    • K
      mem-hotplug: introduce {un}lock_memory_hotplug() · 20d6c96b
      KOSAKI Motohiro 提交于
      Presently hwpoison is using lock_system_sleep() to prevent a race with
      memory hotplug.  However lock_system_sleep() is a no-op if
      CONFIG_HIBERNATION=n.  Therefore we need a new lock.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Suggested-by: NHugh Dickins <hughd@google.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      20d6c96b
    • J
      vmalloc: eagerly clear ptes on vunmap · 64141da5
      Jeremy Fitzhardinge 提交于
      On stock 2.6.37-rc4, running:
      
        # mount lilith:/export /mnt/lilith
        # find  /mnt/lilith/ -type f -print0 | xargs -0 file
      
      crashes the machine fairly quickly under Xen.  Often it results in oops
      messages, but the couple of times I tried just now, it just hung quietly
      and made Xen print some rude messages:
      
          (XEN) mm.c:2389:d80 Bad type (saw 7400000000000001 != exp
          3000000000000000) for mfn 1d7058 (pfn 18fa7)
          (XEN) mm.c:964:d80 Attempt to create linear p.t. with write perms
          (XEN) mm.c:2389:d80 Bad type (saw 7400000000000010 != exp
          1000000000000000) for mfn 1d2e04 (pfn 1d1fb)
          (XEN) mm.c:2965:d80 Error while pinning mfn 1d2e04
      
      Which means the domain tried to map a pagetable page RW, which would
      allow it to map arbitrary memory, so Xen stopped it.  This is because
      vm_unmap_ram() left some pages mapped in the vmalloc area after NFS had
      finished with them, and those pages got recycled as pagetable pages
      while still having these RW aliases.
      
      Removing those mappings immediately removes the Xen-visible aliases, and
      so it has no problem with those pages being reused as pagetable pages.
      Deferring the TLB flush doesn't upset Xen because it can flush the TLB
      itself as needed to maintain its invariants.
      
      When unmapping a region in the vmalloc space, clear the ptes
      immediately.  There's no point in deferring this because there's no
      amortization benefit.
      
      The TLBs are left dirty, and they are flushed lazily to amortize the
      cost of the IPIs.
      
      This specific motivation for this patch is an oops-causing regression
      since 2.6.36 when using NFS under Xen, triggered by the NFS client's use
      of vm_map_ram() introduced in 56e4ebf8 ("NFS: readdir with vmapped
      pages") .  XFS also uses vm_map_ram() and could cause similar problems.
      Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Bryan Schumaker <bjschuma@netapp.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64141da5
    • W
      vmstat: fix dirty threshold ordering · e172662d
      Wu Fengguang 提交于
      The nr_dirty_[background_]threshold fields are misplaced before the
      numa_* fields, and users will read strange values.
      
      This is the right order.  Before patch, nr_dirty_background_threshold
      will read as 0 (the value from numa_miss).
      
      	numa_hit 128501
      	numa_miss 0
      	numa_foreign 0
      	numa_interleave 7388
      	numa_local 128501
      	numa_other 0
      	nr_dirty_threshold 144291
      	nr_dirty_background_threshold 72145
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Michael Rubin <mrubin@google.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e172662d
    • Z
      mm/mempolicy.c: add rcu read lock to protect pid structure · 55cfaa3c
      Zeng Zhaoming 提交于
      find_task_by_vpid() should be protected by rcu_read_lock(), to prevent
      free_pid() reclaiming pid.
      Signed-off-by: NZeng Zhaoming <zengzm.kernel@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      55cfaa3c
    • D
      mm/hugetlb.c: avoid double unlock_page() in hugetlb_fault() · 1f64d69c
      Dean Nelson 提交于
      Have hugetlb_fault() call unlock_page(page) only if it had previously
      called lock_page(page).
      
      Setting CONFIG_DEBUG_VM=y and then running the libhugetlbfs test suite,
      resulted in the tripping of VM_BUG_ON(!PageLocked(page)) in
      unlock_page() having been called by hugetlb_fault() when page ==
      pagecache_page.  This patch remedied the problem.
      Signed-off-by: NDean Nelson <dnelson@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f64d69c
  16. 02 12月, 2010 1 次提交
    • L
      Call the filesystem back whenever a page is removed from the page cache · 6072d13c
      Linus Torvalds 提交于
      NFS needs to be able to release objects that are stored in the page
      cache once the page itself is no longer visible from the page cache.
      
      This patch adds a callback to the address space operations that allows
      filesystems to perform page cleanups once the page has been removed
      from the page cache.
      
      Original patch by: Linus Torvalds <torvalds@linux-foundation.org>
      [trondmy: cover the cases of invalidate_inode_pages2() and
                truncate_inode_pages()]
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      6072d13c
  17. 29 11月, 2010 2 次提交
    • J
      Kill off a bunch of warning: ‘inline’ is not at beginning of declaration · fa9f90be
      Jesper Juhl 提交于
      These warnings are spewed during a build of a 'allnoconfig' kernel
      (especially the ones from u64_stats_sync.h show up a lot) when building
      with -Wextra (which I often do)..
      They are
        a) annoying
        b) easy to get rid of.
      This patch kills them off.
      
      include/linux/u64_stats_sync.h:70:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:77:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:84:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:96:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:115:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:127:1: warning: ‘inline’ is not at beginning of declaration
      kernel/time.c:241:1: warning: ‘inline’ is not at beginning of declaration
      kernel/time.c:257:1: warning: ‘inline’ is not at beginning of declaration
      kernel/perf_event.c:4513:1: warning: ‘inline’ is not at beginning of declaration
      mm/page_alloc.c:4012:1: warning: ‘inline’ is not at beginning of declaration
      Signed-off-by: NJesper Juhl <jj@chaosbits.net>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      fa9f90be
    • S
      tracing/slab: Move kmalloc tracepoint out of inline code · 85beb586
      Steven Rostedt 提交于
      The tracepoint for kmalloc is in the slab inlined code which causes
      every instance of kmalloc to have the tracepoint.
      
      This patch moves the tracepoint out of the inline code to the
      slab C file, which removes a large number of inlined trace
      points.
      
        objdump -dr vmlinux.slab| grep 'jmpq.*<trace_kmalloc' |wc -l
      213
        objdump -dr vmlinux.slab.patched| grep 'jmpq.*<trace_kmalloc' |wc -l
      1
      
      This also has a nice impact on size.
      
         text	   data	    bss	    dec	    hex	filename
      7023060	2121564	2482432	11627056	 b16a30	vmlinux.slab
      6970579	2109772	2482432	11562783	 b06f1f	vmlinux.slab.patched
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      85beb586
  18. 25 11月, 2010 1 次提交