1. 14 Jan 2011 (3 commits)
    • mm/page-writeback.c: fix __set_page_dirty_no_writeback() return value · c3f0da63
      Committed by Bob Liu
      __set_page_dirty_no_writeback() should return true if it actually
      transitioned the page from a clean to a dirty state, although nobody
      appears to use its return value at present.
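
      A minimal sketch of the fixed helper, assuming the usual PageDirty()/TestSetPageDirty() page-flag primitives:

        static int __set_page_dirty_no_writeback(struct page *page)
        {
                if (!PageDirty(page))
                        /* true only for a real clean -> dirty transition */
                        return !TestSetPageDirty(page);
                return 0;
        }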
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmstat: use a single setter function and callback for adjusting percpu thresholds · b44129b3
      Committed by Mel Gorman
      reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist
      to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid
      errors due to counter drift.  The functions duplicate some code so this
      patch replaces them with a single set_pgdat_percpu_threshold() that takes
      a callback function to calculate the desired threshold as a parameter.
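
      A hedged sketch of the consolidated interface and its two call sites (function and callback names follow this patch):

        void set_pgdat_percpu_threshold(pg_data_t *pgdat,
                                        int (*calculate_pressure)(struct zone *));

        /* kswapd wakes up: tighten thresholds so counter drift stays bounded */
        set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);

        /* kswapd goes back to sleep: restore the normal, cheaper thresholds */
        set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);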
      
      [akpm@linux-foundation.org: readability tweak]
      [kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page allocator: adjust the per-cpu counter threshold when memory is low · 88f5acf8
      Committed by Mel Gorman
      Commit aa454840 ("calculate a better estimate of NR_FREE_PAGES when memory
      is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
      avoid synchronization overhead, these counters are maintained on a per-cpu
      basis and drained both periodically and when the per-cpu delta rises above a
      threshold.  On large CPU systems, the difference between the estimate and
      real value of NR_FREE_PAGES can be very high.  The system can get into a
      case where pages are allocated far below the min watermark potentially
      causing livelock issues.  The commit solved the problem by taking a better
      reading of NR_FREE_PAGES when memory was low.
      
      Unfortunately, as reported by Shaohua Li, this accurate reading can consume a
      large amount of CPU time on systems with many sockets due to cache line
      bouncing.  This patch takes a different approach.  For large machines
      where counter drift might be unsafe and while kswapd is awake, the per-cpu
      thresholds for the target pgdat are reduced to limit the level of drift to
      what should be a safe level.  This incurs a performance penalty in heavy
      memory pressure by a factor that depends on the workload and the machine
      but the machine should function correctly without accidentally exhausting
      all memory on a node.  There is an additional cost when kswapd wakes and
      sleeps, but the event is not expected to be frequent - in Shaohua's test
      case at least, there was only one recorded sleep and wake event.
      
      To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
      introduced that takes a more accurate reading of NR_FREE_PAGES when called
      from wakeup_kswapd, when deciding whether it is really safe to go back to
      sleep in sleeping_prematurely() and when deciding if a zone is really
      balanced or not in balance_pgdat().  We are still using an expensive
      function but limiting how often it is called.
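
      The more accurate reading works roughly as in the sketch below (simplified): the global counter is topped up with the not-yet-drained per-cpu deltas.

        static unsigned long zone_page_state_snapshot(struct zone *zone,
                                                      enum zone_stat_item item)
        {
                long x = atomic_long_read(&zone->vm_stat[item]);
                int cpu;

                /* fold in each cpu's pending delta; this is the expensive part */
                for_each_online_cpu(cpu)
                        x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

                return x < 0 ? 0 : x;
        }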
      
      When the test case is reproduced, the time spent in the watermark
      functions is reduced.  The following report shows the percentage of time
      cumulatively spent in the functions zone_nr_free_pages(),
      zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
      zone_page_state_snapshot() and zone_page_state():
      
      vanilla                      11.6615%
      disable-threshold            0.2584%
      
      David said:
      
      : We had to pull aa454840 "mm: page allocator: calculate a better estimate
      : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
      : internally because tests showed that it would cause the machine to stall
      : as the result of heavy kswapd activity.  I merged it back with this fix as
      : it is pending in the -mm tree and it solves the issue we were seeing, so I
      : definitely think this should be pushed to -stable (and I would seriously
      : consider it for 2.6.37 inclusion even at this late date).
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reported-by: Shaohua Li <shaohua.li@intel.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Tested-by: Nicolas Bareil <nico@chdir.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: <stable@kernel.org>		[2.6.37.1, 2.6.36.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 07 Jan 2011 (3 commits)
    • fs: icache RCU free inodes · fa0d7e3d
      Committed by Nick Piggin
      RCU free the struct inode. This will allow:
      
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      
      The downside of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      
      In cases where inode lifetimes are longer (ie. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
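
      The mechanical part is the usual call_rcu() deferral; a hedged sketch (the i_rcu field and callback name are as used by this series):

        static void i_callback(struct rcu_head *head)
        {
                struct inode *inode = container_of(head, struct inode, i_rcu);
                kmem_cache_free(inode_cachep, inode);
        }

        /* instead of freeing the inode directly, wait out an RCU grace period */
        call_rcu(&inode->i_rcu, i_callback);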
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
    • fs: dcache remove dcache_lock · b5c84bf6
      Committed by Nick Piggin
      dcache_lock no longer protects anything.  Remove it.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
    • kernel: kmem_ptr_validate considered harmful · ccd35fb9
      Committed by Nick Piggin
      This is a nasty and error-prone API.  It is no longer used; remove it.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
  3. 04 Jan 2011 (1 commit)
  4. 31 Dec 2010 (1 commit)
  5. 27 Dec 2010 (1 commit)
  6. 24 Dec 2010 (2 commits)
  7. 23 Dec 2010 (3 commits)
  8. 22 Dec 2010 (1 commit)
  9. 18 Dec 2010 (1 commit)
    • vmstat: User per cpu atomics to avoid interrupt disable / enable · 7c839120
      Committed by Christoph Lameter
      Currently the operations to increment vm counters must disable interrupts
      in order to not mess up their housekeeping of counters.
      
      So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
      count on preemption being disabled, we still have some minor issues.
      The fetching of the counter thresholds is racy.
      A threshold from another cpu may be applied if we happen to be
      rescheduled on another cpu.  However, the following vmstat operation
      will then bring the counter again under the threshold limit.
      
      The operations for __xxx_zone_state are not changed since the caller
      has taken care of the synchronization needs (and therefore the cycle
      count is even less than the optimized version for the irq disable case
      provided here).
      
      The optimization using this_cpu_cmpxchg will only be used if the arch
      supports efficient this_cpu_ops (must have CONFIG_CMPXCHG_LOCAL set!)
      
      The use of this_cpu_cmpxchg reduces the cycle count for the counter
      operations by 80% (inc_zone_page_state goes from 170 cycles to 32).
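
      A simplified, hedged sketch of the lockless update loop (the real code also handles the overstep modes used for watermark checks):

        struct per_cpu_pageset __percpu *pcp = zone->pageset;
        long o, n, t, z;

        do {
                z = 0;
                t = this_cpu_read(pcp->stat_threshold);
                o = this_cpu_read(pcp->vm_stat_diff[item]);
                n = o + delta;

                if (n > t || n < -t) {
                        /* overflow: fold the whole delta into the zone counter */
                        z = n;
                        n = 0;
                }
                /* retried if we migrated cpus or raced with another update */
        } while (this_cpu_cmpxchg(pcp->vm_stat_diff[item], o, n) != o);

        if (z)
                zone_page_state_add(z, zone, item);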
      Signed-off-by: Christoph Lameter <cl@linux.com>
  10. 17 Dec 2010 (3 commits)
  11. 16 Dec 2010 (1 commit)
    • install_special_mapping skips security_file_mmap check. · 462e635e
      Committed by Tavis Ormandy
      The install_special_mapping routine (used, for example, to setup the
      vdso) skips the security check before insert_vm_struct, allowing a local
      attacker to bypass the mmap_min_addr security restriction by limiting
      the available pages for special mappings.
      
      bprm_mm_init() also skips the check, and although I don't think this can
      be used to bypass any restrictions, I don't see any reason not to have
      the security check.
      
        $ uname -m
        x86_64
        $ cat /proc/sys/vm/mmap_min_addr
        65536
        $ cat install_special_mapping.s
        section .bss
            resb BSS_SIZE
        section .text
            global _start
            _start:
                mov     eax, __NR_pause
                int     0x80
        $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
        $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
        $ ./install_special_mapping &
        [1] 14303
        $ cat /proc/14303/maps
        0000f000-00010000 r-xp 00000000 00:00 0                                  [vdso]
        00010000-00011000 r-xp 00001000 00:19 2453665                            /home/taviso/install_special_mapping
        00011000-ffffe000 rwxp 00000000 00:00 0                                  [stack]
      
      It's worth noting that Red Hat are shipping with mmap_min_addr set to
      4096.
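
      A hedged sketch of the shape of the fix, using the LSM hook signature of that era (details simplified; the real patch also keeps the error code per the akpm note below):

        int install_special_mapping(struct mm_struct *mm,
                                    unsigned long addr, unsigned long len,
                                    unsigned long vm_flags, struct page **pages)
        {
                struct vm_area_struct *vma;
                int ret;
                ...
                /* previously missing: enforce mmap_min_addr and LSM policy */
                ret = security_file_mmap(NULL, 0, 0, 0, vma->vm_start, 1);
                if (ret)
                        goto out;

                ret = insert_vm_struct(mm, vma);
                if (ret)
                        goto out;
                ...
        }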
      Signed-off-by: Tavis Ormandy <taviso@google.com>
      Acked-by: Kees Cook <kees@ubuntu.com>
      Acked-by: Robert Swiecki <swiecki@google.com>
      [ Changed to not drop the error code - akpm ]
      Reviewed-by: James Morris <jmorris@namei.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 15 Dec 2010 (1 commit)
    • workqueue: convert cancel_rearming_delayed_work[queue]() users to cancel_delayed_work_sync() · afe2c511
      Committed by Tejun Heo
      cancel_rearming_delayed_work[queue]() has been superseded by
      cancel_delayed_work_sync() quite some time ago.  Convert all the
      in-kernel users.  The conversions are completely equivalent and
      trivial.
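
      Each conversion is a one-line substitution of the following form (the struct and field names here are hypothetical):

        /* before */
        cancel_rearming_delayed_workqueue(dev->wq, &dev->poll_work);

        /* after */
        cancel_delayed_work_sync(&dev->poll_work);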
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
      Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Jeff Garzik <jgarzik@pobox.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: netdev@vger.kernel.org
      Cc: Anton Vorontsov <cbou@mail.ru>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: xfs-masters@oss.sgi.com
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: netfilter-devel@vger.kernel.org
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: linux-nfs@vger.kernel.org
  13. 07 Dec 2010 (2 commits)
  14. 04 Dec 2010 (2 commits)
    • slub: Fix a crash during slabinfo -v · 37d57443
      Committed by Tero Roponen
      Commit f7cb1933 ("SLUB: Pass active
      and inactive redzone flags instead of boolean to debug functions")
      missed two instances of check_object(). This caused a lot of warnings
      during 'slabinfo -v' finally leading to a crash:
      
        BUG ext4_xattr: Freepointer corrupt
        ...
        BUG buffer_head: Freepointer corrupt
        ...
        BUG ext4_alloc_context: Freepointer corrupt
        ...
        ...
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810a291f>] file_sb_list_del+0x1c/0x35
        PGD 79d78067 PUD 79e67067 PMD 0
        Oops: 0002 [#1] SMP
        last sysfs file: /sys/kernel/slab/:t-0000192/validate
      
      This patch fixes the problem by converting the two missed instances.
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Tero Roponen <tero.roponen@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • slub: Fix a crash during slabinfo -v · 8165984a
      Committed by Tero Roponen
      Commit f7cb1933 ("SLUB: Pass active
      and inactive redzone flags instead of boolean to debug functions")
      missed two instances of check_object(). This caused a lot of warnings
      during 'slabinfo -v' finally leading to a crash:
      
        BUG ext4_xattr: Freepointer corrupt
        ...
        BUG buffer_head: Freepointer corrupt
        ...
        BUG ext4_alloc_context: Freepointer corrupt
        ...
        ...
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810a291f>] file_sb_list_del+0x1c/0x35
        PGD 79d78067 PUD 79e67067 PMD 0
        Oops: 0002 [#1] SMP
        last sysfs file: /sys/kernel/slab/:t-0000192/validate
      
      This patch fixes the problem by converting the two missed instances.
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Tero Roponen <tero.roponen@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
  15. 03 Dec 2010 (6 commits)
    • ksm: annotate ksm_thread_mutex is no deadlock source · a0b0f58c
      Committed by KOSAKI Motohiro
      Commit 62b61f61 ("ksm: memory hotremove migration only") caused the
      following new lockdep warning.
      
        =======================================================
        [ INFO: possible circular locking dependency detected ]
        -------------------------------------------------------
        bash/1621 is trying to acquire lock:
         ((memory_chain).rwsem){.+.+.+}, at: [<ffffffff81079339>]
        __blocking_notifier_call_chain+0x69/0xc0
      
        but task is already holding lock:
         (ksm_thread_mutex){+.+.+.}, at: [<ffffffff8113a3aa>]
        ksm_memory_callback+0x3a/0xc0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (ksm_thread_mutex){+.+.+.}:
             [<ffffffff8108b70a>] lock_acquire+0xaa/0x140
             [<ffffffff81505d74>] __mutex_lock_common+0x44/0x3f0
             [<ffffffff81506228>] mutex_lock_nested+0x48/0x60
             [<ffffffff8113a3aa>] ksm_memory_callback+0x3a/0xc0
             [<ffffffff8150c21c>] notifier_call_chain+0x8c/0xe0
             [<ffffffff8107934e>] __blocking_notifier_call_chain+0x7e/0xc0
             [<ffffffff810793a6>] blocking_notifier_call_chain+0x16/0x20
             [<ffffffff813afbfb>] memory_notify+0x1b/0x20
             [<ffffffff81141b7c>] remove_memory+0x1cc/0x5f0
             [<ffffffff813af53d>] memory_block_change_state+0xfd/0x1a0
             [<ffffffff813afd62>] store_mem_state+0xe2/0xf0
             [<ffffffff813a0bb0>] sysdev_store+0x20/0x30
             [<ffffffff811bc116>] sysfs_write_file+0xe6/0x170
             [<ffffffff8114f398>] vfs_write+0xc8/0x190
             [<ffffffff8114fc14>] sys_write+0x54/0x90
             [<ffffffff810028b2>] system_call_fastpath+0x16/0x1b
      
        -> #0 ((memory_chain).rwsem){.+.+.+}:
             [<ffffffff8108b5ba>] __lock_acquire+0x155a/0x1600
             [<ffffffff8108b70a>] lock_acquire+0xaa/0x140
             [<ffffffff81506601>] down_read+0x51/0xa0
             [<ffffffff81079339>] __blocking_notifier_call_chain+0x69/0xc0
             [<ffffffff810793a6>] blocking_notifier_call_chain+0x16/0x20
             [<ffffffff813afbfb>] memory_notify+0x1b/0x20
             [<ffffffff81141f1e>] remove_memory+0x56e/0x5f0
             [<ffffffff813af53d>] memory_block_change_state+0xfd/0x1a0
             [<ffffffff813afd62>] store_mem_state+0xe2/0xf0
             [<ffffffff813a0bb0>] sysdev_store+0x20/0x30
             [<ffffffff811bc116>] sysfs_write_file+0xe6/0x170
             [<ffffffff8114f398>] vfs_write+0xc8/0x190
             [<ffffffff8114fc14>] sys_write+0x54/0x90
             [<ffffffff810028b2>] system_call_fastpath+0x16/0x1b
      
      But it's a false positive.  Both memory_chain.rwsem and ksm_thread_mutex
      have an outer lock (mem_hotplug_mutex), so they cannot deadlock.

      Thus, this patch annotates ksm_thread_mutex as not being a deadlock source.
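
      A hedged sketch of the annotation, using the standard lockdep nesting idiom at the ksm_memory_callback() site:

        case MEM_GOING_OFFLINE:
                /* nested under mem_hotplug_mutex by design; silence lockdep */
                mutex_lock_nested(&ksm_thread_mutex, SINGLE_DEPTH_NESTING);
                ...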
      
      [akpm@linux-foundation.org: update comment, from Hugh]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mem-hotplug: introduce {un}lock_memory_hotplug() · 20d6c96b
      Committed by KOSAKI Motohiro
      Presently hwpoison is using lock_system_sleep() to prevent a race with
      memory hotplug.  However lock_system_sleep() is a no-op if
      CONFIG_HIBERNATION=n.  Therefore we need a new lock.
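
      A hedged sketch of the new pair, a dedicated mutex serialising against memory hotplug (the real helpers live in mm/memory_hotplug.c):

        static DEFINE_MUTEX(mem_hotplug_mutex);

        void lock_memory_hotplug(void)
        {
                mutex_lock(&mem_hotplug_mutex);
        }

        void unlock_memory_hotplug(void)
        {
                mutex_unlock(&mem_hotplug_mutex);
        }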
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmalloc: eagerly clear ptes on vunmap · 64141da5
      Committed by Jeremy Fitzhardinge
      On stock 2.6.37-rc4, running:
      
        # mount lilith:/export /mnt/lilith
        # find  /mnt/lilith/ -type f -print0 | xargs -0 file
      
      crashes the machine fairly quickly under Xen.  Often it results in oops
      messages, but the couple of times I tried just now, it just hung quietly
      and made Xen print some rude messages:
      
          (XEN) mm.c:2389:d80 Bad type (saw 7400000000000001 != exp
          3000000000000000) for mfn 1d7058 (pfn 18fa7)
          (XEN) mm.c:964:d80 Attempt to create linear p.t. with write perms
          (XEN) mm.c:2389:d80 Bad type (saw 7400000000000010 != exp
          1000000000000000) for mfn 1d2e04 (pfn 1d1fb)
          (XEN) mm.c:2965:d80 Error while pinning mfn 1d2e04
      
      Which means the domain tried to map a pagetable page RW, which would
      allow it to map arbitrary memory, so Xen stopped it.  This is because
      vm_unmap_ram() left some pages mapped in the vmalloc area after NFS had
      finished with them, and those pages got recycled as pagetable pages
      while still having these RW aliases.
      
      Removing those mappings immediately removes the Xen-visible aliases, and
      so it has no problem with those pages being reused as pagetable pages.
      Deferring the TLB flush doesn't upset Xen because it can flush the TLB
      itself as needed to maintain its invariants.
      
      When unmapping a region in the vmalloc space, clear the ptes
      immediately.  There's no point in deferring this because there's no
      amortization benefit.
      
      The TLBs are left dirty, and they are flushed lazily to amortize the
      cost of the IPIs.
      
      The specific motivation for this patch is an oops-causing regression
      since 2.6.36 when using NFS under Xen, triggered by the NFS client's use
      of vm_map_ram() introduced in 56e4ebf8 ("NFS: readdir with vmapped
      pages").  XFS also uses vm_map_ram() and could cause similar problems.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Bryan Schumaker <bjschuma@netapp.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmstat: fix dirty threshold ordering · e172662d
      Committed by Wu Fengguang
      The nr_dirty_[background_]threshold fields are misplaced before the
      numa_* fields, and users will read strange values.
      
      Below is the correct order.  Before the patch, nr_dirty_background_threshold
      would read as 0 (the value from numa_miss).
      
      	numa_hit 128501
      	numa_miss 0
      	numa_foreign 0
      	numa_interleave 7388
      	numa_local 128501
      	numa_other 0
      	nr_dirty_threshold 144291
      	nr_dirty_background_threshold 72145
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Michael Rubin <mrubin@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy.c: add rcu read lock to protect pid structure · 55cfaa3c
      Committed by Zeng Zhaoming
      find_task_by_vpid() should be protected by rcu_read_lock(), to prevent
      free_pid() reclaiming pid.
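
      A hedged sketch of the general pattern (details at the real call site in mm/mempolicy.c differ slightly):

        rcu_read_lock();
        task = pid ? find_task_by_vpid(pid) : current;
        if (!task) {
                rcu_read_unlock();
                return -ESRCH;
        }
        get_task_struct(task);          /* pin it before leaving the RCU section */
        rcu_read_unlock();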
      Signed-off-by: Zeng Zhaoming <zengzm.kernel@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb.c: avoid double unlock_page() in hugetlb_fault() · 1f64d69c
      Committed by Dean Nelson
      Have hugetlb_fault() call unlock_page(page) only if it had previously
      called lock_page(page).
      
      Setting CONFIG_DEBUG_VM=y and then running the libhugetlbfs test suite
      tripped the VM_BUG_ON(!PageLocked(page)) in unlock_page(), which had been
      called by hugetlb_fault() with page == pagecache_page.  This patch
      remedies the problem.
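
      The fix boils down to making the unlock conditional on the same test that guarded the lock; a sketch of the exit path:

        if (page != pagecache_page)
                unlock_page(page);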
      Signed-off-by: Dean Nelson <dnelson@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 02 Dec 2010 (1 commit)
    • Call the filesystem back whenever a page is removed from the page cache · 6072d13c
      Committed by Linus Torvalds
      NFS needs to be able to release objects that are stored in the page
      cache once the page itself is no longer visible from the page cache.
      
      This patch adds a callback to the address space operations that allows
      filesystems to perform page cleanups once the page has been removed
      from the page cache.
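
      Concretely, a new hook is added to struct address_space_operations; a hedged sketch of the hook and of a filesystem opting in (the filesystem and callback names are hypothetical):

        struct address_space_operations {
                ...
                int (*releasepage)(struct page *, gfp_t);
                void (*freepage)(struct page *);  /* page is gone from the page cache */
                ...
        };

        static const struct address_space_operations examplefs_aops = {
                ...
                .freepage = examplefs_freepage,
        };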
      
      Original patch by: Linus Torvalds <torvalds@linux-foundation.org>
      [trondmy: cover the cases of invalidate_inode_pages2() and
                truncate_inode_pages()]
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
  17. 29 Nov 2010 (2 commits)
    • Kill off a bunch of warning: ‘inline’ is not at beginning of declaration · fa9f90be
      Committed by Jesper Juhl
      These warnings are spewed during a build of an 'allnoconfig' kernel
      (especially the ones from u64_stats_sync.h show up a lot) when building
      with -Wextra (which I often do).
      They are
        a) annoying
        b) easy to get rid of.
      This patch kills them off.
      
      include/linux/u64_stats_sync.h:70:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:77:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:84:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:96:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:115:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:127:1: warning: ‘inline’ is not at beginning of declaration
      kernel/time.c:241:1: warning: ‘inline’ is not at beginning of declaration
      kernel/time.c:257:1: warning: ‘inline’ is not at beginning of declaration
      kernel/perf_event.c:4513:1: warning: ‘inline’ is not at beginning of declaration
      mm/page_alloc.c:4012:1: warning: ‘inline’ is not at beginning of declaration
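
      The fix is purely a matter of keyword order, e.g. (hypothetical function):

        /* warns with -Wextra: 'inline' is not at beginning of declaration */
        static void inline foo(void) { }

        /* quiet */
        static inline void foo(void) { }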
      Signed-off-by: Jesper Juhl <jj@chaosbits.net>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
    • tracing/slab: Move kmalloc tracepoint out of inline code · 85beb586
      Committed by Steven Rostedt
      The tracepoint for kmalloc is in the slab inlined code which causes
      every instance of kmalloc to have the tracepoint.
      
      This patch moves the tracepoint out of the inline code to the
      slab C file, which removes a large number of inlined trace
      points.
      
        objdump -dr vmlinux.slab| grep 'jmpq.*<trace_kmalloc' |wc -l
      213
        objdump -dr vmlinux.slab.patched| grep 'jmpq.*<trace_kmalloc' |wc -l
      1
      
      This also has a nice impact on size.
      
         text	   data	    bss	    dec	    hex	filename
      7023060	2121564	2482432	11627056	 b16a30	vmlinux.slab
      6970579	2109772	2482432	11562783	 b06f1f	vmlinux.slab.patched
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
  18. 25 Nov 2010 (6 commits)
    • mm: remove call to find_vma in pagewalk for non-hugetlbfs · 5f0af70a
      Committed by David Sterba
      Commit d33b9f45 ("mm: hugetlb: fix hugepage memory leak in
      walk_page_range()") introduced a check whether a vma is a hugetlbfs one;
      later, 5dc37642 ("mm hugetlb: add hugepage support to pagemap") moved it
      under #ifdef CONFIG_HUGETLB_PAGE, but a needless find_vma call was left
      behind and its result is not used anywhere else in the function.
      
      The side-effect of caching vma for @addr inside walk->mm is neither
      utilized in walk_page_range() nor in called functions.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: fix build_all_zonelist() where percpu_alloc() is wrongly... · e9959f0f
      Committed by KAMEZAWA Hiroyuki
      mm/page_alloc.c: fix build_all_zonelist() where percpu_alloc() is wrongly called under stop_machine_run()
      
      During memory hotplug, build_all_zonelists() may be called under
      stop_machine_run().  In this function, setup_zone_pageset() is called.
      But this is a bug because it will do page allocation under stop_machine_run().
      
      Here is a report from Alok Kataria.
      
        BUG: sleeping function called from invalid context at kernel/mutex.c:94
        in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
        Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
        Call Trace:
         [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
         [<ffffffff81468245>] mutex_lock+0x24/0x50
         [<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
         [<ffffffff81048888>] ? load_balance+0xbe/0x60e
         [<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
         [<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
         [<ffffffff8110f237>] __alloc_percpu+0x10/0x12
         [<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
         [<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
         [<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
         [<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
         [<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
         [<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
         [<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
         [<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
         [<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
         [<ffffffff81065f29>] kthread+0x7f/0x87
         [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
         [<ffffffff81065eaa>] ? kthread+0x0/0x87
         [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
        Built 5 zonelists in Node order, mobility grouping on.  Total pages: 289456
        Policy zone: Normal
      
      This patch fixes the issue by moving setup_zone_pageset() out from under
      stop_machine_run(); there is obviously no need for it to be called there.
      
      [akpm@linux-foundation.org: remove unneeded local]
      Reported-by: Alok Kataria <akataria@vmware.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Petr Vandrovec <petr@vmware.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: make swap accounting default behavior configurable · a42c390c
      Committed by Michal Hocko
      Swap accounting can be configured by the CONFIG_CGROUP_MEM_RES_CTLR_SWAP
      configuration option and is then turned on by default.  There is a boot
      option (noswapaccount) which can disable this feature.
      
      This makes it hard for distributors to enable the configuration option as
      this feature leads to bigger memory consumption, which is a no-go for a
      general-purpose distribution kernel.  On the other hand, swap accounting
      may be very useful for some workloads.
      
      This patch adds a new configuration option which controls the default
      behavior (CGROUP_MEM_RES_CTLR_SWAP_ENABLED).  If the option is selected
      then the feature is turned on by default.
      
      It also adds a new boot parameter, swapaccount[=1|0], which extends the
      original noswapaccount parameter's semantics with enable/disable logic
      (it defaults to 1 if no value is provided, to stay consistent with
      noswapaccount).
      
      The default behavior is unchanged (if CONFIG_CGROUP_MEM_RES_CTLR_SWAP is
      enabled then CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED is enabled as well)
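
      A hedged sketch of the boot-parameter side (written here with __setup("swapaccount=", ...) so the handler sees just the value; the real patch's parsing may differ in detail):

        #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED
        static int really_do_swap_account __initdata = 1;
        #else
        static int really_do_swap_account __initdata = 0;
        #endif

        static int __init enable_swap_account(char *s)
        {
                if (!strcmp(s, "1"))
                        really_do_swap_account = 1;
                else if (!strcmp(s, "0"))
                        really_do_swap_account = 0;
                return 1;
        }
        __setup("swapaccount=", enable_swap_account);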
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: avoid deadlock between move charge and try_charge() · b1dd693e
      Committed by Daisuke Nishimura
      __mem_cgroup_try_charge() can be called under down_write(&mmap_sem) (e.g.
      mlock does it).  This means it can cause a deadlock if it races with move charge:
      
      Ex.1)
                      move charge             |        try charge
        --------------------------------------+------------------------------
          mem_cgroup_can_attach()             |  down_write(&mmap_sem)
            mc.moving_task = current          |    ..
            mem_cgroup_precharge_mc()         |  __mem_cgroup_try_charge()
              mem_cgroup_count_precharge()    |    prepare_to_wait()
                down_read(&mmap_sem)          |    if (mc.moving_task)
                -> cannot acquire the lock    |    -> true
                                              |      schedule()
      
      Ex.2)
                      move charge             |        try charge
        --------------------------------------+------------------------------
          mem_cgroup_can_attach()             |
            mc.moving_task = current          |
            mem_cgroup_precharge_mc()         |
              mem_cgroup_count_precharge()    |
                down_read(&mmap_sem)          |
                ..                            |
                up_read(&mmap_sem)            |
                                              |  down_write(&mmap_sem)
          mem_cgroup_move_task()              |    ..
            mem_cgroup_move_charge()          |  __mem_cgroup_try_charge()
              down_read(&mmap_sem)            |    prepare_to_wait()
              -> cannot acquire the lock      |    if (mc.moving_task)
                                              |    -> true
                                              |      schedule()
      
      To avoid this deadlock, we do all the move charge work (both can_attach() and
      attach()) under one mmap_sem section.
      And after this patch, we set/clear mc.moving_task outside mc.lock, because we
      use the lock only to check mc.from/to.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix false positive VM_BUG on non-SMP · 112bc2e1
      Committed by Kirill A. Shutemov
      Fix this:
      
        kernel BUG at mm/memcontrol.c:2155!
        invalid opcode: 0000 [#1]
        last sysfs file:
      
        Pid: 18, comm: sh Not tainted 2.6.37-rc3 #3 /Bochs
        EIP: 0060:[<c10731b2>] EFLAGS: 00000246 CPU: 0
        EIP is at mem_cgroup_move_account+0xe2/0xf0
        EAX: 00000004 EBX: c6f931d4 ECX: c681c300 EDX: c681c000
        ESI: c681c300 EDI: ffffffea EBP: c681c000 ESP: c46f3e30
         DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
        Process sh (pid: 18, ti=c46f2000 task=c6826e60 task.ti=c46f2000)
        Stack:
         00000155 c681c000 0805f000 c46ee180 c46f3e5c c7058820 c1074d37 00000000
         08060000 c46db9a0 c46ec080 c7058820 0805f000 08060000 c46f3e98 c1074c50
         c106c75e c46f3e98 c46ec080 08060000 0805ffff c46db9a0 c46f3e98 c46e0340
        Call Trace:
         [<c1074d37>] ? mem_cgroup_move_charge_pte_range+0xe7/0x130
         [<c1074c50>] ? mem_cgroup_move_charge_pte_range+0x0/0x130
         [<c106c75e>] ? walk_page_range+0xee/0x1d0
         [<c10725d6>] ? mem_cgroup_move_task+0x66/0x90
         [<c1074c50>] ? mem_cgroup_move_charge_pte_range+0x0/0x130
         [<c1072570>] ? mem_cgroup_move_task+0x0/0x90
         [<c1042616>] ? cgroup_attach_task+0x136/0x200
         [<c1042878>] ? cgroup_tasks_write+0x48/0xc0
         [<c1041e9e>] ? cgroup_file_write+0xde/0x220
         [<c101398d>] ? do_page_fault+0x17d/0x3f0
         [<c108a79d>] ? alloc_fd+0x2d/0xd0
         [<c1041dc0>] ? cgroup_file_write+0x0/0x220
         [<c1077ba2>] ? vfs_write+0x92/0xc0
         [<c1077c81>] ? sys_write+0x41/0x70
         [<c1140e3d>] ? syscall_call+0x7/0xb
        Code: 03 00 74 09 8b 44 24 04 e8 1c f1 ff ff 89 73 04 8d 86 b0 00 00 00 b9 01 00 00 00 89 da 31 ff e8 65 f5 ff ff e9 4d ff ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 90 8d b4 26 00 00 00 00 83 ec 10 8b 0d f4 e3
        EIP: [<c10731b2>] mem_cgroup_move_account+0xe2/0xf0 SS:ESP 0068:c46f3e30
        ---[ end trace 7daa1582159b6532 ]---
      
      lock_page_cgroup and unlock_page_cgroup are implemented using
      bit_spinlock.  bit_spinlock doesn't touch the bit if we are on a non-SMP
      machine, so we can't use the bit to check whether the lock was taken.
      
      Let's introduce is_page_cgroup_locked based on bit_spin_is_locked instead
      of PageCgroupLocked to fix it.
      
      [akpm@linux-foundation.org: s/is_page_cgroup_locked/page_is_cgroup_locked/]
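
      A hedged sketch of the helper (name as per the rename noted above) and of the assertion that now uses it:

        static inline int page_is_cgroup_locked(struct page_cgroup *pc)
        {
                return bit_spin_is_locked(PCG_LOCK, &pc->flags);
        }

        /* in mem_cgroup_move_account() */
        VM_BUG_ON(!page_is_cgroup_locked(pc));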
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • nommu: yield CPU while disposing VM · 04c34961
      Committed by Steven J. Magnani
      Depending on processor speed, page size, and the amount of memory a
      process is allowed to amass, cleanup of a large VM may freeze the system
      for many seconds.  This can result in a watchdog timeout.
      
      Make sure other tasks receive some service when cleaning up large VMs.
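
      The fix amounts to a cond_resched() in the VMA teardown loop; a hedged sketch (the nommu helper names are assumptions based on mm/nommu.c of that era):

        while ((vma = mm->mmap)) {
                mm->mmap = vma->vm_next;
                delete_vma_from_mm(vma);
                delete_vma(mm, vma);
                cond_resched();         /* let other tasks run; big VMs take a while */
        }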
      Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>