1. 24 12月, 2010 2 次提交
  2. 23 12月, 2010 3 次提交
  3. 16 12月, 2010 1 次提交
    • T
      install_special_mapping skips security_file_mmap check. · 462e635e
      Tavis Ormandy 提交于
      The install_special_mapping routine (used, for example, to setup the
      vdso) skips the security check before insert_vm_struct, allowing a local
      attacker to bypass the mmap_min_addr security restriction by limiting
      the available pages for special mappings.
      
      bprm_mm_init() also skips the check, and although I don't think this can
      be used to bypass any restrictions, I don't see any reason not to have
      the security check.
      
        $ uname -m
        x86_64
        $ cat /proc/sys/vm/mmap_min_addr
        65536
        $ cat install_special_mapping.s
        section .bss
            resb BSS_SIZE
        section .text
            global _start
            _start:
                mov     eax, __NR_pause
                int     0x80
        $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
        $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
        $ ./install_special_mapping &
        [1] 14303
        $ cat /proc/14303/maps
        0000f000-00010000 r-xp 00000000 00:00 0                                  [vdso]
        00010000-00011000 r-xp 00001000 00:19 2453665                            /home/taviso/install_special_mapping
        00011000-ffffe000 rwxp 00000000 00:00 0                                  [stack]
      
      It's worth noting that Red Hat are shipping with mmap_min_addr set to
      4096.
      Signed-off-by: NTavis Ormandy <taviso@google.com>
      Acked-by: NKees Cook <kees@ubuntu.com>
      Acked-by: NRobert Swiecki <swiecki@google.com>
      [ Changed to not drop the error code - akpm ]
      Reviewed-by: NJames Morris <jmorris@namei.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      462e635e
  4. 07 12月, 2010 1 次提交
    • R
      PM / Hibernate: Fix memory corruption related to swap · c9e664f1
      Rafael J. Wysocki 提交于
      There is a problem that swap pages allocated before the creation of
      a hibernation image can be released and used for storing the contents
      of different memory pages while the image is being saved.  Since the
      kernel stored in the image doesn't know of that, it causes memory
      corruption to occur after resume from hibernation, especially on
      systems with relatively small RAM that need to swap often.
      
      This issue can be addressed by keeping the GFP_IOFS bits clear
      in gfp_allowed_mask during the entire hibernation, including the
      saving of the image, until the system is finally turned off or
      the hibernation is aborted.  Unfortunately, for this purpose
      it's necessary to rework the way in which the hibernate and
      suspend code manipulates gfp_allowed_mask.
      
      This change is based on an earlier patch from Hugh Dickins.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Reported-by: NOndrej Zary <linux@rainbow-software.org>
      Acked-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: stable@kernel.org
      c9e664f1
  5. 04 12月, 2010 1 次提交
    • T
      slub: Fix a crash during slabinfo -v · 37d57443
      Tero Roponen 提交于
      Commit f7cb1933 ("SLUB: Pass active
      and inactive redzone flags instead of boolean to debug functions")
      missed two instances of check_object(). This caused a lot of warnings
      during 'slabinfo -v' finally leading to a crash:
      
        BUG ext4_xattr: Freepointer corrupt
        ...
        BUG buffer_head: Freepointer corrupt
        ...
        BUG ext4_alloc_context: Freepointer corrupt
        ...
        ...
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810a291f>] file_sb_list_del+0x1c/0x35
        PGD 79d78067 PUD 79e67067 PMD 0
        Oops: 0002 [#1] SMP
        last sysfs file: /sys/kernel/slab/:t-0000192/validate
      
      This patch fixes the problem by converting the two missed instances.
      Acked-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NTero Roponen <tero.roponen@gmail.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      37d57443
  6. 03 12月, 2010 6 次提交
    • K
      ksm: annotate ksm_thread_mutex is no deadlock source · a0b0f58c
      KOSAKI Motohiro 提交于
      commit 62b61f61 ("ksm: memory hotremove migration only") caused the
      following new lockdep warning.
      
        =======================================================
        [ INFO: possible circular locking dependency detected ]
        -------------------------------------------------------
        bash/1621 is trying to acquire lock:
         ((memory_chain).rwsem){.+.+.+}, at: [<ffffffff81079339>]
        __blocking_notifier_call_chain+0x69/0xc0
      
        but task is already holding lock:
         (ksm_thread_mutex){+.+.+.}, at: [<ffffffff8113a3aa>]
        ksm_memory_callback+0x3a/0xc0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (ksm_thread_mutex){+.+.+.}:
             [<ffffffff8108b70a>] lock_acquire+0xaa/0x140
             [<ffffffff81505d74>] __mutex_lock_common+0x44/0x3f0
             [<ffffffff81506228>] mutex_lock_nested+0x48/0x60
             [<ffffffff8113a3aa>] ksm_memory_callback+0x3a/0xc0
             [<ffffffff8150c21c>] notifier_call_chain+0x8c/0xe0
             [<ffffffff8107934e>] __blocking_notifier_call_chain+0x7e/0xc0
             [<ffffffff810793a6>] blocking_notifier_call_chain+0x16/0x20
             [<ffffffff813afbfb>] memory_notify+0x1b/0x20
             [<ffffffff81141b7c>] remove_memory+0x1cc/0x5f0
             [<ffffffff813af53d>] memory_block_change_state+0xfd/0x1a0
             [<ffffffff813afd62>] store_mem_state+0xe2/0xf0
             [<ffffffff813a0bb0>] sysdev_store+0x20/0x30
             [<ffffffff811bc116>] sysfs_write_file+0xe6/0x170
             [<ffffffff8114f398>] vfs_write+0xc8/0x190
             [<ffffffff8114fc14>] sys_write+0x54/0x90
             [<ffffffff810028b2>] system_call_fastpath+0x16/0x1b
      
        -> #0 ((memory_chain).rwsem){.+.+.+}:
             [<ffffffff8108b5ba>] __lock_acquire+0x155a/0x1600
             [<ffffffff8108b70a>] lock_acquire+0xaa/0x140
             [<ffffffff81506601>] down_read+0x51/0xa0
             [<ffffffff81079339>] __blocking_notifier_call_chain+0x69/0xc0
             [<ffffffff810793a6>] blocking_notifier_call_chain+0x16/0x20
             [<ffffffff813afbfb>] memory_notify+0x1b/0x20
             [<ffffffff81141f1e>] remove_memory+0x56e/0x5f0
             [<ffffffff813af53d>] memory_block_change_state+0xfd/0x1a0
             [<ffffffff813afd62>] store_mem_state+0xe2/0xf0
             [<ffffffff813a0bb0>] sysdev_store+0x20/0x30
             [<ffffffff811bc116>] sysfs_write_file+0xe6/0x170
             [<ffffffff8114f398>] vfs_write+0xc8/0x190
             [<ffffffff8114fc14>] sys_write+0x54/0x90
             [<ffffffff810028b2>] system_call_fastpath+0x16/0x1b
      
      But it's a false positive.  Both memory_chain.rwsem and ksm_thread_mutex
      have an outer lock (mem_hotplug_mutex).  So they cannot deadlock.
      
      Thus, This patch annotate ksm_thread_mutex is not deadlock source.
      
      [akpm@linux-foundation.org: update comment, from Hugh]
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0b0f58c
    • K
      mem-hotplug: introduce {un}lock_memory_hotplug() · 20d6c96b
      KOSAKI Motohiro 提交于
      Presently hwpoison is using lock_system_sleep() to prevent a race with
      memory hotplug.  However lock_system_sleep() is a no-op if
      CONFIG_HIBERNATION=n.  Therefore we need a new lock.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Suggested-by: NHugh Dickins <hughd@google.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      20d6c96b
    • J
      vmalloc: eagerly clear ptes on vunmap · 64141da5
      Jeremy Fitzhardinge 提交于
      On stock 2.6.37-rc4, running:
      
        # mount lilith:/export /mnt/lilith
        # find  /mnt/lilith/ -type f -print0 | xargs -0 file
      
      crashes the machine fairly quickly under Xen.  Often it results in oops
      messages, but the couple of times I tried just now, it just hung quietly
      and made Xen print some rude messages:
      
          (XEN) mm.c:2389:d80 Bad type (saw 7400000000000001 != exp
          3000000000000000) for mfn 1d7058 (pfn 18fa7)
          (XEN) mm.c:964:d80 Attempt to create linear p.t. with write perms
          (XEN) mm.c:2389:d80 Bad type (saw 7400000000000010 != exp
          1000000000000000) for mfn 1d2e04 (pfn 1d1fb)
          (XEN) mm.c:2965:d80 Error while pinning mfn 1d2e04
      
      Which means the domain tried to map a pagetable page RW, which would
      allow it to map arbitrary memory, so Xen stopped it.  This is because
      vm_unmap_ram() left some pages mapped in the vmalloc area after NFS had
      finished with them, and those pages got recycled as pagetable pages
      while still having these RW aliases.
      
      Removing those mappings immediately removes the Xen-visible aliases, and
      so it has no problem with those pages being reused as pagetable pages.
      Deferring the TLB flush doesn't upset Xen because it can flush the TLB
      itself as needed to maintain its invariants.
      
      When unmapping a region in the vmalloc space, clear the ptes
      immediately.  There's no point in deferring this because there's no
      amortization benefit.
      
      The TLBs are left dirty, and they are flushed lazily to amortize the
      cost of the IPIs.
      
      This specific motivation for this patch is an oops-causing regression
      since 2.6.36 when using NFS under Xen, triggered by the NFS client's use
      of vm_map_ram() introduced in 56e4ebf8 ("NFS: readdir with vmapped
      pages") .  XFS also uses vm_map_ram() and could cause similar problems.
      Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Bryan Schumaker <bjschuma@netapp.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64141da5
    • W
      vmstat: fix dirty threshold ordering · e172662d
      Wu Fengguang 提交于
      The nr_dirty_[background_]threshold fields are misplaced before the
      numa_* fields, and users will read strange values.
      
      This is the right order.  Before patch, nr_dirty_background_threshold
      will read as 0 (the value from numa_miss).
      
      	numa_hit 128501
      	numa_miss 0
      	numa_foreign 0
      	numa_interleave 7388
      	numa_local 128501
      	numa_other 0
      	nr_dirty_threshold 144291
      	nr_dirty_background_threshold 72145
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Michael Rubin <mrubin@google.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e172662d
    • Z
      mm/mempolicy.c: add rcu read lock to protect pid structure · 55cfaa3c
      Zeng Zhaoming 提交于
      find_task_by_vpid() should be protected by rcu_read_lock(), to prevent
      free_pid() reclaiming pid.
      Signed-off-by: NZeng Zhaoming <zengzm.kernel@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      55cfaa3c
    • D
      mm/hugetlb.c: avoid double unlock_page() in hugetlb_fault() · 1f64d69c
      Dean Nelson 提交于
      Have hugetlb_fault() call unlock_page(page) only if it had previously
      called lock_page(page).
      
      Setting CONFIG_DEBUG_VM=y and then running the libhugetlbfs test suite,
      resulted in the tripping of VM_BUG_ON(!PageLocked(page)) in
      unlock_page() having been called by hugetlb_fault() when page ==
      pagecache_page.  This patch remedied the problem.
      Signed-off-by: NDean Nelson <dnelson@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f64d69c
  7. 02 12月, 2010 1 次提交
    • L
      Call the filesystem back whenever a page is removed from the page cache · 6072d13c
      Linus Torvalds 提交于
      NFS needs to be able to release objects that are stored in the page
      cache once the page itself is no longer visible from the page cache.
      
      This patch adds a callback to the address space operations that allows
      filesystems to perform page cleanups once the page has been removed
      from the page cache.
      
      Original patch by: Linus Torvalds <torvalds@linux-foundation.org>
      [trondmy: cover the cases of invalidate_inode_pages2() and
                truncate_inode_pages()]
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      6072d13c
  8. 25 11月, 2010 6 次提交
    • D
      mm: remove call to find_vma in pagewalk for non-hugetlbfs · 5f0af70a
      David Sterba 提交于
      Commit d33b9f45 ("mm: hugetlb: fix hugepage memory leak in
      walk_page_range()") introduces a check if a vma is a hugetlbfs one and
      later in 5dc37642 ("mm hugetlb: add hugepage support to pagemap") it is
      moved under #ifdef CONFIG_HUGETLB_PAGE but a needless find_vma call is
      left behind and its result is not used anywhere else in the function.
      
      The side-effect of caching vma for @addr inside walk->mm is neither
      utilized in walk_page_range() nor in called functions.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f0af70a
    • K
      mm/page_alloc.c: fix build_all_zonelist() where percpu_alloc() is wrongly... · e9959f0f
      KAMEZAWA Hiroyuki 提交于
      mm/page_alloc.c: fix build_all_zonelist() where percpu_alloc() is wrongly called under stop_machine_run()
      
      During memory hotplug, build_allzonelists() may be called under
      stop_machine_run().  In this function, setup_zone_pageset() is called.
      But it's bug because it will do page allocation under stop_machine_run().
      
      Here is a report from Alok Kataria.
      
        BUG: sleeping function called from invalid context at kernel/mutex.c:94
        in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
        Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
        Call Trace:
         [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
         [<ffffffff81468245>] mutex_lock+0x24/0x50
         [<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
         [<ffffffff81048888>] ? load_balance+0xbe/0x60e
         [<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
         [<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
         [<ffffffff8110f237>] __alloc_percpu+0x10/0x12
         [<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
         [<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
         [<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
         [<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
         [<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
         [<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
         [<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
         [<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
         [<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
         [<ffffffff81065f29>] kthread+0x7f/0x87
         [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
         [<ffffffff81065eaa>] ? kthread+0x0/0x87
         [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
        Built 5 zonelists in Node order, mobility grouping on.  Total pages: 289456
        Policy zone: Normal
      
      This patch tries to fix the issue by moving setup_zone_pageset() out from
      stop_machine_run(). It's obviously not necessary to be called under
      stop_machine_run().
      
      [akpm@linux-foundation.org: remove unneeded local]
      Reported-by: NAlok Kataria <akataria@vmware.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Petr Vandrovec <petr@vmware.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9959f0f
    • M
      cgroups: make swap accounting default behavior configurable · a42c390c
      Michal Hocko 提交于
      Swap accounting can be configured by CONFIG_CGROUP_MEM_RES_CTLR_SWAP
      configuration option and then it is turned on by default.  There is a boot
      option (noswapaccount) which can disable this feature.
      
      This makes it hard for distributors to enable the configuration option as
      this feature leads to a bigger memory consumption and this is a no-go for
      general purpose distribution kernel.  On the other hand swap accounting
      may be very usuful for some workloads.
      
      This patch adds a new configuration option which controls the default
      behavior (CGROUP_MEM_RES_CTLR_SWAP_ENABLED).  If the option is selected
      then the feature is turned on by default.
      
      It also adds a new boot parameter swapaccount[=1|0] which enhances the
      original noswapaccount parameter semantic by means of enable/disable logic
      (defaults to 1 if no value is provided to be still consistent with
      noswapaccount).
      
      The default behavior is unchanged (if CONFIG_CGROUP_MEM_RES_CTLR_SWAP is
      enabled then CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED is enabled as well)
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a42c390c
    • D
      memcg: avoid deadlock between move charge and try_charge() · b1dd693e
      Daisuke Nishimura 提交于
      __mem_cgroup_try_charge() can be called under down_write(&mmap_sem)(e.g.
      mlock does it). This means it can cause deadlock if it races with move charge:
      
      Ex.1)
                      move charge             |        try charge
        --------------------------------------+------------------------------
          mem_cgroup_can_attach()             |  down_write(&mmap_sem)
            mc.moving_task = current          |    ..
            mem_cgroup_precharge_mc()         |  __mem_cgroup_try_charge()
              mem_cgroup_count_precharge()    |    prepare_to_wait()
                down_read(&mmap_sem)          |    if (mc.moving_task)
                -> cannot aquire the lock     |    -> true
                                              |      schedule()
      
      Ex.2)
                      move charge             |        try charge
        --------------------------------------+------------------------------
          mem_cgroup_can_attach()             |
            mc.moving_task = current          |
            mem_cgroup_precharge_mc()         |
              mem_cgroup_count_precharge()    |
                down_read(&mmap_sem)          |
                ..                            |
                up_read(&mmap_sem)            |
                                              |  down_write(&mmap_sem)
          mem_cgroup_move_task()              |    ..
            mem_cgroup_move_charge()          |  __mem_cgroup_try_charge()
              down_read(&mmap_sem)            |    prepare_to_wait()
              -> cannot aquire the lock       |    if (mc.moving_task)
                                              |    -> true
                                              |      schedule()
      
      To avoid this deadlock, we do all the move charge works (both can_attach() and
      attach()) under one mmap_sem section.
      And after this patch, we set/clear mc.moving_task outside mc.lock, because we
      use the lock only to check mc.from/to.
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1dd693e
    • K
      memcg: fix false positive VM_BUG on non-SMP · 112bc2e1
      Kirill A. Shutemov 提交于
      Fix this:
      
        kernel BUG at mm/memcontrol.c:2155!
        invalid opcode: 0000 [#1]
        last sysfs file:
      
        Pid: 18, comm: sh Not tainted 2.6.37-rc3 #3 /Bochs
        EIP: 0060:[<c10731b2>] EFLAGS: 00000246 CPU: 0
        EIP is at mem_cgroup_move_account+0xe2/0xf0
        EAX: 00000004 EBX: c6f931d4 ECX: c681c300 EDX: c681c000
        ESI: c681c300 EDI: ffffffea EBP: c681c000 ESP: c46f3e30
         DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
        Process sh (pid: 18, ti=c46f2000 task=c6826e60 task.ti=c46f2000)
        Stack:
         00000155 c681c000 0805f000 c46ee180 c46f3e5c c7058820 c1074d37 00000000
         08060000 c46db9a0 c46ec080 c7058820 0805f000 08060000 c46f3e98 c1074c50
         c106c75e c46f3e98 c46ec080 08060000 0805ffff c46db9a0 c46f3e98 c46e0340
        Call Trace:
         [<c1074d37>] ? mem_cgroup_move_charge_pte_range+0xe7/0x130
         [<c1074c50>] ? mem_cgroup_move_charge_pte_range+0x0/0x130
         [<c106c75e>] ? walk_page_range+0xee/0x1d0
         [<c10725d6>] ? mem_cgroup_move_task+0x66/0x90
         [<c1074c50>] ? mem_cgroup_move_charge_pte_range+0x0/0x130
         [<c1072570>] ? mem_cgroup_move_task+0x0/0x90
         [<c1042616>] ? cgroup_attach_task+0x136/0x200
         [<c1042878>] ? cgroup_tasks_write+0x48/0xc0
         [<c1041e9e>] ? cgroup_file_write+0xde/0x220
         [<c101398d>] ? do_page_fault+0x17d/0x3f0
         [<c108a79d>] ? alloc_fd+0x2d/0xd0
         [<c1041dc0>] ? cgroup_file_write+0x0/0x220
         [<c1077ba2>] ? vfs_write+0x92/0xc0
         [<c1077c81>] ? sys_write+0x41/0x70
         [<c1140e3d>] ? syscall_call+0x7/0xb
        Code: 03 00 74 09 8b 44 24 04 e8 1c f1 ff ff 89 73 04 8d 86 b0 00 00 00 b9 01 00 00 00 89 da 31 ff e8 65 f5 ff ff e9 4d ff ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 90 8d b4 26 00 00 00 00 83 ec 10 8b 0d f4 e3
        EIP: [<c10731b2>] mem_cgroup_move_account+0xe2/0xf0 SS:ESP 0068:c46f3e30
        ---[ end trace 7daa1582159b6532 ]---
      
      lock_page_cgroup and unlock_page_cgroup are implemented using
      bit_spinlock.  bit_spinlock doesn't touch the bit if we are on non-SMP
      machine, so we can't use the bit to check whether the lock was taken.
      
      Let's introduce is_page_cgroup_locked based on bit_spin_is_locked instead
      of PageCgroupLocked to fix it.
      
      [akpm@linux-foundation.org: s/is_page_cgroup_locked/page_is_cgroup_locked/]
      Signed-off-by: NKirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      112bc2e1
    • S
      nommu: yield CPU while disposing VM · 04c34961
      Steven J. Magnani 提交于
      Depending on processor speed, page size, and the amount of memory a
      process is allowed to amass, cleanup of a large VM may freeze the system
      for many seconds.  This can result in a watchdog timeout.
      
      Make sure other tasks receive some service when cleaning up large VMs.
      Signed-off-by: NSteven J. Magnani <steve@digidescorp.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04c34961
  9. 14 11月, 2010 1 次提交
  10. 12 11月, 2010 4 次提交
    • N
      radix-tree: fix RCU bug · 27d20fdd
      Nick Piggin 提交于
      Salman Qazi describes the following radix-tree bug:
      
      In the following case, we get can get a deadlock:
      
      0.  The radix tree contains two items, one has the index 0.
      1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
      2.  The reader acquires slot(s) for item(s) including the index 0 item.
      3.  The non-zero index item is deleted, and as a consequence the other item is
          moved to the root of the tree. The place where it used to be is queued for
          deletion after the readers finish.
      3b. The zero item is deleted, removing it from the direct slot, it remains in
          the rcu-delayed indirect node.
      4.  The reader looks at the index 0 slot, and finds that the page has 0 ref
          count
      5.  The reader looks at it again, hoping that the item will either be freed or
          the ref count will increase. This never happens, as the slot it is looking
          at will never be updated. Also, this slot can never be reclaimed because
          the reader is holding rcu_read_lock and is in an infinite loop.
      
      The fix is to re-use the same "indirect" pointer case that requires a slot
      lookup retry into a general "retry the lookup" bit.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      Reported-by: NSalman Qazi <sqazi@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27d20fdd
    • S
      vmscan: avoid setting zone congested if no page dirty · 1dce071e
      Shaohua Li 提交于
      nr_dirty and nr_congested are increased only when the page is dirty.  So
      if all pages are clean, both them will be zero.  In this case, we should
      not mark the zone congested.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Reviewed-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1dce071e
    • D
      mm/vfs: revalidate page->mapping in do_generic_file_read() · 8d056cb9
      Dave Hansen 提交于
      70 hours into some stress tests of a 2.6.32-based enterprise kernel, we
      ran into a NULL dereference in here:
      
      	int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
      	                                        unsigned long from)
      	{
      ---->		struct inode *inode = page->mapping->host;
      
      It looks like page->mapping was the culprit.  (xmon trace is below).
      After closer examination, I realized that do_generic_file_read() does a
      find_get_page(), and eventually locks the page before calling
      block_is_partially_uptodate().  However, it doesn't revalidate the
      page->mapping after the page is locked.  So, there's a small window
      between the find_get_page() and ->is_partially_uptodate() where the page
      could get truncated and page->mapping cleared.
      
      We _have_ a reference, so it can't get reclaimed, but it certainly
      can be truncated.
      
      I think the correct thing is to check page->mapping after the
      trylock_page(), and jump out if it got truncated.  This patch has been
      running in the test environment for a month or so now, and we have not
      seen this bug pop up again.
      
      xmon info:
      
        1f:mon> e
        cpu 0x1f: Vector: 300 (Data Access) at [c0000002ae36f770]
            pc: c0000000001e7a6c: .block_is_partially_uptodate+0xc/0x100
            lr: c000000000142944: .generic_file_aio_read+0x1e4/0x770
            sp: c0000002ae36f9f0
           msr: 8000000000009032
           dar: 0
         dsisr: 40000000
          current = 0xc000000378f99e30
          paca    = 0xc000000000f66300
            pid   = 21946, comm = bash
        1f:mon> r
        R00 = 0025c0500000006d   R16 = 0000000000000000
        R01 = c0000002ae36f9f0   R17 = c000000362cd3af0
        R02 = c000000000e8cd80   R18 = ffffffffffffffff
        R03 = c0000000031d0f88   R19 = 0000000000000001
        R04 = c0000002ae36fa68   R20 = c0000003bb97b8a0
        R05 = 0000000000000000   R21 = c0000002ae36fa68
        R06 = 0000000000000000   R22 = 0000000000000000
        R07 = 0000000000000001   R23 = c0000002ae36fbb0
        R08 = 0000000000000002   R24 = 0000000000000000
        R09 = 0000000000000000   R25 = c000000362cd3a80
        R10 = 0000000000000000   R26 = 0000000000000002
        R11 = c0000000001e7b60   R27 = 0000000000000000
        R12 = 0000000042000484   R28 = 0000000000000001
        R13 = c000000000f66300   R29 = c0000003bb97b9b8
        R14 = 0000000000000001   R30 = c000000000e28a08
        R15 = 000000000000ffff   R31 = c0000000031d0f88
        pc  = c0000000001e7a6c .block_is_partially_uptodate+0xc/0x100
        lr  = c000000000142944 .generic_file_aio_read+0x1e4/0x770
        msr = 8000000000009032   cr  = 22000488
        ctr = c0000000001e7a60   xer = 0000000020000000   trap =  300
        dar = 0000000000000000   dsisr = 40000000
        1f:mon> t
        [link register   ] c000000000142944 .generic_file_aio_read+0x1e4/0x770
        [c0000002ae36f9f0] c000000000142a14 .generic_file_aio_read+0x2b4/0x770 (unreliable)
        [c0000002ae36fb40] c0000000001b03e4 .do_sync_read+0xd4/0x160
        [c0000002ae36fce0] c0000000001b153c .vfs_read+0xec/0x1f0
        [c0000002ae36fd80] c0000000001b1768 .SyS_read+0x58/0xb0
        [c0000002ae36fe30] c00000000000852c syscall_exit+0x0/0x40
        --- Exception: c00 (System Call) at 00000080a840bc54
        SP (fffca15df30) is in userspace
        1f:mon> di c0000000001e7a6c
        c0000000001e7a6c  e9290000      ld      r9,0(r9)
        c0000000001e7a70  418200c0      beq     c0000000001e7b30        # .block_is_partially_uptodate+0xd0/0x100
        c0000000001e7a74  e9440008      ld      r10,8(r4)
        c0000000001e7a78  78a80020      clrldi  r8,r5,32
        c0000000001e7a7c  3c000001      lis     r0,1
        c0000000001e7a80  812900a8      lwz     r9,168(r9)
        c0000000001e7a84  39600001      li      r11,1
        c0000000001e7a88  7c080050      subf    r0,r8,r0
        c0000000001e7a8c  7f805040      cmplw   cr7,r0,r10
        c0000000001e7a90  7d6b4830      slw     r11,r11,r9
        c0000000001e7a94  796b0020      clrldi  r11,r11,32
        c0000000001e7a98  419d00a8      bgt     cr7,c0000000001e7b40    # .block_is_partially_uptodate+0xe0/0x100
        c0000000001e7a9c  7fa55840      cmpld   cr7,r5,r11
        c0000000001e7aa0  7d004214      add     r8,r0,r8
        c0000000001e7aa4  79080020      clrldi  r8,r8,32
        c0000000001e7aa8  419c0078      blt     cr7,c0000000001e7b20    # .block_is_partially_uptodate+0xc0/0x100
      Signed-off-by: NDave Hansen <dave@linux.vnet.ibm.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: <arunabal@in.ibm.com>
      Cc: <sbest@us.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d056cb9
    • D
      memcg: null dereference on allocation failure · d2e61b8d
      Dan Carpenter 提交于
      The original code had a null dereference if alloc_percpu() failed.  This
      was introduced in commit 711d3d2c ("memcg: cpu hotplug aware percpu
      count updates")
      Signed-off-by: NDan Carpenter <error27@gmail.com>
      Reviewed-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2e61b8d
  11. 10 11月, 2010 1 次提交
  12. 04 11月, 2010 1 次提交
    • W
      vmstat: fix offset calculation on void* · ff8b16d7
      Wu Fengguang 提交于
      Fix regression introduced by commit 79da826a ("writeback: report
      dirty thresholds in /proc/vmstat").
      
      The incorrect pointer arithmetic can result in problems like this:
      
        BUG: unable to handle kernel paging request at 07c06d16
        IP: [<c050c336>] strnlen+0x6/0x20
        Call Trace:
         [<c050a249>] ? string+0x39/0xe0
         [<c042be6b>] ? __wake_up_common+0x4b/0x80
         [<c050afcc>] ? vsnprintf+0x1ec/0x380
         [<c04b380e>] ? seq_printf+0x2e/0x60
         [<c04829a6>] ? vmstat_show+0x26/0x30
         [<c04b3bb6>] ? seq_read+0xa6/0x380
         [<c04b3b10>] ? seq_read+0x0/0x380
         [<c04d5d2f>] ? proc_reg_read+0x5f/0x90
         [<c049c4a1>] ? vfs_read+0xa1/0x140
         [<c04d5cd0>] ? proc_reg_read+0x0/0x90
         [<c049c981>] ? sys_read+0x41/0x70
         [<c0402bd0>] ? sysenter_do_call+0x12/0x26
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff8b16d7
  13. 03 11月, 2010 1 次提交
  14. 30 10月, 2010 1 次提交
    • A
      audit mmap · 120a795d
      Al Viro 提交于
      Normal syscall audit doesn't catch 5th argument of syscall.  It also
      doesn't catch the contents of userland structures pointed to be
      syscall argument, so for both old and new mmap(2) ABI it doesn't
      record the descriptor we are mapping.  For old one it also misses
      flags.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      120a795d
  15. 29 10月, 2010 2 次提交
  16. 28 10月, 2010 8 次提交