1. 12 5月, 2011 9 次提交
    • H
      tmpfs: fix spurious ENOSPC when racing with unswap · 59a16ead
      Hugh Dickins 提交于
      Testing the shmem_swaplist replacements for igrab() revealed another bug:
      writes to /dev/loop0 on a tmpfs file which fills its filesystem were
      sometimes failing with "Buffer I/O error"s.
      
      These came from ENOSPC failures of shmem_getpage(), when racing with
      swapoff: the same could happen when racing with another shmem_getpage(),
      pulling the page in from swap in between our find_lock_page() and our
      taking the info->lock (though not in the single-threaded loop case).
      
      This is unacceptable, and surprising that I've not noticed it before:
      it dates back many years, but (presumably) was made a lot easier to
      reproduce in 2.6.36, which sited a page preallocation in the race window.
      
      Fix it by rechecking the page cache before settling on an ENOSPC error.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      59a16ead
    • H
      tmpfs: fix race between umount and swapoff · 778dd893
      Hugh Dickins 提交于
      The use of igrab() in swapoff's shmem_unuse_inode() is just as vulnerable
      to umount as that in shmem_writepage().
      
      Fix this instance by extending the protection of shmem_swaplist_mutex
      right across shmem_unuse_inode(): while it's on the list, the inode cannot
      be evicted (and the filesystem cannot be unmounted) without
      shmem_evict_inode() taking that mutex to remove it from the list.
      
      But since shmem_writepage() might take that mutex, we should avoid making
      memory allocations or memcg charges while holding it: prepare them at the
      outer level in shmem_unuse().  When mem_cgroup_cache_charge() was
      originally placed, we didn't know until that point that the page from swap
      was actually a shmem page; but nowadays it's noted in the swap_map, so
      we're safe to charge upfront.  For the radix_tree, do as is done in
      shmem_getpage(): preload upfront, but don't pin to the cpu; so we make a
      habit of refreshing the node pool, but might dip into GFP_NOWAIT reserves
      on occasion if subsequently preempted.
      
      With the allocation and charge moved out from shmem_unuse_inode(),
      we can also hold index map and info->lock over from finding the entry.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      778dd893
    • H
      tmpfs: fix race between umount and writepage · b1dea800
      Hugh Dickins 提交于
      Konstanin Khlebnikov reports that a dangerous race between umount and
      shmem_writepage can be reproduced by this script:
      
        for i in {1..300} ; do
      	mkdir $i
      	while true ; do
      		mount -t tmpfs none $i
      		dd if=/dev/zero of=$i/test bs=1M count=$(($RANDOM % 100))
      		umount $i
      	done &
        done
      
      on a 6xCPU node with 8Gb RAM: kernel very unstable after this accident. =)
      
      Kernel log:
      
        VFS: Busy inodes after unmount of tmpfs.
                       Self-destruct in 5 seconds.  Have a nice day...
      
        WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
        list_del corruption. prev->next should be ffff880222fdaac8, but was (null)
        Pid: 11222, comm: mount.tmpfs Not tainted 2.6.39-rc2+ #4
        Call Trace:
         warn_slowpath_common+0x80/0x98
         warn_slowpath_fmt+0x41/0x43
         __list_del_entry+0x8d/0x98
         evict+0x50/0x113
         iput+0x138/0x141
        ...
        BUG: unable to handle kernel paging request at ffffffffffffffff
        IP: shmem_free_blocks+0x18/0x4c
        Pid: 10422, comm: dd Tainted: G        W   2.6.39-rc2+ #4
        Call Trace:
         shmem_recalc_inode+0x61/0x66
         shmem_writepage+0xba/0x1dc
         pageout+0x13c/0x24c
         shrink_page_list+0x28e/0x4be
         shrink_inactive_list+0x21f/0x382
        ...
      
      shmem_writepage() calls igrab() on the inode for the page which came from
      page reclaim, to add it later into shmem_swaplist for swapoff operation.
      
      This igrab() can race with super-block deactivating process:
      
        shrink_inactive_list()          deactivate_super()
        pageout()                       tmpfs_fs_type->kill_sb()
        shmem_writepage()               kill_litter_super()
                                        generic_shutdown_super()
                                         evict_inodes()
         igrab()
                                          atomic_read(&inode->i_count)
                                           skip-inode
         iput()
                                         if (!list_empty(&sb->s_inodes))
                                                printk("VFS: Busy inodes after...
      
      This igrap-iput pair was added in commit 1b1b32f2 "tmpfs: fix
      shmem_swaplist races" based on incorrect assumptions: igrab() protects the
      inode from concurrent eviction by deletion, but it does nothing to protect
      it from concurrent unmounting, which goes ahead despite the raised
      i_count.
      
      So this use of igrab() was wrong all along, but the race made much worse
      in 2.6.37 when commit 63997e98 "split invalidate_inodes()" replaced
      two attempts at invalidate_inodes() by a single evict_inodes().
      
      Konstantin posted a plausible patch, raising sb->s_active too: I'm unsure
      whether it was correct or not; but burnt once by igrab(), I am sure that
      we don't want to rely more deeply upon externals here.
      
      Fix it by adding the inode to shmem_swaplist earlier, while the page lock
      on page in page cache still secures the inode against eviction, without
      artifically raising i_count.  It was originally added later because
      shmem_unuse_inode() is liable to remove an inode from the list while it's
      unswapped; but we can guard against that by taking spinlock before
      dropping mutex.
      Reported-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Tested-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1dea800
    • A
      memcg: allocate memory cgroup structures in local nodes · 21a3c964
      Andi Kleen 提交于
      Commit dde79e00 ("page_cgroup: reduce allocation overhead for
      page_cgroup array for CONFIG_SPARSEMEM") added a regression that the
      memory cgroup data structures all end up in node 0 because the first
      attempt at allocating them would not pass in a node hint.  Since the
      initialization runs on CPU #0 it would all end up node 0.  This is a
      problem on large memory systems, where node 0 would lose a lot of
      memory.
      
      Change the alloc_pages_exact() to alloc_pages_exact_nid().  This will
      still fall back to other nodes if not enough memory is available.
      
       [ RED-PEN: right now it would fall back first before trying
         vmalloc_node.  Probably not the best strategy ...  But I left it like
         that for now. ]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Reported-by: Doug Nelson
      Cc: David Rientjes <rientjes@google.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21a3c964
    • A
      mm: add alloc_pages_exact_nid() · ee85c2e1
      Andi Kleen 提交于
      Add a alloc_pages_exact_nid() that allocates on a specific node.
      
      The naming is quite broken, but fixing that would need a larger renaming
      action.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee85c2e1
    • H
      MAINTAINERS: fix sorting · 71a6d0af
      Harry Wei 提交于
      Take alphabetical orders for MAINTAINERS file.
      Signed-off-by: NHarry Wei <harryxiyou@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71a6d0af
    • Y
      mm: use alloc_bootmem_node_nopanic() on really needed path · 8f389a99
      Yinghai Lu 提交于
      Stefan found nobootmem does not work on his system that has only 8M of
      RAM.  This causes an early panic:
      
        BIOS-provided physical RAM map:
         BIOS-88: 0000000000000000 - 000000000009f000 (usable)
         BIOS-88: 0000000000100000 - 0000000000840000 (usable)
        bootconsole [earlyser0] enabled
        Notice: NX (Execute Disable) protection missing in CPU or disabled in BIOS!
        DMI not present or invalid.
        last_pfn = 0x840 max_arch_pfn = 0x100000
        init_memory_mapping: 0000000000000000-0000000000840000
        8MB LOWMEM available.
          mapped low ram: 0 - 00840000
          low ram: 0 - 00840000
        Zone PFN ranges:
          DMA      0x00000001 -> 0x00001000
          Normal   empty
        Movable zone start PFN for each node
        early_node_map[2] active PFN ranges
            0: 0x00000001 -> 0x0000009f
            0: 0x00000100 -> 0x00000840
        BUG: Int 6: CR2 (null)
             EDI c034663c  ESI (null)  EBP c0329f38  ESP c0329ef4
             EBX c0346380  EDX 00000006  ECX ffffffff  EAX fffffff4
             err (null)  EIP c0353191   CS c0320060  flg 00010082
        Stack: (null) c030c533 000007cd (null) c030c533 00000001 (null) (null)
               00000003 0000083f 00000018 00000002 00000002 c0329f6c c03534d6 (null)
               (null) 00000100 00000840 (null) c0329f64 00000001 00001000 (null)
        Pid: 0, comm: swapper Not tainted 2.6.36 #5
        Call Trace:
         [<c02e3707>] ? 0xc02e3707
         [<c035e6e5>] 0xc035e6e5
         [<c0353191>] ? 0xc0353191
         [<c03534d6>] 0xc03534d6
         [<c034f1cd>] 0xc034f1cd
         [<c034a824>] 0xc034a824
         [<c03513cb>] ? 0xc03513cb
         [<c0349432>] 0xc0349432
         [<c0349066>] 0xc0349066
      
      It turns out that we should ignore the low limit of 16M.
      
      Use alloc_bootmem_node_nopanic() in this case.
      
      [akpm@linux-foundation.org: less mess]
      Signed-off-by: NYinghai LU <yinghai@kernel.org>
      Reported-by: NStefan Hellermann <stefan@the2masters.de>
      Tested-by: NStefan Hellermann <stefan@the2masters.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@kernel.org>		[2.6.34+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f389a99
    • M
      mm: check PageUnevictable in lru_deactivate_fn() · bad49d9c
      Minchan Kim 提交于
      The lru_deactivate_fn should not move page which in on unevictable lru
      into inactive list.  Otherwise, we can meet BUG when we use
      isolate_lru_pages as __isolate_lru_page could return -EINVAL.
      Reported-by: NYing Han <yinghan@google.com>
      Tested-by: NYing Han <yinghan@google.com>
      Signed-off-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Rik van Riel<riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bad49d9c
    • B
      drivers/rtc/rtc-s3c.c: fixup wake support for rtc · 52cd4e5c
      Ben Dooks 提交于
      The driver is not balancing set_irq and disable_irq_wake() calls, so
      ensure that it keeps track of whether the wake is enabled.
      
      The fixes the following error on S3C6410 devices:
      
        WARNING: at kernel/irq/manage.c:382 set_irq_wake+0x84/0xec()
        Unbalanced IRQ 92 wake disable
      Signed-off-by: NBen Dooks <ben-linux@fluff.org>
      Signed-off-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52cd4e5c
  2. 11 5月, 2011 31 次提交