1. 12 Feb, 2011 11 commits
    • K
      memcg: fix leak of accounting at failure path of hugepage collapsing · 678ff896
      KAMEZAWA Hiroyuki committed
      mem_cgroup_uncharge_page() should be called in all failure cases after
      mem_cgroup_newpage_charge() is called in huge_memory.c::collapse_huge_page()
      
       [ 4209.076861] BUG: Bad page state in process khugepaged  pfn:1e9800
       [ 4209.077601] page:ffffea0006b14000 count:0 mapcount:0 mapping:          (null) index:0x2800
       [ 4209.078674] page flags: 0x40000000004000(head)
       [ 4209.079294] pc:ffff880214a30000 pc->flags:2146246697418756 pc->mem_cgroup:ffffc9000177a000
       [ 4209.082177] (/A)
       [ 4209.082500] Pid: 31, comm: khugepaged Not tainted 2.6.38-rc3-mm1 #1
       [ 4209.083412] Call Trace:
       [ 4209.083678]  [<ffffffff810f4454>] ? bad_page+0xe4/0x140
       [ 4209.084240]  [<ffffffff810f53e6>] ? free_pages_prepare+0xd6/0x120
       [ 4209.084837]  [<ffffffff8155621d>] ? rwsem_down_failed_common+0xbd/0x150
       [ 4209.085509]  [<ffffffff810f5462>] ? __free_pages_ok+0x32/0xe0
       [ 4209.086110]  [<ffffffff810f552b>] ? free_compound_page+0x1b/0x20
       [ 4209.086699]  [<ffffffff810fad6c>] ? __put_compound_page+0x1c/0x30
       [ 4209.087333]  [<ffffffff810fae1d>] ? put_compound_page+0x4d/0x200
       [ 4209.087935]  [<ffffffff810fb015>] ? put_page+0x45/0x50
       [ 4209.097361]  [<ffffffff8113f779>] ? khugepaged+0x9e9/0x1430
       [ 4209.098364]  [<ffffffff8107c870>] ? autoremove_wake_function+0x0/0x40
       [ 4209.099121]  [<ffffffff8113ed90>] ? khugepaged+0x0/0x1430
       [ 4209.099780]  [<ffffffff8107c236>] ? kthread+0x96/0xa0
       [ 4209.100452]  [<ffffffff8100dda4>] ? kernel_thread_helper+0x4/0x10
       [ 4209.101214]  [<ffffffff8107c1a0>] ? kthread+0x0/0xa0
       [ 4209.101842]  [<ffffffff8100dda0>] ? kernel_thread_helper+0x0/0x10
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      678ff896
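
The rule stated in the message above (every failure exit taken after the charge must undo it) can be sketched in user-space C. This is only an illustration under stated assumptions: `charge()`, `uncharge()`, `collapse()` and the `charged_pages` counter are hypothetical stand-ins, not the kernel's memcg functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical accounting counter standing in for a memcg charge. */
static int charged_pages;

static void charge(void)
{
	charged_pages++;
}

static void uncharge(void)
{
	assert(charged_pages > 0);
	charged_pages--;
}

/*
 * Hypothetical collapse routine with two steps that may fail after the
 * charge was taken. Every failure exit goes through the same label, so
 * the charge cannot leak.
 */
static bool collapse(bool step1_ok, bool step2_ok)
{
	charge();

	if (!step1_ok)
		goto out_uncharge;
	if (!step2_ok)
		goto out_uncharge;

	return true;	/* success: the charge stays with the new page */

out_uncharge:
	uncharge();
	return false;
}
```

Routing all failure exits through one label is a common way to keep such accounting balanced; whether the actual patch uses this exact shape is not stated in the message.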
    • J
      vmscan: fix zone shrinking exit when scan work is done · f0fdc5e8
      Johannes Weiner committed
      Commit 3e7d3449 ("mm: vmscan: reclaim order-0 and use compaction
      instead of lumpy reclaim") introduced an indefinite loop in
      shrink_zone().
      
      It meant to break out of this loop when no pages had been reclaimed and
      not a single page was even scanned.  The way it would detect the latter
      is by taking a snapshot of sc->nr_scanned at the beginning of the
      function and comparing it against the new sc->nr_scanned after the scan
      loop.  But it would re-iterate without updating that snapshot, looping
      forever if sc->nr_scanned changed at least once since shrink_zone() was
      invoked.
      
      This is not the sole condition that would exit that loop, but it
      requires other processes to change the zone state, as the reclaimer that
      is stuck obviously can not anymore.
      
      This is only happening for higher-order allocations, where reclaim is
      run back to back with compaction.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Kent Overstreet <kent.overstreet@gmail.com>
      Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0fdc5e8
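
The stale-snapshot problem described in the message above can be reproduced in a small user-space sketch. It assumes a simplified loop: `struct scan_control` here has only one field, and `scan_pass()` and `shrink_loop()` are hypothetical helpers, not the code in mm/vmscan.c. Refreshing the snapshot on each iteration is one way to avoid the problem; the message does not say how the real patch does it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the kernel's scan control. */
struct scan_control {
	unsigned long nr_scanned;
};

/* Hypothetical scan pass: scans 'work' pages and reclaims none. */
static unsigned long scan_pass(struct scan_control *sc, unsigned long work)
{
	sc->nr_scanned += work;
	return 0;
}

/*
 * Returns the number of passes made before the loop exits, or
 * max_passes if it would keep going. work[] lists how many pages each
 * pass scans.
 */
static int shrink_loop(const unsigned long *work, int max_passes,
		       bool refresh_snapshot)
{
	struct scan_control sc = { 0 };
	unsigned long snapshot = sc.nr_scanned;
	int pass;

	assert(max_passes > 0);

	for (pass = 0; pass < max_passes; pass++) {
		unsigned long reclaimed = scan_pass(&sc, work[pass]);

		/* Exit when nothing was reclaimed and nothing was scanned. */
		if (reclaimed == 0 && sc.nr_scanned == snapshot)
			return pass + 1;

		if (refresh_snapshot)
			snapshot = sc.nr_scanned;
	}
	return max_passes;
}
```

With `work = {4, 0, 0, 0, 0}`, the refreshed variant exits on the second pass, while the stale variant never sees `nr_scanned` equal to its snapshot again and runs until the pass limit.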
    • M
      mlock: do not munlock pages in __do_fault() · 419d8c96
      Michel Lespinasse committed
      If the page is going to be written to, __do_fault() needs to break COW.
      
      However, the old page (before breaking COW) was never mapped into
      the current pte (__do_fault is only called when the pte is not present),
      so vmscan can't have marked the old page as PageMlocked due to being
      mapped in __do_fault's VMA.  Therefore, __do_fault() does not need to
      worry about clearing PageMlocked() on the old page.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      419d8c96
    • M
      mlock: fix race when munlocking pages in do_wp_page() · e15f8c01
      Michel Lespinasse committed
      vmscan can lazily find pages that are mapped within VM_LOCKED vmas, and
      set the PageMlocked bit on these pages, transferring them onto the
      unevictable list.  When do_wp_page() breaks COW within a VM_LOCKED vma,
      it may need to clear PageMlocked on the old page and set it on the new
      page instead.
      
      This change fixes an issue where do_wp_page() was clearing PageMlocked
      on the old page while the pte was still pointing to it (as well as
      rmap).  Therefore, we were not protected against vmscan immediately
      transferring the old page back onto the unevictable list.  This could
      cause pages to get stranded there forever.
      
      I propose to move the corresponding code to the end of do_wp_page(),
      after the pte (and rmap) have been pointed to the new page.
      Additionally, we can use munlock_vma_page() instead of
      clear_page_mlock(), so that the old page stays mlocked if there are
      still other VM_LOCKED vmas mapping it.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e15f8c01
    • Y
      memblock: don't adjust size in memblock_find_base() · e6d2e2b2
      Yinghai Lu committed
      While applying a patch that uses memblock to find the aperture on 64-bit
      x86, Ingo found that a system with 1g + force_iommu reports:
      
      > No AGP bridge found
      > Node 0: aperture @ 38000000 size 32 MB
      > Aperture pointing to e820 RAM. Ignoring.
      > Your BIOS doesn't leave a aperture memory hole
      > Please enable the IOMMU option in the BIOS setup
      > This costs you 64 MB of RAM
      > Cannot allocate aperture memory hole (0,65536K)
      
      the corresponding code:
      
      	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
      	if (addr == MEMBLOCK_ERROR || addr + aper_size > 0xffffffff) {
      		printk(KERN_ERR
      			"Cannot allocate aperture memory hole (%lx,%uK)\n",
      				addr, aper_size>>10);
      		return 0;
      	}
      	memblock_x86_reserve_range(addr, addr + aper_size, "aperture64");
      
      fails because the memblock core code aligns the size to 512M.  That can
      make the size far too big.
      
      So don't align the size in that case.
      
      Actually, __memblock_alloc_base(), the other caller, already aligns the
      size before calling that function.
      
      BTW, x86 does not use __memblock_alloc_base()...
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Miller <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Airlie <airlied@linux.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6d2e2b2
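
The arithmetic behind the failure described above can be worked through in a short sketch. It assumes the numbers from the log (a 64 MB aperture requested with 512 MB alignment); `round_up_pow2()` and `needed_size()` are hypothetical helpers, not memblock functions.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a multiple of align, where align is a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * Size of the free range a search has to find, with or without
 * rounding the requested size up to the alignment.
 */
static uint64_t needed_size(uint64_t size, uint64_t align, int align_size)
{
	return align_size ? round_up_pow2(size, align) : size;
}
```

With `size = 64 << 20` and `align = 512 << 20`, rounding the size inflates the request from 64 MB to 512 MB, eight times what the caller asked for. Only the start address needs the 512 MB alignment.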
    • S
      nbd: remove module-level ioctl mutex · de1f016f
      Soren Hansen committed
      Commit 2a48fc0a ("block: autoconvert trivial BKL users to private
      mutex") replaced uses of the BKL in the nbd driver with mutex
      operations.  Since then, I've been seeing these lock ups:
      
       INFO: task qemu-nbd:16115 blocked for more than 120 seconds.
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       qemu-nbd      D 0000000000000001     0 16115  16114 0x00000004
        ffff88007d775d98 0000000000000082 ffff88007d775fd8 ffff88007d774000
        0000000000013a80 ffff8800020347e0 ffff88007d775fd8 0000000000013a80
        ffff880133730000 ffff880002034440 ffffea0004333db8 ffffffffa071c020
       Call Trace:
        [<ffffffff815b9997>] __mutex_lock_slowpath+0xf7/0x180
        [<ffffffff815b93eb>] mutex_lock+0x2b/0x50
        [<ffffffffa071a21c>] nbd_ioctl+0x6c/0x1c0 [nbd]
        [<ffffffff812cb970>] blkdev_ioctl+0x230/0x730
        [<ffffffff811967a1>] block_ioctl+0x41/0x50
        [<ffffffff81175c03>] do_vfs_ioctl+0x93/0x370
        [<ffffffff81175f61>] sys_ioctl+0x81/0xa0
        [<ffffffff8100c0c2>] system_call_fastpath+0x16/0x1b
      
      Instrumenting the nbd module's ioctl handler with some extra logging
      clearly shows the NBD_DO_IT ioctl being invoked which is a long-lived
      ioctl in the sense that it doesn't return until another ioctl asks the
      driver to disconnect.  However, that other ioctl blocks, waiting for the
      module-level mutex that replaced the BKL, and then we're stuck.
      
      This patch removes the module-level mutex altogether.  It's clearly
      wrong, and as far as I can see, it's entirely unnecessary, since the nbd
      driver maintains per-device mutexes, and I don't see anything that would
      require a module-level (or kernel-level, for that matter) mutex.
      Signed-off-by: Soren Hansen <soren@linux2go.dk>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Acked-by: Paul Clements <paul.clements@steeleye.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@kernel.org>		[2.6.37.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de1f016f
    • A
      drivers/rtc/rtc-proc.c: add module_put on error path in rtc_proc_open() · 24a6f5b8
      Alexander Strakh committed
      In drivers/rtc/rtc-proc.c, single_open() can fail because the seq_open()
      it calls can return -ENOMEM.
      
       86        if (!try_module_get(THIS_MODULE))
       87                return -ENODEV;
       88
       89        return single_open(file, rtc_proc_show, rtc);
      
      In this case, module_put(THIS_MODULE) must be called before
      rtc_proc_open() exits (line 89).
      
      Found by Linux Device Drivers Verification Project
      Signed-off-by: Alexander Strakh <strakh@ispras.ru>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      24a6f5b8
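
The error path described above can be sketched as follows. This is a user-space illustration: `module_refs`, `get_module()`, `put_module()`, `open_helper()` and `proc_open()` are hypothetical stand-ins for the module reference count, try_module_get(), module_put(), single_open() and rtc_proc_open().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical module reference count standing in for THIS_MODULE. */
static int module_refs;

static int get_module(void)
{
	module_refs++;
	return 1;
}

static void put_module(void)
{
	assert(module_refs > 0);
	module_refs--;
}

/* Hypothetical open helper that fails with -ENOMEM on request. */
static int open_helper(int fail)
{
	return fail ? -ENOMEM : 0;
}

/* Takes a module reference and drops it again if the open fails. */
static int proc_open(int fail)
{
	int ret;

	if (!get_module())
		return -ENODEV;

	ret = open_helper(fail);
	if (ret)
		put_module();	/* error path: do not leak the reference */

	return ret;
}
```

After a failed `proc_open(1)` the reference count is back at its previous value, which is the property the patch title asks for.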
    • R
      drivers/gpio/pca953x.c: add a mutex to fix race condition · 6e20fb18
      Roland Stigge committed
      Add a mutex to register communication and handling.  Without the mutex,
      GPIOs didn't switch as expected when toggled in a fast sequence of
      status changes of multiple outputs.
      Signed-off-by: Roland Stigge <stigge@antcom.de>
      Acked-by: Eric Miao <eric.y.miao@gmail.com>
      Cc: Grant Likely <grant.likely@secretlab.ca>
      Cc: Marc Zyngier <maz@misterjones.org>
      Cc: Ben Gardner <bgardner@wabtec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e20fb18
    • T
      ptrace: use safer wake up on ptrace_detach() · 01e05e9a
      Tejun Heo committed
      The wake_up_process() call in ptrace_detach() is spurious and not
      interlocked with the tracee state.  IOW, the tracee could be running or
      sleeping in any place in the kernel by the time wake_up_process() is
      called.  This can lead to the tracee waking up unexpectedly which can be
      dangerous.
      
      The wake_up is spurious and should be removed but for now reduce its
      toxicity by only waking up if the tracee is in TRACED or STOPPED state.
      
      This bug can possibly be used as an attack vector.  I don't think it
      will take too much effort to come up with an attack which triggers oops
      somewhere.  Most sleeps are wrapped in condition test loops and should
      be safe but we have quite a number of places where sleep and wakeup
      conditions are expected to be interlocked.  Although the window of
      opportunity is tiny, ptrace can be used by non-privileged users and with
      some loading the window can definitely be extended and exploited.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Roland McGrath <roland@redhat.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01e05e9a
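
The mitigation described above (wake the tracee only if it is in TRACED or STOPPED state) can be sketched in user-space C. The `enum task_state`, `struct task` and `wake_up_on_detach()` names are hypothetical; the kernel tracks task state differently, so this only illustrates the state check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified task states for illustration. */
enum task_state { RUNNING, SLEEPING, STOPPED, TRACED };

struct task {
	enum task_state state;
};

/*
 * Wake the task only if it is stopped or traced. A task sleeping
 * elsewhere for its own reasons is left alone.
 */
static bool wake_up_on_detach(struct task *t)
{
	assert(t != NULL);

	if (t->state != STOPPED && t->state != TRACED)
		return false;

	t->state = RUNNING;
	return true;
}
```

An unconditional wake-up would also set a `SLEEPING` task running, which is the unexpected wake-up the message warns about.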
    • B
      vfs: call rcu_barrier after ->kill_sb() · d863b50a
      Boaz Harrosh committed
      In commit fa0d7e3d ("fs: icache RCU free inodes"), we use rcu free
      inode instead of freeing the inode directly.  It causes a crash when we
      rmmod immediately after we umount the volume[1].
      
      So we need to call rcu_barrier after we kill_sb so that the inode is
      freed before we do rmmod.  The idea is inspired by Aneesh Kumar.
      rcu_barrier will wait for all callbacks to end before proceeding.  The
      original patch was done by Tao Ma, but synchronize_rcu() is not enough
      here.
      
      1. http://marc.info/?l=linux-fsdevel&m=129680863330185&w=2
      Tested-by: Tao Ma <boyu.mt@taobao.com>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d863b50a
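
Why the deferred frees must finish before the module goes away can be shown with a toy model of a callback queue. Everything here is hypothetical (`defer()`, `run_pending_barrier()`, `unload_module()` and the counters); it models only the ordering constraint, not RCU itself.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLBACKS 8

typedef void (*callback_fn)(void);

static callback_fn pending[MAX_CALLBACKS];
static size_t nr_pending;
static int module_loaded = 1;
static int ran_after_unload;
static int inodes_freed;

/* Deferred free; it needs module code, so it must run before unload. */
static void free_inode_cb(void)
{
	if (!module_loaded)
		ran_after_unload++;	/* in the kernel this would crash */
	inodes_freed++;
}

/* Queue a callback for later, like a deferred free. */
static void defer(callback_fn fn)
{
	assert(nr_pending < MAX_CALLBACKS);
	pending[nr_pending++] = fn;
}

/* Barrier: run every callback queued so far before returning. */
static void run_pending_barrier(void)
{
	size_t i;

	for (i = 0; i < nr_pending; i++)
		pending[i]();
	nr_pending = 0;
}

static void unload_module(void)
{
	module_loaded = 0;
}
```

Calling `run_pending_barrier()` before `unload_module()` leaves `ran_after_unload` at zero. Unloading first lets the queued callback run against a module that is already gone.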
    • L
      Fix possible filp_cachep memory corruption · 2dab5974
      Linus Torvalds committed
      In commit 31e6b01f ("fs: rcu-walk for path lookup") we started doing
      path lookup using RCU, which then falls back to a careful non-RCU lookup
      in case of problems (LOOKUP_REVAL).  So do_filp_open() has this "re-do
      the lookup carefully" looping case.
      
      However, that means that we must not release the open-intent file data
      if we are going to loop around and use it once more!
      
      Fix this by moving the release of the open-intent data to the function
      that allocates it (do_filp_open() itself) rather than the helper
      functions that can get called multiple times (finish_open() and
      do_last()).  This makes the logic for the lifetime of that field much
      more obvious, and avoids the possible double free.
      Reported-by: J. R. Okajima <hooanon05@yahoo.co.jp>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2dab5974
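
The ownership change described above (release the data in the function that allocates it, not in helpers that may run more than once) can be sketched like this. `struct intent`, `try_open()` and `do_open()` are hypothetical names, and the release counter exists only to make the double release visible.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical open-intent data with a release counter. */
struct intent {
	int releases;
};

static void release_intent(struct intent *it)
{
	it->releases++;
}

/* Hypothetical helper; may be called once per lookup attempt. */
static bool try_open(struct intent *it, bool succeed, bool helper_releases)
{
	if (helper_releases)
		release_intent(it);	/* old placement: runs on every attempt */
	return succeed;
}

/*
 * Allocating function: tries a fast lookup, then retries carefully.
 * Returns how many times the intent data was released.
 */
static int do_open(bool fast_ok, bool helper_releases)
{
	struct intent it = { 0 };
	bool ok = try_open(&it, fast_ok, helper_releases);

	if (!ok)
		ok = try_open(&it, true, helper_releases);	/* careful retry */

	if (!helper_releases)
		release_intent(&it);	/* new placement: once, by the owner */

	assert(ok);
	return it.releases;
}
```

When the fast lookup fails and the helper releases, the count reaches two, which corresponds to the double free mentioned in the message. With the owner releasing, the count is one on every path.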
  2. 11 Feb, 2011 8 commits
  3. 10 Feb, 2011 13 commits
  4. 09 Feb, 2011 8 commits
    • S
      cdrom: support devices that have check_events but not media_changed · b8cf0e0e
      Simon Arlott committed
      Commit 93aae17a ("sr: implement sr_check_events()") replaced the
      media_changed op with the check_events op in drivers/scsi/sr.c.
      
      All users that check for the CDC_MEDIA_CHANGED capability try both
      the check_events op and the media_changed op, but register_cdrom()
      was requiring media_changed.
      
      This patch fixes the capability checking.
      
      The cdrom_select_disc ioctl is also using the two operations, so
      they should be required for CDC_SELECT_DISC too.
      Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Tested-by: Chris Clayton <chris2553@googlemail.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      b8cf0e0e
    • J
      cfq-iosched: Don't wait if queue already has requests. · 02a8f01b
      Justin TerAvest committed
      Commit 7667aa06 added logic to wait for
      the last queue of the group to become busy (have at least one request),
      so that the group does not lose out for not being continuously
      backlogged. The commit did not check for the condition that the last
      queue already has some requests. As a result, if the queue already has
      requests, wait_busy is set. Later on, cfq_select_queue() checks the
      flag, and decides that since the queue has a request now and wait_busy
      is set, the queue is expired.  This results in early expiration of the
      queue.
      
      This patch fixes the problem by adding a check to see if queue already
      has requests. If it does, wait_busy is not set. As a result, time slices
      do not expire early.
      
      The queues with more than one request are usually buffered writers.
      Testing shows improvement in isolation between buffered writers.
      
      Cc: stable@kernel.org
      Signed-off-by: Justin TerAvest <teravest@google.com>
      Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      02a8f01b
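
The interaction between the wait_busy flag and the later expiry check can be sketched in a few lines. `struct queue`, `mark_wait_busy()` and `expires_early()` are hypothetical reductions of the logic the message describes, not the cfq-iosched code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced view of a queue. */
struct queue {
	int nr_requests;
	bool wait_busy;
};

/*
 * Decide whether to wait for the queue to become busy. With
 * check_requests set, a queue that already has requests is skipped.
 */
static void mark_wait_busy(struct queue *q, bool check_requests)
{
	assert(q->nr_requests >= 0);

	if (check_requests && q->nr_requests > 0)
		return;
	q->wait_busy = true;
}

/* A queue that has requests while wait_busy is set gets expired. */
static bool expires_early(const struct queue *q)
{
	return q->wait_busy && q->nr_requests > 0;
}
```

Without the check, a queue holding three requests is flagged and then expired at once. With the check, the flag is set only for an empty queue, where waiting for the first request is the intended behaviour.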
    • P
      netfilter: nf_conntrack: set conntrack templates again if we return NF_REPEAT · c3174286
      Pablo Neira Ayuso committed
      The TCP tracking code has a special case that allows it to return
      NF_REPEAT if we receive a new SYN packet while in TIME_WAIT state.
      
      In this situation, the TCP tracking code destroys the existing
      conntrack to start a new clean session.
      
      [DESTROY] tcp      6 src=192.168.0.2 dst=192.168.1.2 sport=38925 dport=8000 src=192.168.1.2 dst=192.168.1.100 sport=8000 dport=38925 [ASSURED]
          [NEW] tcp      6 120 SYN_SENT src=192.168.0.2 dst=192.168.1.2 sport=38925 dport=8000 [UNREPLIED] src=192.168.1.2 dst=192.168.1.100 sport=8000 dport=38925
      
      However, this is a problem for the iptables' CT target event filtering
      which will not work in this case since the conntrack template will not
      be there for the new session. To fix this, we reassign the conntrack
      template to the packet if we return NF_REPEAT.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      c3174286
    • T
      pch_can: fix module reload issue with MSI · c69b9092
      Tomoya committed
      Currently, when pch_can is reloaded, it can no longer catch
      interrupts.
      
      The cause is that bus mastering is not enabled in pch_can.
      Thus, add code to enable bus mastering.
      Signed-off-by: Tomoya MORINAGA <tomoya-linux@dsn.okisemi.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c69b9092
    • T
      pch_can: fix rmmod issue · ce9736d4
      Tomoya committed
      Currently, a kernel failure occurs when pch_can is removed with rmmod.
      The cause is that pci_iounmap is executed before pch_can_reset.
      Thus, move pci_iounmap after pch_can_reset.
      Signed-off-by: Tomoya MORINAGA <tomoya-linux@dsn.okisemi.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ce9736d4
    • T
      pch_can: fix 800k comms issue · eab743ed
      Tomoya committed
      Currently, 800k comms fails because prop_seg is set to zero.
      (The prop_seg field of the EG20T PCH CAN register must be set to more than 1.)
      To prevent prop_seg from being set to zero, change tseg2_min from 1 to 2.
      Signed-off-by: Tomoya MORINAGA <tomoya-linux@dsn.okisemi.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eab743ed
    • D
      net: Fix lockdep regression caused by initializing netdev queues too early. · 8d3bdbd5
      David S. Miller committed
      In commit aa942104 ("net: init ingress
      queue") we moved the allocation and lock initialization of the queues
      into alloc_netdev_mq() since register_netdevice() is way too late.
      
      The problem is that dev->type is not setup until the setup()
      callback is invoked by alloc_netdev_mq(), and the dev->type is
      what determines the lockdep class to use for the locks in the
      queues.
      
      Fix this by doing the queue allocation after the setup() callback
      runs.
      
      This is safe because the setup() callback is not allowed to make any
      state changes that need to be undone on error (memory allocations,
      etc.).  It may, however, make state changes that are undone by
      free_netdev() (such as netif_napi_add(), which is done by the
      ipoib driver's setup routine).
      
      The previous code also leaked a reference to the &init_net namespace
      object on RX/TX queue allocation failures.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8d3bdbd5
    • D
      net/caif: Fix dangling list pointer in freed object on error. · b2df5a84
      David S. Miller committed
      rtnl_link_ops->setup(), and the "setup" callback passed to alloc_netdev*(),
      cannot make state changes which need to be undone on failure.  There is
      no cleanup mechanism available at this point.
      
      So we have to add the caif private instance to the global list once we
      are sure that register_netdev() has succeeded in ->newlink().
      
      Otherwise, if register_netdev() fails, the caller will invoke free_netdev()
      and we will have a reference to freed up memory on the chnl_net_list.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b2df5a84
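
The ordering rule in the message above (publish the object on the global list only after registration has succeeded) can be sketched in user-space C. `struct dev`, `dev_list`, `register_dev()` and `new_link()` are hypothetical stand-ins for the caif instance, chnl_net_list, register_netdev() and ->newlink().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct dev {
	struct dev *next;
};

/* Hypothetical global list of live devices. */
static struct dev *dev_list;

static void list_add_dev(struct dev *d)
{
	assert(d != NULL);
	d->next = dev_list;
	dev_list = d;
}

/* Hypothetical registration step that fails on request. */
static int register_dev(struct dev *d, int fail)
{
	(void)d;
	return fail ? -1 : 0;
}

/* Creates a device and lists it only after registration succeeds. */
static int new_link(int fail)
{
	struct dev *d = calloc(1, sizeof(*d));
	int ret;

	if (!d)
		return -1;

	ret = register_dev(d, fail);
	if (ret) {
		free(d);	/* never listed, so no dangling pointer remains */
		return ret;
	}

	list_add_dev(d);
	return 0;
}
```

Had the device been added to the list before `register_dev()`, the failure path would free memory that `dev_list` still points to.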