1. 19 Oct 2010, 1 commit
    • lockdep: Add improved subclass caching · 62016250
      Authored by Hitoshi Mitake
      Currently, lockdep_map caches only the class with subclass == 0 and
      looks up the hash table of classes whenever subclass != 0.
      
      This is normally not a problem, because the subclass != 0 case is
      rare. But the struct rq locks are acquired with subclass == 1 when
      task migration is executed, and task migration is a high-frequency
      event, so I modified lockdep to cache subclasses as well.
      
      I measured perf bench sched messaging scores.
      This patch has a small but consistent effect (on the order of
      milliseconds to tens of milliseconds) when lots of tasks are running.
      The results are shown at the end of this description.
      
      NR_LOCKDEP_CACHING_CLASSES specifies how many classes can be
      cached in each lockdep_map instance.
      I discussed this approach with Peter Zijlstra at LinuxCon Japan, and
      he pointed out that caching every subclass (all 8) is clearly a waste
      of memory, so the number of cached classes should be configurable.
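      A rough sketch of the data-structure side (field and constant names as
      described above; the lookup helper is a paraphrase with a hypothetical
      name, not the verbatim kernel code):
      
      struct lock_class;              /* defined elsewhere in lockdep */
      struct lock_class_key;
      
      #define NR_LOCKDEP_CACHING_CLASSES      2       /* configurable cache depth */
      
      struct lockdep_map {
              struct lock_class_key   *key;
              /* one cached class per cached subclass, instead of a single slot */
              struct lock_class       *class_cache[NR_LOCKDEP_CACHING_CLASSES];
              const char              *name;
              /* ... other fields unchanged ... */
      };
      
      /* fast path: fall back to the class hash table only when the subclass
       * is outside the cached range or the slot has not been filled yet */
      static inline struct lock_class *
      cached_lock_class(struct lockdep_map *lock, unsigned int subclass)
      {
              if (subclass < NR_LOCKDEP_CACHING_CLASSES)
                      return lock->class_cache[subclass];     /* may be NULL */
              return (struct lock_class *)0;  /* caller does the hash lookup */
      }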
      
      === Score comparison of benchmarks ===
      # "min" means best score, and "max" means worst score
      
      for i in `seq 1 10`; do ./perf bench -f simple sched messaging; done
      
      before: min: 0.565000, max: 0.583000, avg: 0.572500
      after:  min: 0.559000, max: 0.568000, avg: 0.563300
      
      # with more processes
      for i in `seq 1 10`; do ./perf bench -f simple sched messaging -g 40; done
      
      before: min: 2.274000, max: 2.298000, avg: 2.286300
      after:  min: 2.242000, max: 2.270000, avg: 2.259700
      Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286269311-28336-2-git-send-email-mitake@dcl.info.waseda.ac.jp>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 16 Oct 2010, 1 commit
  3. 15 Oct 2010, 2 commits
    • Un-inline the core-dump helper functions · 3aa0ce82
      Authored by Linus Torvalds
      Tony Luck reports that the addition of the access_ok() check in commit
      0eead9ab ("Don't dump task struct in a.out core-dumps") broke the
      ia64 compile due to missing the necessary header file includes.
      
      Rather than add yet another include (<asm/unistd.h>) to make everything
      happy, just uninline the silly core dump helper functions and move the
      bodies to fs/exec.c where they make a lot more sense.
      
      dump_seek() in particular was too big to be an inline function anyway,
      and none of them are in any way performance-critical.  And we really
      don't need to mess up our include file headers more than they already
      are.
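      For reference, the uninlined helpers in fs/exec.c end up looking
      roughly like this (simplified sketch; the real dump_seek() also pads
      with zero-filled pages when the file cannot seek):
      
      /* fs/exec.c -- out of line, so the coredump headers stay clean */
      int dump_write(struct file *file, const void *addr, int nr)
      {
              return access_ok(VERIFY_READ, addr, nr) &&
                      file->f_op->write(file, addr, nr, &file->f_pos) == nr;
      }
      EXPORT_SYMBOL(dump_write);
      
      int dump_seek(struct file *file, loff_t off)
      {
              if (file->f_op->llseek && file->f_op->llseek != no_llseek)
                      return file->f_op->llseek(file, off, SEEK_CUR) >= 0;
      
              /* no usable llseek: emulate the seek by writing zeroes (elided) */
              return 1;
      }
      EXPORT_SYMBOL(dump_seek);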
      Reported-and-tested-by: Tony Luck <tony.luck@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Don't dump task struct in a.out core-dumps · 0eead9ab
      Authored by Linus Torvalds
      akiphie points out that a.out core-dumps have that odd task struct
      dumping that was never used and was never really a good idea (it goes
      back into the mists of history, probably the original core-dumping
      code).  Just remove it.
      
      Also do the access_ok() check on dump_write().  It probably doesn't
      matter (since normal filesystems all seem to do it anyway), but he
      points out that it's normally done by the VFS layer, so ...
      
      [ I suspect that we should possibly do "vfs_write()" instead of
        calling ->write directly.  That also does the whole fsnotify and write
        statistics thing, which may or may not be a good idea. ]
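      Purely for illustration, that vfs_write() variant might look like this
      (hypothetical helper name; this is not what the patch does):
      
      static int dump_write_via_vfs(struct file *file, const char __user *addr, int nr)
      {
              /* vfs_write() also does the fsnotify and write-statistics work */
              return access_ok(VERIFY_READ, addr, nr) &&
                      vfs_write(file, addr, nr, &file->f_pos) == nr;
      }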
      
      And just to be anal, do this all for the x86-64 32-bit a.out emulation
      code too, even though it's not enabled (and won't currently even
      compile).
      Reported-by: akiphie <akiphie@lavabit.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 12 Oct 2010, 1 commit
    • fanotify: disable fanotify syscalls · 7c534773
      Authored by Eric Paris
      This patch disables the fanotify syscalls by just not building them and
      letting the cond_syscall() statements in kernel/sys_ni.c redirect them
      to sys_ni_syscall().
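      For reference, the fallback stubs in kernel/sys_ni.c are roughly the
      following, so the unbuilt syscalls simply return -ENOSYS:
      
      cond_syscall(sys_fanotify_init);
      cond_syscall(sys_fanotify_mark);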
      
      It was pointed out by Tvrtko Ursulin that the fanotify interface did not
      include an explicit prioritization between groups.  This is necessary
      for fanotify to be usable for hierarchical storage management software,
      as they must get first access to the file, before inotify-like notifiers
      see the file.
      
      This feature can be added in an ABI-compatible way in the next release
      (by using a number of bits in the flags field to carry the info), but it
      was suggested by Alan that maybe we should just hold off and do it in
      the next cycle, likely with a (new) explicit argument to the syscall.
      I like this approach the least, as I know people are already starting
      to use the current interface, but Alan is all wise and no one on the
      list backed me up with just using what we have.  I feel this is
      needlessly ripping the rug out from under people at the last minute,
      but if others think it needs to be a new argument it might be the best
      way forward.
      
      Three choices:
      1. Go with what we got (and implement the new feature next cycle).
      2. Add a new field right now (and implement the new feature next
         cycle).
      3. Wait till next cycle to release the ABI (and implement the new
         feature next cycle).
      This is number 3.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 07 Oct 2010, 1 commit
    • elevator: fix oops on early call to elevator_change() · 430c62fb
      Authored by Jens Axboe
      2.6.36 introduces an API for drivers to switch the IO scheduler
      instead of manually calling the elevator exit and init functions.
      This API was added since q->elevator must be cleared in between
      those two calls. And since we already had this functionality
      available from the sysfs interface used to switch schedulers online,
      it was prudent to reuse it internally too.
      
      But this API needs the queue to be in a fully initialized state
      before it is called, or it will attempt to unregister elevator
      kobjects before they have been added. This results in an oops
      like this:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000051
      IP: [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
      PGD 47ddfc067 PUD 47c6a1067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP
      last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.1/irq
      CPU 2
      Modules linked in: t(+) loop hid_apple usbhid ahci ehci_hcd uhci_hcd libahci usbcore nls_base igb
      
      Pid: 7319, comm: modprobe Not tainted 2.6.36-rc6+ #132 QSSC-S4R/QSSC-S4R
      RIP: 0010:[<ffffffff8116f15e>]  [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
      RSP: 0018:ffff88027da25d08  EFLAGS: 00010246
      RAX: ffff88047c68c528 RBX: 00000000fffffffe RCX: 0000000000000000
      RDX: 000000000000002f RSI: 000000000000002f RDI: ffff88047e196c88
      RBP: ffff88027da25d38 R08: 0000000000000000 R09: d84156c5635688c0
      R10: d84156c5635688c0 R11: 0000000000000000 R12: ffff88047e196c88
      R13: 0000000000000000 R14: 0000000000000000 R15: ffff88047c68c528
      FS:  00007fcb0b26f6e0(0000) GS:ffff880287400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000051 CR3: 000000047e76e000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process modprobe (pid: 7319, threadinfo ffff88027da24000, task ffff88027d377090)
      Stack:
       ffff88027da25d58 ffff88047c68c528 00000000fffffffe ffff88047e196c88
      <0> ffff88047c68c528 ffff88047e05bd90 ffff88027da25d78 ffffffff8123fb77
      <0> ffff88047e05bd90 0000000000000000 ffff88047e196c88 ffff88047c68c528
      Call Trace:
       [<ffffffff8123fb77>] kobject_add_internal+0xe7/0x1f0
       [<ffffffff8123fd98>] kobject_add_varg+0x38/0x60
       [<ffffffff8123feb9>] kobject_add+0x69/0x90
       [<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0
       [<ffffffff8103d48d>] ? sub_preempt_count+0x9d/0xe0
       [<ffffffff8143de20>] ? _raw_spin_unlock+0x30/0x50
       [<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0
       [<ffffffff8116eff4>] ? sysfs_remove_dir+0x34/0xa0
       [<ffffffff81224204>] elv_register_queue+0x34/0xa0
       [<ffffffff81224aad>] elevator_change+0xfd/0x250
       [<ffffffffa007e000>] ? t_init+0x0/0x361 [t]
       [<ffffffffa007e000>] ? t_init+0x0/0x361 [t]
       [<ffffffffa007e0a8>] t_init+0xa8/0x361 [t]
       [<ffffffff810001de>] do_one_initcall+0x3e/0x170
       [<ffffffff8108c3fd>] sys_init_module+0xbd/0x220
       [<ffffffff81002f2b>] system_call_fastpath+0x16/0x1b
      Code: e5 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 85 ff 74 52 48 8b 47 18 49 c7 c5 00 46 61 81 48 85 c0 74 04 4c 8b 68 30 45 31 f6 <41> 80 7d 51 00 74 0e 49 8b 44 24 28 4c 89 e7 ff 50 20 49 89 c6
      RIP  [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
       RSP <ffff88027da25d08>
      CR2: 0000000000000051
      ---[ end trace a6541d3bf07945df ]---
      
      Fix this by adding a registered bit to the elevator queue, which is
      set when the sysfs kobjects have been registered.
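      A sketch of the shape of that guard (placement and surrounding code
      are from memory of the 2.6.36 block layer; treat anything beyond the
      "registered" bit itself as illustrative):
      
      struct elevator_queue {
              /* ... */
              unsigned int            registered:1;   /* sysfs kobjects added? */
      };
      
              /* elv_register_queue(): flag the queue only once kobject_add()
               * has succeeded */
              e->registered = 1;
      
              /* elevator switch path: only touch sysfs state that was set up */
              if (old_elevator->registered) {
                      elv_unregister_queue(q);
                      elv_register_queue(q);
              }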
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  6. 06 Oct 2010, 3 commits
    • drm/ttm: Fix two race conditions + fix busy codepaths · 1df6a2eb
      Authored by Thomas Hellstrom
      This fixes a race pointed out by Dave Airlie where we don't take a buffer
      object about to be destroyed off the LRU lists properly. It also fixes a rare
      case where a buffer object could be destroyed in the middle of an
      accelerated eviction.
      
      The patch also adds a utility function that can be used to prematurely
      release the GPU memory space usage of an object waiting to be
      destroyed, for example during eviction or swapout.
      
      The above-mentioned commit didn't queue the buffer on the delayed
      destroy list under some rare circumstances. It also didn't completely
      honor the remove_all parameter.
      
      Fixes:
      https://bugzilla.redhat.com/show_bug.cgi?id=615505
      http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=591061
      Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
    • wait: using uninitialized member of wait queue · 231d0aef
      Authored by Evgeny Kuznetsov
      The "flags" member of "struct wait_queue_t" is used in several places
      in the kernel code without being initialized by init_wait().  "flags"
      is used in bitwise operations.
      
      If "flags" is not initialized, unexpected behaviour may take place:
      incorrect flag values may be used later in the code.
      
      Add initialization of "wait_queue_t.flags" to zero in "init_wait".
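      After the change the macro in include/linux/wait.h reads roughly as
      follows; the final assignment is the new line:
      
      #define init_wait(wait)                                         \
              do {                                                    \
                      (wait)->private = current;                      \
                      (wait)->func = autoremove_wake_function;        \
                      INIT_LIST_HEAD(&(wait)->task_list);             \
                      (wait)->flags = 0;                              \
              } while (0)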
      Signed-off-by: Evgeny Kuznetsov <EXT-Eugeny.Kuznetsov@nokia.com>
      [ The bit we care about does end up being initialized by both
         prepare_to_wait() and add_to_wait_queue(), so this doesn't seem to
         cause actual bugs, but is definitely the right thing to do -Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • modules: Fix module_bug_list list corruption race · 5336377d
      Authored by Linus Torvalds
      With all the recent module loading cleanups, we've minimized the code
      that sits under module_mutex, fixing various deadlocks and making it
      possible to do most of the module loading in parallel.
      
      However, that whole conversion totally missed the rather obscure code
      that adds a new module to the list for BUG() handling.  That code was
      doubly obscure because (a) the code itself lives in lib/bug.c (for
      dubious reasons) and (b) it gets called from the architecture-specific
      "module_finalize()" rather than from generic code.
      
      Calling it from arch-specific code makes no sense whatsoever to begin
      with, and is now actively wrong since that code isn't protected by the
      module loading lock any more.
      
      So this commit moves the "module_bug_{finalize,cleanup}()" calls away
      from the arch-specific code, and into the generic code - and in the
      process protects it with the module_mutex so that the list operations
      are now safe.
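      Roughly, the generic code now does something like this (a sketch from
      memory of 2.6.36-era kernel/module.c; exact argument lists are
      assumptions):
      
              /* load_module(): publish the module and its BUG table under
               * the same mutex that protects the module list */
              mutex_lock(&module_mutex);
              module_bug_finalize(info.hdr, info.sechdrs, mod);
              list_add_rcu(&mod->list, &modules);
              mutex_unlock(&module_mutex);
      
              /* free_module(): and tear both down under the same lock */
              mutex_lock(&module_mutex);
              stop_machine(__unlink_module, mod, NULL);
              module_bug_cleanup(mod);
              mutex_unlock(&module_mutex);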
      
      Future fixups:
       - move the module list handling code into kernel/module.c where it
         belongs.
       - get rid of 'module_bug_list' and just use the regular list of modules
         (called 'modules' - imagine that) that we already create and maintain
         for other reasons.
      Reported-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Adrian Bunk <bunk@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 01 Oct 2010, 3 commits
  8. 30 Sep 2010, 1 commit
    • Bluetooth: Fix deadlock in the ERTM logic · e454c844
      Authored by Gustavo F. Padovan
      The Enhanced Retransmission Mode (ERTM) is a reliable mode of
      operation of the Bluetooth L2CAP layer. Think of it as a simplified
      version of TCP.
      The problem we were facing here was a deadlock. ERTM uses a backlog
      queue to queue incoming packets while the user is holding the lock. At
      some point sk_sndbuf can be exceeded and we can't allocate new skbs,
      so the code sleeps, still holding the lock, while waiting for memory.
      That stalls the ERTM connection, because we can't read the
      acknowledgement packets in the backlog queue to free memory and make
      the allocation of the outgoing skb successful.
      
      This patch actually affect all users of bt_skb_send_alloc(), i.e., all
      L2CAP modes and SCO.
      
      We are safe against socket state changes or channel deletion while we
      are sleeping waiting for memory. Checking sk->sk_err and
      sk->sk_shutdown makes the code safe, since any action that can leave
      the socket or the channel in an unusable state sets at least one of
      those struct members. We can then check both of them when taking the
      lock again and return the proper error if something unexpected
      happened.
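      The resulting pattern in the send paths is roughly the following
      (paraphrased; the real call sites are in the L2CAP and SCO sendmsg
      code):
      
              release_sock(sk);       /* don't sleep for memory with the lock held */
              skb = bt_skb_send_alloc(sk, len, msg->msg_flags & MSG_DONTWAIT, &err);
              lock_sock(sk);
      
              if (!skb)
                      return err;
      
              /* the socket/channel may have changed while we slept unlocked */
              if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN)) {
                      kfree_skb(skb);
                      return -EPIPE;  /* or the pending sk_err, per call site */
              }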
      Signed-off-by: Gustavo F. Padovan <padovan@profusion.mobi>
      Signed-off-by: Ulisses Furquim <ulisses@profusion.mobi>
  9. 29 Sep 2010, 1 commit
  10. 28 Sep 2010, 4 commits
  11. 27 Sep 2010, 3 commits
  12. 23 Sep 2010, 5 commits
  13. 22 Sep 2010, 1 commit
  14. 21 Sep 2010, 1 commit
    • xfrm: Allow different selector family in temporary state · 8444cf71
      Authored by Thomas Egerer
      The family parameter of xfrm_state_find is used to find a state
      matching a certain policy. This value is set to the template's family
      (encap_family) right before xfrm_state_find is called.
      The family parameter is, however, also used to construct a temporary
      state in xfrm_state_find itself, which is wrong for inter-family
      scenarios because it produces a selector for the wrong family. Since
      this selector is included in the xfrm_user_acquire structure, user
      space programs misinterpret IPv6 addresses as IPv4 and vice versa.
      This patch splits the original init_tempsel function into one part
      that initializes the selector and another that initializes the props
      and id of the temporary state, to allow for differing IP address
      families within the state.
      Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 18 Sep 2010, 1 commit
  16. 16 Sep 2010, 1 commit
  17. 15 Sep 2010, 1 commit
    • compat: Make compat_alloc_user_space() incorporate the access_ok() · c41d68a5
      Authored by H. Peter Anvin
      compat_alloc_user_space() expects the caller to independently call
      access_ok() to verify the returned area.  A missing call could
      introduce problems on some architectures.
      
      This patch incorporates the access_ok() check into
      compat_alloc_user_space() and also adds a sanity check on the length.
      The existing compat_alloc_user_space() implementations are renamed
      arch_compat_alloc_user_space() and are used as part of the
      implementation of the new global function.
      
      This patch assumes NULL will cause __get_user()/__put_user() to either
      fail or access userspace on all architectures.  This should be
      followed by checking the return value of compat_alloc_user_space()
      for NULL in the callers, at which time the access_ok() in the callers
      can also be removed.
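      The resulting global wrapper is essentially the following (a sketch;
      the additional length sanity check mentioned above is elided here):
      
      static inline void __user *compat_alloc_user_space(unsigned long len)
      {
              void __user *sp = arch_compat_alloc_user_space(len);
      
              if (unlikely(!access_ok(VERIFY_WRITE, sp, len)))
                      return NULL;
      
              return sp;
      }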
      Reported-by: Ben Hawkes <hawkes@sota.gen.nz>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Tony Luck <tony.luck@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: James Bottomley <jejb@parisc-linux.org>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: <stable@kernel.org>
  18. 14 Sep 2010, 1 commit
  19. 13 Sep 2010, 3 commits
  20. 10 Sep 2010, 5 commits
    • libata-sff: Reenable Port Multiplier after libata-sff remodeling. · ea3c6450
      Authored by Gwendal Grignou
      Keep track of the link on which the current request is in progress;
      this allows supporting links behind a port multiplier.
      
      Not all of libata-sff is PMP compliant. The code for native BMDMA
      controllers does not take PMP into account.
      
      Tested on Marvell 7042 and Sil7526.
      Signed-off-by: Gwendal Grignou <gwendal@google.com>
      Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
    • libata: skip EH autopsy and recovery during suspend · e2f3d75f
      Authored by Tejun Heo
      For some mysterious reason, certain hardware reacts badly to usual EH
      actions while the system is going for suspend.  As the devices won't
      be needed until the system is resumed, ask EH to skip usual autopsy
      and recovery and proceed directly to suspend.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Tested-by: Stephan Diestelhorst <stephan.diestelhorst@amd.com>
      Cc: stable@kernel.org
      Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
    • mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake · aa454840
      Authored by Christoph Lameter
      
      Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as
      it is cheaper than scanning a number of lists.  To avoid
      synchronization overhead, counter deltas are maintained on a per-cpu
      basis and drained both periodically and when the delta is above a
      threshold.  On large CPU systems, the difference between the estimated
      and real value of NR_FREE_PAGES can be very high.  If NR_FREE_PAGES is
      much higher than the number of really free pages in the buddy
      allocator, the VM can allocate pages below the min watermark, at worst
      reducing the real number of free pages to zero.  Even if the OOM
      killer kills some victim to free memory, it may not free memory if the
      exit path requires a new page, resulting in livelock.
      
      This patch introduces a zone_page_state_snapshot() function (courtesy of
      Christoph) that takes a slightly more accurate view of an arbitrary vmstat
      counter.  It is used to read NR_FREE_PAGES while kswapd is awake to avoid
      the watermark being accidentally broken.  The estimate is not perfect and
      may result in cache line bounces but is expected to be lighter than the
      IPI calls necessary to continually drain the per-cpu counters while kswapd
      is awake.
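      The helper amounts to folding the not-yet-drained per-cpu deltas back
      into the global counter; a sketch close to (but not necessarily
      verbatim) the version merged:
      
      static inline unsigned long zone_page_state_snapshot(struct zone *zone,
                                              enum zone_stat_item item)
      {
              long x = atomic_long_read(&zone->vm_stat[item]);
      
      #ifdef CONFIG_SMP
              int cpu;
      
              /* fold in the per-cpu deltas that have not been drained yet */
              for_each_online_cpu(cpu)
                      x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];
      
              if (x < 0)
                      x = 0;
      #endif
              return x;
      }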
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: discard while swapping only if SWAP_FLAG_DISCARD · 33994466
      Authored by Hugh Dickins
      Tests with recent firmware on Intel X25-M 80GB and OCZ Vertex 60GB
      SSDs show a shift since I last tested in December: in part because of
      firmware updates, in part because of the necessary move from barriers
      to awaiting completion at the block layer.  While discard at swapon
      still shows as slightly beneficial on both, discarding the 1MB swap
      cluster when allocating is now disadvantageous: it adds 25% overhead
      on Intel and 230% on OCZ (YMMV).
      
      Surrender: discard as presently implemented is more hindrance than
      help for swap, but might prove useful on other devices, or with
      improvements.  So continue to do the discard at swapon, but make
      discard while swapping conditional on passing SWAP_FLAG_DISCARD to
      sys_swapon() (which has been using only the lower 16 bits of its int
      flags).
      
      We can add a --discard or -d to swapon(8), and a "discard" to swap in
      /etc/fstab: matching the mount option for btrfs, ext4, fat, gfs2, nilfs2.
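      In code terms the gate is small; a sketch (the flag value and its
      exact placement inside sys_swapon() are from memory, treat them as
      illustrative):
      
      #define SWAP_FLAG_DISCARD       0x10000 /* discard swap cluster after use */
      
              /* sys_swapon(): still discard the whole area up front, but only
               * enable discard-while-swapping when userspace asked for it */
              if (p->bdev && blk_queue_discard(bdev_get_queue(p->bdev))) {
                      discard_swap(p);
                      if (swap_flags & SWAP_FLAG_DISCARD)
                              p->flags |= SWP_DISCARDABLE;
              }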
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Nigel Cunningham <nigel@tuxonice.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: revert special hibernation allocation · 910321ea
      Authored by Hugh Dickins
      Please revert 2.6.36-rc commit d2997b10
      "hibernation: freeze swap at hibernation".  It complicated matters by
      adding a second swap allocation path, just for hibernation; without in any
      way fixing the issue that it was intended to address - page reclaim after
      fixing the hibernation image might free swap from a page already imaged as
      swapcache, letting its swap be reallocated to store a different page of
      the image: resulting in data corruption if the imaged page were freed as
      clean then swapped back in.  Pages freed to si->swap_map were still in
      danger of being reallocated by the alternative allocation path.
      
      I guess it inadvertently fixed slow SSD swap allocation for hibernation,
      as reported by Nigel Cunningham: by missing out the discards that occur on
      the usual swap allocation path; but that was unintentional, and needs a
      separate fix.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Ondrej Zary <linux@rainbow-software.org>
      Cc: Andrea Gelmini <andrea.gelmini@gmail.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nigel Cunningham <nigel@tuxonice.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>