1. 11 12月, 2006 40 次提交
    • R
      [MIPS] Discard .exit.text at linktime. · 86384d54
      Ralf Baechle 提交于
      This fixes fairly unobvious breakage of various drivers.
      Signed-off-by: NRalf Baechle <ralf@linux-mips.org>
      86384d54
    • R
    • H
      [CRYPTO] dm-crypt: Select CRYPTO_CBC · 3263263f
      Herbert Xu 提交于
      As CBC is the default chaining method for cryptoloop, we should select
      it from cryptoloop to ease the transition.  Spotted by Rene Herman.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3263263f
    • C
      [PATCH] add MODULE_* attributes to bit reversal library · 0258736a
      Cal Peake 提交于
      Add MODULE_* attributes to the new bit reversal library. Most notably
      MODULE_LICENSE which prevents superfluous kernel tainting.
      Signed-off-by: NCal Peake <cp@absolutedigital.net>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      0258736a
    • L
      Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6 · edb16bec
      Linus Torvalds 提交于
      * master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
        [SPARC64]: Fix several kprobes bugs.
        [SPARC64]: Update defconfig.
        [SPARC64]: dma remove extra brackets
        [SPARC{32,64}]: Propagate ptrace_traceme() return value.
        [SPARC64]: Replace kmalloc+memset with kzalloc
        [SPARC]: Check kzalloc() return value in SUN4D irq/iommu init.
        [SPARC]: Replace kmalloc+memset with kzalloc
        [SPARC64]: Run ctrl-alt-del action for sun4v powerdown request.
        [SPARC64]: Unaligned accesses to userspace are hard errors.
        [SPARC64]: Call do_mathemu on illegal instruction traps too.
        [SPARC64]: Update defconfig.
        [SPARC64]: Add irqtrace/stacktrace/lockdep support.
      edb16bec
    • L
      Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/mchehab/v4l-dvb · bb7320d1
      Linus Torvalds 提交于
      * 'master' of master.kernel.org:/pub/scm/linux/kernel/git/mchehab/v4l-dvb: (132 commits)
        V4L/DVB 4949b: Fix container_of pointer retreival
        V4L/DVB (4949a): Fix INIT_WORK
        V4L/DVB (4949): Cxusb: codingstyle cleanups
        V4L/DVB (4948): Cxusb: Convert tuner functions to use dvb_pll_attach
        V4L/DVB (4947): Cx88: trivial cleanups
        V4L/DVB (4946): Cx88: Move cx88_dvb_bus_ctrl out of the card-specific area
        V4L/DVB (4945): Cx88: consolidate cx22702_config structs
        V4L/DVB (4944): Cx88: Convert DViCO FusionHDTV Hybrid to use dvb_pll_attach
        V4L/DVB (4943): Cx88: cleanup dvb_pll_attach for lgdt3302 tuners
        V4L/DVB (4953): Usbvision minor fixes
        V4L/DVB (4951): Add version.h, since it is required for VIDIOC_QUERYCAP
        V4L/DVB (4940): Or51211: Changed SNR and signal strength calculations
        V4L/DVB (4939): Or51132: Changed SNR and signal strength reporting
        V4L/DVB (4938): Cx88: Convert lgdt3302 tuning function to use dvb_pll_attach
        V4L/DVB (4941): Remove LINUX_VERSION_CODE and fix identations
        V4L/DVB (4942): Whitespace cleanups
        V4L/DVB (4937): Usbvision cleanup and code reorganization
        V4L/DVB (4936): Make MT4049FM5 tuner to set FM Gain to Normal
        V4L/DVB (4935): Added the capability of selecting fm gain by tuner
        V4L/DVB (4934): Usbvision radio requires GainNormal at e register
        ...
      bb7320d1
    • A
      [PATCH] kvm: userspace interface · 6aa8b732
      Avi Kivity 提交于
      web site: http://kvm.sourceforge.net
      
      mailing list: kvm-devel@lists.sourceforge.net
        (http://lists.sourceforge.net/lists/listinfo/kvm-devel)
      
      The following patchset adds a driver for Intel's hardware virtualization
      extensions to the x86 architecture.  The driver adds a character device
      (/dev/kvm) that exposes the virtualization capabilities to userspace.  Using
      this driver, a process can run a virtual machine (a "guest") in a fully
      virtualized PC containing its own virtual hard disks, network adapters, and
      display.
      
      Using this driver, one can start multiple virtual machines on a host.
      
      Each virtual machine is a process on the host; a virtual cpu is a thread in
      that process.  kill(1), nice(1), top(1) work as expected.  In effect, the
      driver adds a third execution mode to the existing two: we now have kernel
      mode, user mode, and guest mode.  Guest mode has its own address space mapping
      guest physical memory (which is accessible to user mode by mmap()ing
      /dev/kvm).  Guest mode has no access to any I/O devices; any such access is
      intercepted and directed to user mode for emulation.
      
      The driver supports i386 and x86_64 hosts and guests.  All combinations are
      allowed except x86_64 guest on i386 host.  For i386 guests and hosts, both pae
      and non-pae paging modes are supported.
      
      SMP hosts and UP guests are supported.  At the moment only Intel
      hardware is supported, but AMD virtualization support is being worked on.
      
      Performance currently is non-stellar due to the naive implementation of the
      mmu virtualization, which throws away most of the shadow page table entries
      every context switch.  We plan to address this in two ways:
      
      - cache shadow page tables across tlb flushes
      - wait until AMD and Intel release processors with nested page tables
      
      Currently a virtual desktop is responsive but consumes a lot of CPU.  Under
      Windows I tried playing pinball and watching a few flash movies; with a recent
      CPU one can hardly feel the virtualization.  Linux/X is slower, probably due
      to X being in a separate process.
      
      In addition to the driver, you need a slightly modified qemu to provide I/O
      device emulation and the BIOS.
      
      Caveats (akpm: might no longer be true):
      
      - The Windows install currently bluescreens due to a problem with the
        virtual APIC.  We are working on a fix.  A temporary workaround is to
        use an existing image or install through qemu
      - Windows 64-bit does not work.  That's also true for qemu, so it's
        probably a problem with the device model.
      
      [bero@arklinux.org: build fix]
      [simon.kagstrom@bth.se: build fix, other fixes]
      [uril@qumranet.com: KVM: Expose interrupt bitmap]
      [akpm@osdl.org: i386 build fix]
      [mingo@elte.hu: i386 fixes]
      [rdreier@cisco.com: add log levels to all printks]
      [randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
      [anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
      Signed-off-by: NYaniv Kamay <yaniv@qumranet.com>
      Signed-off-by: NAvi Kivity <avi@qumranet.com>
      Cc: Simon Kagstrom <simon.kagstrom@bth.se>
      Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
      Signed-off-by: NUri Lublin <uril@qumranet.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Roland Dreier <rolandd@cisco.com>
      Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAnthony Liguori <anthony@codemonkey.ws>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      6aa8b732
    • D
      [PATCH] clocksource: small cleanup · f5f1a24a
      Daniel Walker 提交于
      Mostly changing alignment.  Just some general cleanup.
      
      [akpm@osdl.org: build fix]
      Signed-off-by: NDaniel Walker <dwalker@mvista.com>
      Acked-by: NJohn Stultz <johnstul@us.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f5f1a24a
    • D
      [PATCH] clocksource: add usage of CONFIG_SYSFS · 2b013700
      Daniel Walker 提交于
      Simply adds some ifdefs to remove clocksoure sysfs code when CONFIG_SYSFS
      isn't turn on.
      Signed-off-by: NDaniel Walker <dwalker@mvista.com>
      Acked-by: NJohn Stultz <johnstul@us.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      2b013700
    • A
      [PATCH] user of the jiffies rounding patch: Slab · 2b284214
      Arjan van de Ven 提交于
      This patch introduces users of the round_jiffies() function in the slab code.
      
      The slab code has a few "run every second" timers for background work; these
      are obviously not timing critical as long as they happen roughly at the right
      frequency.
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      2b284214
    • A
      [PATCH] user of the jiffies rounding code: JBD · 44d306e1
      Arjan van de Ven 提交于
      This patch introduces a user: of the round_jiffies() function; the "5 second"
      ext3/jbd wakeup.
      
      While "every 5 seconds" doesn't sound as a problem, there can be many of these
      (and these timers do add up over all the kernel).  The "5 second" wakeup isn't
      really timing sensitive; in addition even with rounding it'll still happen
      every 5 seconds (with the exception of the very first time, which is likely to
      be rounded up to somewhere closer to 6 seconds)
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      44d306e1
    • A
      [PATCH] round_jiffies infrastructure · 4c36a5de
      Arjan van de Ven 提交于
      Introduce a round_jiffies() function as well as a round_jiffies_relative()
      function.  These functions round a jiffies value to the next whole second.
      The primary purpose of this rounding is to cause all "we don't care exactly
      when" timers to happen at the same jiffy.
      
      This avoids multiple timers firing within the second for no real reason;
      with dynamic ticks these extra timers cause wakeups from deep sleep CPU
      sleep states and thus waste power.
      
      The exact wakeup moment is skewed by the cpu number, to avoid all cpus from
      waking up at the exact same time (and hitting the same lock/cachelines
      there)
      
      [akpm@osdl.org: fix variable type]
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4c36a5de
    • V
      [PATCH] fdtable: Implement new pagesize-based fdtable allocator · 5466b456
      Vadim Lobanov 提交于
      This patch provides an improved fdtable allocation scheme, useful for
      expanding fdtable file descriptor entries.  The main focus is on the fdarray,
      as its memory usage grows 128 times faster than that of an fdset.
      
      The allocation algorithm sizes the fdarray in such a way that its memory usage
      increases in easy page-sized chunks. The overall algorithm expands the allowed
      size in powers of two, in order to amortize the cost of invoking vmalloc() for
      larger allocation sizes. Namely, the following sizes for the fdarray are
      considered, and the smallest that accommodates the requested fd count is
      chosen:
      
          pagesize / 4
          pagesize / 2
          pagesize      <- memory allocator switch point
          pagesize * 2
          pagesize * 4
          ...etc...
      
      Unlike the current implementation, this allocation scheme does not require a
      loop to compute the optimal fdarray size, and can be done in efficient
      straightline code.
      
      Furthermore, since the fdarray overflows the pagesize boundary long before any
      of the fdsets do, it makes sense to optimize run-time by allocating both
      fdsets in a single swoop.  Even together, they will still be, by far, smaller
      than the fdarray.  The fdtable->open_fds is now used as the anchor for the
      fdset memory allocation.
      Signed-off-by: NVadim Lobanov <vlobanov@speakeasy.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5466b456
    • V
      [PATCH] fdtable: Remove the free_files field · 4fd45812
      Vadim Lobanov 提交于
      An fdtable can either be embedded inside a files_struct or standalone (after
      being expanded).  When an fdtable is being discarded after all RCU references
      to it have expired, we must either free it directly, in the standalone case,
      or free the files_struct it is contained within, in the embedded case.
      
      Currently the free_files field controls this behavior, but we can get rid of
      it entirely, as all the necessary information is already recorded.  We can
      distinguish embedded and standalone fdtables using max_fds, and if it is
      embedded we can divine the relevant files_struct using container_of().
      Signed-off-by: NVadim Lobanov <vlobanov@speakeasy.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4fd45812
    • V
      [PATCH] fdtable: Make fdarray and fdsets equal in size · bbea9f69
      Vadim Lobanov 提交于
      Currently, each fdtable supports three dynamically-sized arrays of data: the
      fdarray and two fdsets.  The code allows the number of fds supported by the
      fdarray (fdtable->max_fds) to differ from the number of fds supported by each
      of the fdsets (fdtable->max_fdset).
      
      In practice, it is wasteful for these two sizes to differ: whenever we hit a
      limit on the smaller-capacity structure, we will reallocate the entire fdtable
      and all the dynamic arrays within it, so any delta in the memory used by the
      larger-capacity structure will never be touched at all.
      
      Rather than hogging this excess, we shouldn't even allocate it in the first
      place, and keep the capacities of the fdarray and the fdsets equal.  This
      patch removes fdtable->max_fdset.  As an added bonus, most of the supporting
      code becomes simpler.
      Signed-off-by: NVadim Lobanov <vlobanov@speakeasy.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      bbea9f69
    • V
      [PATCH] fdtable: Delete pointless code in dup_fd() · f3d19c90
      Vadim Lobanov 提交于
      The dup_fd() function creates a new files_struct and fdtable embedded inside
      that files_struct, and then possibly expands the fdtable using expand_files().
      
      The out_release error path is invoked when expand_files() returns an error
      code.  However, when this attempt to expand fails, the fdtable is left in its
      original embedded form, so it is pointless to try to free the associated
      fdarray and fdsets.
      Signed-off-by: NVadim Lobanov <vlobanov@speakeasy.net>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f3d19c90
    • Z
      [PATCH] dio: lock refcount operations · 5eb6c7a2
      Zach Brown 提交于
      The wait_for_more_bios() function name was poorly chosen.  While looking to
      clean it up it I noticed that the dio struct refcounting between the bio
      completion and dio submission paths was racey.
      
      The bio submission path was simply freeing the dio struct if
      atomic_dec_and_test() indicated that it dropped the final reference.
      
      The aio bio completion path was dereferencing its dio struct pointer *after
      dropping its reference* based on the remaining number of references.
      
      These two paths could race and result in the aio bio completion path
      dereferencing a freed dio, though this was not observed in the wild.
      
      This moves the refcount under the bio lock so that bio completion can drop
      its reference and decide to wake all in one atomic step.
      
      Once testing and waking is locked dio_await_one() can test its sleeping
      condition and mark itself uninterruptible under the lock.  It gets simpler
      and wait_for_more_bios() disappears.
      
      The addition of the interrupt masking spin lock acquiry in dio_bio_submit()
      looks alarming.  This lock acquiry existed in that path before the recent
      dio completion patch set.  We shouldn't expect significant performance
      regression from returning to the behaviour that existed before the
      completion clean up work.
      
      This passed 4k block ext3 O_DIRECT fsx and aio-stress on an SMP machine.
      Signed-off-by: NZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5eb6c7a2
    • Z
      [PATCH] dio: only call aio_complete() after returning -EIOCBQUEUED · 8459d86a
      Zach Brown 提交于
      The only time it is safe to call aio_complete() is when the ->ki_retry
      function returns -EIOCBQUEUED to the AIO core.  direct_io_worker() has
      historically done this by relying on its caller to translate positive return
      codes into -EIOCBQUEUED for the aio case.  It did this by trying to keep
      conditionals in sync.  direct_io_worker() knew when finished_one_bio() was
      going to call aio_complete().  It would reverse the test and wait and free the
      dio in the cases it thought that finished_one_bio() wasn't going to.
      
      Not surprisingly, it ended up getting it wrong.  'ret' could be a negative
      errno from the submission path but it failed to communicate this to
      finished_one_bio().  direct_io_worker() would return < 0, it's callers
      wouldn't raise -EIOCBQUEUED, and aio_complete() would be called.  In the
      future finished_one_bio()'s tests wouldn't reflect this and aio_complete()
      would be called for a second time which can manifest as an oops.
      
      The previous cleanups have whittled the sync and async completion paths down
      to the point where we can collapse them and clearly reassert the invariant
      that we must only call aio_complete() after returning -EIOCBQUEUED.
      direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
      drop the dio refcount and the aio bio completion path will only call
      aio_complete() when it is the last to drop the dio refcount.
      direct_io_worker() can ensure that it is the last to drop the reference count
      by waiting for bios to drain.  It does this for sync ops, of course, and for
      partial dio writes that must fall back to buffered and for aio ops that saw
      errors during submission.
      
      This means that operations that end up waiting, even if they were issued as
      aio ops, will not call aio_complete() from dio.  Instead we return the return
      code of the operation and let the aio core call aio_complete().  This is
      purposely done to fix a bug where AIO DIO file extensions would call
      aio_complete() before their callers have a chance to update i_size.
      
      Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers
      no longer have to translate for it.  XFS needs to be careful not to free
      resources that will be used during AIO completion if -EIOCBQUEUED is returned.
       We maintain the previous behaviour of trying to write fs metadata for O_SYNC
      aio+dio writes.
      Signed-off-by: NZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8459d86a
    • Z
      [PATCH] dio: remove duplicate bio wait code · 20258b2b
      Zach Brown 提交于
      Now that we have a single refcount and waiting path we can reuse it in the
      async 'should_wait' path.  It continues to rely on the fragile link between
      the conditional in dio_complete_aio() which decides to complete the AIO and
      the conditional in direct_io_worker() which decides to wait and free.
      
      By waiting before dropping the reference we stop dio_bio_end_aio() from
      calling dio_complete_aio() which used to wake up the waiter after seeing the
      reference count drop to 0.  We hoist this wake up into dio_bio_end_aio() which
      now notices when it's left a single remaining reference that is held by the
      waiter.
      Signed-off-by: NZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      20258b2b
    • Z
      [PATCH] dio: formalize bio counters as a dio reference count · 0273201e
      Zach Brown 提交于
      Previously we had two confusing counts of bio progress.  'bio_count' was
      decremented as bios were processed and freed by the dio core.  It was used to
      indicate final completion of the dio operation.  'bios_in_flight' reflected
      how many bios were between submit_bio() and bio->end_io.  It was used by the
      sync path to decide when to wake up and finish completing bios and was ignored
      by the async path.
      
      This patch collapses the two notions into one notion of a dio reference count.
       bios hold a dio reference when they're between submit_bio and bio->end_io.
      
      Since bios_in_flight was only used in the sync path it is now equivalent to
      dio->refcount - 1 which accounts for direct_io_worker() holding a reference
      for the duration of the operation.
      
      dio_bio_complete() -> finished_one_bio() was called from the sync path after
      finding bios on the list that the bio->end_io function had deposited.
      finished_one_bio() can not drop the dio reference on behalf of these bios now
      because bio->end_io already has.  The is_async test in finished_one_bio()
      meant that it never actually did anything other than drop the bio_count for
      sync callers.  So we remove its refcount decrement, don't call it from
      dio_bio_complete(), and hoist its call up into the async dio_bio_complete()
      caller after an explicit refcount decrement.  It is renamed dio_complete_aio()
      to reflect the remaining work it actually does.
      Signed-off-by: NZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      0273201e
    • Z
      [PATCH] dio: call blk_run_address_space() once per op · 17a7b1d7
      Zach Brown 提交于
      We only need to call blk_run_address_space() once after all the bios for the
      direct IO op have been submitted.  This removes the chance of calling
      blk_run_address_space() after spurious wake ups as the sync path waits for
      bios to drain.  It's also one less difference betwen the sync and async paths.
      
      In the process we remove a redundant dio_bio_submit() that its caller had
      already performed.
      Signed-off-by: NZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      17a7b1d7
    • Z
      [PATCH] dio: centralize completion in dio_complete() · 6d544bb4
      Zach Brown 提交于
      There have been a lot of bugs recently due to the way direct_io_worker() tries
      to decide how to finish direct IO operations.  In the worst examples it has
      failed to call aio_complete() at all (hang) or called it too many times
      (oops).
      
      This set of patches cleans up the completion phase with the goal of removing
      the complexity that lead to these bugs.  We end up with one path that
      calculates the result of the operation after all off the bios have completed.
      We decide when to generate a result of the operation using that path based on
      the final release of a refcount on the dio structure.
      
      I tried to progress towards the final state in steps that were relatively easy
      to understand.  Each step should compile but I only tested the final result of
      having all the patches applied.
      
      I've tested these on low end PC drives with aio-stress, the direct IO tests I
      could manage to get running in LTP, orasim, and some home-brew functional
      tests.
      
      In http://lkml.org/lkml/2006/9/21/103 IBM reports success with ext2 and ext3
      running DIO LTP tests.  They found that XFS bug which has since been addressed
      in the patch series.
      
      This patch:
      
      The mechanics which decide the result of a direct IO operation were duplicated
      in the sync and async paths.
      
      The async path didn't check page_errors which can manifest as silently
      returning success when the final pointer in an operation faults and its
      matching file region is filled with zeros.
      
      The sync path and async path differed in whether they passed errors to the
      caller's dio->end_io operation.  The async path was passing errors to it which
      trips an assertion in XFS, though it is apparently harmless.
      
      This centralizes the completion phase of dio ops in one place.  AIO will now
      return EFAULT consistently and all paths fall back to the previously sync
      behaviour of passing the number of bytes 'transferred' to the dio->end_io
      callback, regardless of errors.
      
      dio_await_completion() doesn't have to propogate EIO from non-uptodate bios
      now that it's being propogated through dio_complete() via dio->io_error.  This
      lets it return void which simplifies its sole caller.
      Signed-off-by: NZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      6d544bb4
    • N
      [PATCH] md: assorted md and raid1 one-liners · 17571284
      NeilBrown 提交于
      Fix few bugs that meant that:
        - superblocks weren't alway written at exactly the right time (this
          could show up if the array was not written to - writting to the array
          causes lots of superblock updates and so hides these errors).
      
        - restarting device recovery after a clean shutdown (version-1 metadata
          only) didn't work as intended (or at all).
      
      1/ Ensure superblock is updated when a new device is added.
      2/ Remove an inappropriate test on MD_RECOVERY_SYNC in md_do_sync.
         The body of this if takes one of two branches depending on whether
         MD_RECOVERY_SYNC is set, so testing it in the clause of the if
         is wrong.
      3/ Flag superblock for updating after a resync/recovery finishes.
      4/ If we find the neeed to restart a recovery in the middle (version-1
         metadata only) make sure a full recovery (not just as guided by
         bitmaps) does get done.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      17571284
    • N
      [PATCH] md: return a non-zero error to bi_end_io as appropriate in raid5 · c2b00852
      NeilBrown 提交于
      Currently raid5 depends on clearing the BIO_UPTODATE flag to signal an error
      to higher levels.  While this should be sufficient, it is safer to explicitly
      set the error code as well - less room for confusion.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c2b00852
    • N
      [PATCH] md: remove some old ifdefed-out code from raid5.c · b8c6b645
      NeilBrown 提交于
      There are some vestiges of old code that was used for bypassing the stripe
      cache on reads in raid5.c.  This was never updated after the change from
      buffer_heads to bios, but was left as a reminder.
      
      That functionality has nowe been implemented in a completely different way, so
      the old code can go.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b8c6b645
    • J
      [PATCH] MD: conditionalize some code · fdee8ae4
      Jeff Garzik 提交于
      The autorun code is only used if this module is built into the static
      kernel image.  Adjust #ifdefs accordingly.
      Signed-off-by: NJeff Garzik <jeff@garzik.org>
      Acked-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      fdee8ae4
    • N
      [PATCH] md: fix innocuous bug in raid6 stripe_to_pdidx · b875e531
      NeilBrown 提交于
      stripe_to_pdidx finds the index of the parity disk for a given stripe.  It
      assumes raid5 in that it uses "disks-1" to determine the number of data disks.
      
      This is incorrect for raid6 but fortunately the two usages cancel each other
      out.  The only way that 'data_disks' affects the calculation of pd_idx in
      raid5_compute_sector is when it is divided into the sector number.  But as
      that sector number is calculated by multiplying in the wrong value of
      'data_disks' the division produces the right value.
      
      So it is innocuous but needs to be fixed.
      
      Also change the calculation of raid_disks in compute_blocknr to make it
      more obviously correct (it seems at first to always use disks-1 too).
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b875e531
    • R
      [PATCH] md: enable bypassing cache for reads · 52488615
      Raz Ben-Jehuda(caro) 提交于
      Call the chunk_aligned_read where appropriate.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      52488615
    • R
      [PATCH] md: allow reads that have bypassed the cache to be retried on failure · 46031f9a
      Raz Ben-Jehuda(caro) 提交于
      If a bypass-the-cache read fails, we simply try again through the cache.  If
      it fails again it will trigger normal recovery precedures.
      
      update 1:
      
      From: NeilBrown <neilb@suse.de>
      
      1/
        chunk_aligned_read and retry_aligned_read assume that
            data_disks == raid_disks - 1
        which is not true for raid6.
        So when an aligned read request bypasses the cache, we can get the wrong data.
      
      2/ The cloned bio is being used-after-free in raid5_align_endio
         (to test BIO_UPTODATE).
      
      3/ We forgot to add rdev->data_offset when submitting
         a bio for aligned-read
      
      4/ clone_bio calls blk_recount_segments and then we change bi_bdev,
         so we need to invalidate the segment counts.
      
      5/ We don't de-reference the rdev when the read completes.
         This means we need to record the rdev to so it is still
         available in the end_io routine.  Fortunately
         bi_next in the original bio is unused at this point so
         we can stuff it in there.
      
      6/ We leak a cloned bio if the target rdev is not usable.
      
      From: NeilBrown <neilb@suse.de>
      
      update 2:
      
      1/ When aligned requests fail (read error) they need to be retried
         via the normal method (stripe cache).  As we cannot be sure that
         we can process a single read in one go (we may not be able to
         allocate all the stripes needed) we store a bio-being-retried
         and a list of bioes-that-still-need-to-be-retried.
         When find a bio that needs to be retried, we should add it to
         the list, not to single-bio...
      
      2/ We were never incrementing 'scnt' when resubmitting failed
         aligned requests.
      
      [akpm@osdl.org: build fix]
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      46031f9a
    • R
    • R
      [PATCH] md: define raid5_mergeable_bvec · 23032a0e
      Raz Ben-Jehuda(caro) 提交于
      This will encourage read request to be on only one device, so we will often be
      able to bypass the cache for read requests.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      23032a0e
    • N
      [PATCH] md: tidy up device-change notification when an md array is stopped · 0d4ca600
      NeilBrown 提交于
      An md array can be stopped leaving all the setting still in place, or it can
      torn down and destroyed.  set_capacity and other change notifications only
      happen in the latter case, but should happen in both.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      0d4ca600
    • P
      [PATCH] Fbdev driver for IBM GXT4500P videocards · a3d89983
      Paul Mackerras 提交于
      This is an fbdev driver for the IBM GXT4500P display card found in some IBM
      System P (pSeries) machines.  These cards have hardware 2D and 3D
      capabilities, but the driver does not use them; it just exports a dumb
      framebuffer.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Acked-by: NJames Simmons <jsimmons@infradead.org>
      Cc: "Antonino A. Daplas" <adaplas@pol.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      a3d89983
    • A
      [PATCH] ide-cd: Handle strange interrupt on the Intel ESB2 · ee2f344b
      Alan Cox 提交于
      The ESB2 appears to emit spurious DMA interrupts when configured for native
      mode and handling ATAPI devices.  Stratus were able to pin this bug down and
      produce a patch.  This is a rework which applies the fixup only to the ESB2
      (for now).  We can apply it to other chips later if the same problem is found.
      
      This code has been tested and confirmed to fix the problem on the tested
      systems.
      Signed-off-by: NAlan Cox <alan@redhat.com>
      (Most of the hard work done by Stratus however)
      Cc: Jens Axboe <axboe@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ee2f344b
    • M
      [PATCH] kernel/sched.c: whitespace cleanups · 33859f7f
      Miguel Ojeda Sandonis 提交于
      [akpm@osdl.org: additional cleanups]
      Signed-off-by: NMiguel Ojeda Sandonis <maxextreme@gmail.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      33859f7f
    • C
      [PATCH] sched: optimize activate_task for RT task · 62ab616d
      Chen, Kenneth W 提交于
      RT task does not participate in interactiveness priority and thus shouldn't
      be bothered with timestamp and p->sleep_type manipulation when task is
      being put on run queue.  Bypass all of the them with a single if (rt_task)
      test.
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      62ab616d
    • C
      [PATCH] sched: remove lb_stopbalance counter · 06066714
      Chen, Kenneth W 提交于
      Remove scheduler stats lb_stopbalance counter.  This counter can be
      calculated by: lb_balanced - lb_nobusyg - lb_nobusyq.  There is no need to
      create gazillion counters while we can derive the value.
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      06066714
    • S
      [PATCH] sched: decrease number of load balances · 783609c6
      Siddha, Suresh B 提交于
      Currently at a particular domain, each cpu in the sched group will do a
      load balance at the frequency of balance_interval.  More the cores and
      threads, more the cpus will be in each sched group at SMP and NUMA domain.
      And we endup spending quite a bit of time doing load balancing in those
      domains.
      
      Fix this by making only one cpu(first idle cpu or first cpu in the group if
      all the cpus are busy) in the sched group do the load balance at that
      particular sched domain and this load will slowly percolate down to the
      other cpus with in that group(when they do load balancing at lower
      domains).
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      783609c6
    • M
      [PATCH] sched: improve migration accuracy · b18ec803
      Mike Galbraith 提交于
      Co-opt rq->timestamp_last_tick to maintain a cache_hot_time evaluation
      reference timestamp at both tick and sched times to prevent said reference,
      formerly rq->timestamp_last_tick, from being behind task->last_ran at
      evaluation time, and to move said reference closer to current time on the
      remote processor, intent being to improve cache hot evaluation and
      timestamp adjustment accuracy for task migration.
      
      Fix minor sched_time double accounting error which occurs when a task
      passing through schedule() does not schedule off, and takes the next timer
      tick.
      
      [kenneth.w.chen@intel.com: cleanup]
      Signed-off-by: NMike Galbraith <efault@gmx.de>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Acked-by: NKen Chen <kenneth.w.chen@intel.com>
      Cc: Don Mullis <dwm@meer.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b18ec803
    • C
      [PATCH] sched: add option to serialize load balancing · 08c183f3
      Christoph Lameter 提交于
      Large sched domains can be very expensive to scan.  Add an option SD_SERIALIZE
      to the sched domain flags.  If that flag is set then we make sure that no
      other such domain is being balanced.
      
      [akpm@osdl.org: build fix]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      08c183f3