1. 11 Dec, 2006 40 commits
    • A
      [PATCH] round_jiffies infrastructure · 4c36a5de
      Arjan van de Ven committed
      Introduce a round_jiffies() function as well as a round_jiffies_relative()
      function.  These functions round a jiffies value to the next whole second.
      The primary purpose of this rounding is to cause all "we don't care exactly
      when" timers to happen at the same jiffy.
      
      This avoids multiple timers firing within the second for no real reason;
      with dynamic ticks these extra timers cause wakeups from deep CPU sleep
      states and thus waste power.
      
      The exact wakeup moment is skewed by the cpu number, to keep all cpus from
      waking up at the exact same time (and hitting the same lock/cachelines
      there).
      
      [akpm@osdl.org: fix variable type]
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4c36a5de
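The rounding described in the round_jiffies commit above can be sketched in userspace C. This is a minimal illustration, not the code from the patch: `HZ`, `CPU_SKEW` and `round_jiffies_sketch()` are assumed names and values, and the sketch always rounds up, whereas the real function may differ in details (for example in how it treats values just past a second boundary).

```c
/* Assumed tick rate: 1000 jiffies per second. */
#define HZ 1000UL
/* Assumed per-CPU skew, in jiffies, so CPUs do not all wake on the same jiffy. */
#define CPU_SKEW 3UL

/*
 * Round an absolute jiffies value up to the next whole second,
 * offset by a small amount that depends on the cpu number.
 */
static unsigned long round_jiffies_sketch(unsigned long j, unsigned int cpu)
{
    unsigned long skew = cpu * CPU_SKEW;
    unsigned long rem;

    j += skew;              /* shift into this CPU's skewed time base */
    rem = j % HZ;
    if (rem != 0)
        j += HZ - rem;      /* round up to the next second boundary */
    return j - skew;        /* shift back to real jiffies */
}
```

With these assumptions, two "don't care exactly when" timers set for jiffies 1200 and 1700 on cpu 0 both land on jiffy 2000, while the same timers on cpu 2 land on jiffy 1994.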
    • V
      [PATCH] fdtable: Implement new pagesize-based fdtable allocator · 5466b456
      Vadim Lobanov committed
      This patch provides an improved fdtable allocation scheme, useful for
      expanding fdtable file descriptor entries.  The main focus is on the fdarray,
      as its memory usage grows 128 times faster than that of an fdset.
      
      The allocation algorithm sizes the fdarray in such a way that its memory usage
      increases in easy page-sized chunks. The overall algorithm expands the allowed
      size in powers of two, in order to amortize the cost of invoking vmalloc() for
      larger allocation sizes. Namely, the following sizes for the fdarray are
      considered, and the smallest that accommodates the requested fd count is
      chosen:
      
          pagesize / 4
          pagesize / 2
          pagesize      <- memory allocator switch point
          pagesize * 2
          pagesize * 4
          ...etc...
      
      Unlike the current implementation, this allocation scheme does not require a
      loop to compute the optimal fdarray size, and can be done in efficient
      straightline code.
      
      Furthermore, since the fdarray overflows the pagesize boundary long before any
      of the fdsets do, it makes sense to optimize run-time by allocating both
      fdsets in a single swoop.  Even together, they will still be, by far, smaller
      than the fdarray.  The fdtable->open_fds is now used as the anchor for the
      fdset memory allocation.
      Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      5466b456
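The size selection described in the fdtable allocator commit above (smallest of pagesize / 4, pagesize / 2, pagesize, pagesize * 2, ... that fits the requested fd count) can be computed without a search loop. The following is a hedged sketch only: `fdarray_bytes()` and `roundup_pow2()` are hypothetical helpers written for illustration, and the pointer size and page size are passed in as parameters rather than taken from the kernel.

```c
#include <stdint.h>

/* Round v (v >= 1) up to the next power of two using straight-line bit smearing. */
static uint64_t roundup_pow2(uint64_t v)
{
    v--;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;
    return v + 1;
}

/*
 * Return the fdarray size in bytes: the smallest value of the form
 * (pagesize / 4) * 2^k that can hold nr_fds pointers of ptr_size bytes.
 */
static uint64_t fdarray_bytes(uint64_t nr_fds, uint64_t ptr_size, uint64_t pagesize)
{
    uint64_t min_chunk = pagesize / 4;
    uint64_t need = nr_fds * ptr_size;
    uint64_t chunks = (need + min_chunk - 1) / min_chunk;  /* ceiling division */

    if (chunks == 0)
        chunks = 1;         /* always allocate at least pagesize / 4 */
    return roundup_pow2(chunks) * min_chunk;
}
```

Assuming 8-byte pointers and 4096-byte pages, 128 fds fit in 1024 bytes (pagesize / 4), 129 fds need 2048 bytes, and 513 fds jump to 8192 bytes (pagesize * 2).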
    • V
      [PATCH] fdtable: Remove the free_files field · 4fd45812
      Vadim Lobanov committed
      An fdtable can either be embedded inside a files_struct or standalone (after
      being expanded).  When an fdtable is being discarded after all RCU references
      to it have expired, we must either free it directly, in the standalone case,
      or free the files_struct it is contained within, in the embedded case.
      
      Currently the free_files field controls this behavior, but we can get rid of
      it entirely, as all the necessary information is already recorded.  We can
      distinguish embedded and standalone fdtables using max_fds, and if it is
      embedded we can divine the relevant files_struct using container_of().
      Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4fd45812
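The free_files removal above relies on two ideas: max_fds tells embedded and standalone fdtables apart, and container_of() recovers the enclosing files_struct. A minimal userspace sketch follows. The struct layouts are heavily reduced, and the threshold `NR_OPEN_DEFAULT`, its value, and the helper names are assumptions made for illustration rather than the patch's actual code.

```c
#include <stddef.h>

/* Assumed capacity of the fdtable embedded in a files_struct. */
#define NR_OPEN_DEFAULT 32

/* Recover a pointer to the enclosing structure from a pointer to one member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Reduced stand-ins for the kernel structures. */
struct fdtable {
    unsigned int max_fds;
};

struct files_struct {
    int count;
    struct fdtable fdtab;   /* the embedded fdtable */
};

/* Assumption: an embedded fdtable never holds more than NR_OPEN_DEFAULT fds. */
static int fdtable_is_embedded(const struct fdtable *fdt)
{
    return fdt->max_fds <= NR_OPEN_DEFAULT;
}

/* Return the owning files_struct for an embedded fdtable, or NULL if standalone. */
static struct files_struct *fdtable_owner(struct fdtable *fdt)
{
    if (!fdtable_is_embedded(fdt))
        return NULL;
    return container_of(fdt, struct files_struct, fdtab);
}
```

A free path built on this would release the files_struct when `fdtable_owner()` returns non-NULL and free the fdtable directly otherwise, with no separate flag needed.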
    • V
      [PATCH] fdtable: Make fdarray and fdsets equal in size · bbea9f69
      Vadim Lobanov committed
      Currently, each fdtable supports three dynamically-sized arrays of data: the
      fdarray and two fdsets.  The code allows the number of fds supported by the
      fdarray (fdtable->max_fds) to differ from the number of fds supported by each
      of the fdsets (fdtable->max_fdset).
      
      In practice, it is wasteful for these two sizes to differ: whenever we hit a
      limit on the smaller-capacity structure, we will reallocate the entire fdtable
      and all the dynamic arrays within it, so any delta in the memory used by the
      larger-capacity structure will never be touched at all.
      
      Rather than hogging this excess, we shouldn't even allocate it in the first
      place, and keep the capacities of the fdarray and the fdsets equal.  This
      patch removes fdtable->max_fdset.  As an added bonus, most of the supporting
      code becomes simpler.
      Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      bbea9f69
    • V
      [PATCH] fdtable: Delete pointless code in dup_fd() · f3d19c90
      Vadim Lobanov committed
      The dup_fd() function creates a new files_struct and fdtable embedded inside
      that files_struct, and then possibly expands the fdtable using expand_files().
      
      The out_release error path is invoked when expand_files() returns an error
      code.  However, when this attempt to expand fails, the fdtable is left in its
      original embedded form, so it is pointless to try to free the associated
      fdarray and fdsets.
      Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f3d19c90
    • Z
      [PATCH] dio: lock refcount operations · 5eb6c7a2
      Zach Brown committed
      The wait_for_more_bios() function name was poorly chosen.  While looking to
      clean it up I noticed that the dio struct refcounting between the bio
      completion and dio submission paths was racy.
      
      The bio submission path was simply freeing the dio struct if
      atomic_dec_and_test() indicated that it dropped the final reference.
      
      The aio bio completion path was dereferencing its dio struct pointer *after
      dropping its reference* based on the remaining number of references.
      
      These two paths could race and result in the aio bio completion path
      dereferencing a freed dio, though this was not observed in the wild.
      
      This moves the refcount under the bio lock so that bio completion can drop
      its reference and decide to wake all in one atomic step.
      
      Once testing and waking is locked dio_await_one() can test its sleeping
      condition and mark itself uninterruptible under the lock.  It gets simpler
      and wait_for_more_bios() disappears.
      
      The addition of the interrupt masking spin lock acquisition in dio_bio_submit()
      looks alarming.  This lock acquisition existed in that path before the recent
      dio completion patch set.  We shouldn't expect significant performance
      regression from returning to the behaviour that existed before the
      completion clean up work.
      
      This passed 4k block ext3 O_DIRECT fsx and aio-stress on an SMP machine.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      5eb6c7a2
    • Z
      [PATCH] dio: only call aio_complete() after returning -EIOCBQUEUED · 8459d86a
      Zach Brown committed
      The only time it is safe to call aio_complete() is when the ->ki_retry
      function returns -EIOCBQUEUED to the AIO core.  direct_io_worker() has
      historically done this by relying on its caller to translate positive return
      codes into -EIOCBQUEUED for the aio case.  It did this by trying to keep
      conditionals in sync.  direct_io_worker() knew when finished_one_bio() was
      going to call aio_complete().  It would reverse the test and wait and free the
      dio in the cases it thought that finished_one_bio() wasn't going to.
      
      Not surprisingly, it ended up getting it wrong.  'ret' could be a negative
      errno from the submission path but it failed to communicate this to
      finished_one_bio().  direct_io_worker() would return < 0, its callers
      wouldn't raise -EIOCBQUEUED, and aio_complete() would be called.  In the
      future finished_one_bio()'s tests wouldn't reflect this and aio_complete()
      would be called for a second time which can manifest as an oops.
      
      The previous cleanups have whittled the sync and async completion paths down
      to the point where we can collapse them and clearly reassert the invariant
      that we must only call aio_complete() after returning -EIOCBQUEUED.
      direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
      drop the dio refcount and the aio bio completion path will only call
      aio_complete() when it is the last to drop the dio refcount.
      direct_io_worker() can ensure that it is the last to drop the reference count
      by waiting for bios to drain.  It does this for sync ops, of course, and for
      partial dio writes that must fall back to buffered and for aio ops that saw
      errors during submission.
      
      This means that operations that end up waiting, even if they were issued as
      aio ops, will not call aio_complete() from dio.  Instead we return the return
      code of the operation and let the aio core call aio_complete().  This is
      purposely done to fix a bug where AIO DIO file extensions would call
      aio_complete() before their callers have a chance to update i_size.
      
      Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers
      no longer have to translate for it.  XFS needs to be careful not to free
      resources that will be used during AIO completion if -EIOCBQUEUED is returned.
      We maintain the previous behaviour of trying to write fs metadata for O_SYNC
      aio+dio writes.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      8459d86a
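The invariant in the aio_complete() commit above (whoever drops the last dio reference decides how the operation finishes) can be modelled in a few lines. This is a single-threaded sketch under stated assumptions: `struct dio_sketch` and the `*_sketch` functions are hypothetical, the counter increment stands in for a call to aio_complete(), and the real code performs the decrement under a lock. `EIOCBQUEUED` is defined locally with the kernel-internal value 529.

```c
/* Kernel-internal errno ("iocb queued, will get completion event"), defined here for the sketch. */
#define EIOCBQUEUED 529

/* Reduced model of the dio state relevant to completion. */
struct dio_sketch {
    int refcount;         /* one reference for the submitter, one per in-flight bio */
    int result;           /* bytes transferred or negative errno */
    int aio_completions;  /* how many times the stand-in for aio_complete() ran */
};

/* Drop one reference and return the number remaining. */
static int dio_put(struct dio_sketch *dio)
{
    return --dio->refcount;
}

/* Bio completion path: completes the AIO only if it dropped the last reference. */
static void bio_end_io_sketch(struct dio_sketch *dio)
{
    if (dio_put(dio) == 0)
        dio->aio_completions++;   /* stands in for aio_complete() */
}

/* Submission path: reports -EIOCBQUEUED only if bios still hold references. */
static int direct_io_worker_sketch(struct dio_sketch *dio)
{
    if (dio_put(dio) > 0)
        return -EIOCBQUEUED;      /* completion path will finish the op */
    return dio->result;           /* we were last; the caller completes it */
}
```

In either ordering exactly one side finishes the operation: if the bio completes first, the worker returns the result and the stand-in is never called; if the worker drops its reference first, it returns -EIOCBQUEUED and the bio completion calls the stand-in once.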
    • Z
      [PATCH] dio: remove duplicate bio wait code · 20258b2b
      Zach Brown committed
      Now that we have a single refcount and waiting path we can reuse it in the
      async 'should_wait' path.  It continues to rely on the fragile link between
      the conditional in dio_complete_aio() which decides to complete the AIO and
      the conditional in direct_io_worker() which decides to wait and free.
      
      By waiting before dropping the reference we stop dio_bio_end_aio() from
      calling dio_complete_aio() which used to wake up the waiter after seeing the
      reference count drop to 0.  We hoist this wake up into dio_bio_end_aio() which
      now notices when it's left a single remaining reference that is held by the
      waiter.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      20258b2b
    • Z
      [PATCH] dio: formalize bio counters as a dio reference count · 0273201e
      Zach Brown committed
      Previously we had two confusing counts of bio progress.  'bio_count' was
      decremented as bios were processed and freed by the dio core.  It was used to
      indicate final completion of the dio operation.  'bios_in_flight' reflected
      how many bios were between submit_bio() and bio->end_io.  It was used by the
      sync path to decide when to wake up and finish completing bios and was ignored
      by the async path.
      
      This patch collapses the two notions into one notion of a dio reference count.
      Bios hold a dio reference when they're between submit_bio and bio->end_io.
      
      Since bios_in_flight was only used in the sync path it is now equivalent to
      dio->refcount - 1 which accounts for direct_io_worker() holding a reference
      for the duration of the operation.
      
      dio_bio_complete() -> finished_one_bio() was called from the sync path after
      finding bios on the list that the bio->end_io function had deposited.
      finished_one_bio() can not drop the dio reference on behalf of these bios now
      because bio->end_io already has.  The is_async test in finished_one_bio()
      meant that it never actually did anything other than drop the bio_count for
      sync callers.  So we remove its refcount decrement, don't call it from
      dio_bio_complete(), and hoist its call up into the async dio_bio_complete()
      caller after an explicit refcount decrement.  It is renamed dio_complete_aio()
      to reflect the remaining work it actually does.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0273201e
    • Z
      [PATCH] dio: call blk_run_address_space() once per op · 17a7b1d7
      Zach Brown committed
      We only need to call blk_run_address_space() once after all the bios for the
      direct IO op have been submitted.  This removes the chance of calling
      blk_run_address_space() after spurious wake ups as the sync path waits for
      bios to drain.  It's also one less difference between the sync and async paths.
      
      In the process we remove a redundant dio_bio_submit() that its caller had
      already performed.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      17a7b1d7
    • Z
      [PATCH] dio: centralize completion in dio_complete() · 6d544bb4
      Zach Brown committed
      There have been a lot of bugs recently due to the way direct_io_worker() tries
      to decide how to finish direct IO operations.  In the worst examples it has
      failed to call aio_complete() at all (hang) or called it too many times
      (oops).
      
      This set of patches cleans up the completion phase with the goal of removing
      the complexity that led to these bugs.  We end up with one path that
      calculates the result of the operation after all of the bios have completed.
      We decide when to generate a result of the operation using that path based on
      the final release of a refcount on the dio structure.
      
      I tried to progress towards the final state in steps that were relatively easy
      to understand.  Each step should compile but I only tested the final result of
      having all the patches applied.
      
      I've tested these on low end PC drives with aio-stress, the direct IO tests I
      could manage to get running in LTP, orasim, and some home-brew functional
      tests.
      
      In http://lkml.org/lkml/2006/9/21/103 IBM reports success with ext2 and ext3
      running DIO LTP tests.  They found that XFS bug which has since been addressed
      in the patch series.
      
      This patch:
      
      The mechanics which decide the result of a direct IO operation were duplicated
      in the sync and async paths.
      
      The async path didn't check page_errors which can manifest as silently
      returning success when the final pointer in an operation faults and its
      matching file region is filled with zeros.
      
      The sync path and async path differed in whether they passed errors to the
      caller's dio->end_io operation.  The async path was passing errors to it which
      trips an assertion in XFS, though it is apparently harmless.
      
      This centralizes the completion phase of dio ops in one place.  AIO will now
      return EFAULT consistently and all paths fall back to the previously sync
      behaviour of passing the number of bytes 'transferred' to the dio->end_io
      callback, regardless of errors.
      
      dio_await_completion() doesn't have to propagate EIO from non-uptodate bios
      now that it's being propagated through dio_complete() via dio->io_error.  This
      lets it return void which simplifies its sole caller.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6d544bb4
    • N
      [PATCH] md: assorted md and raid1 one-liners · 17571284
      NeilBrown committed
      Fix a few bugs that meant that:
        - superblocks weren't always written at exactly the right time (this
          could show up if the array was not written to - writing to the array
          causes lots of superblock updates and so hides these errors).
      
        - restarting device recovery after a clean shutdown (version-1 metadata
          only) didn't work as intended (or at all).
      
      1/ Ensure superblock is updated when a new device is added.
      2/ Remove an inappropriate test on MD_RECOVERY_SYNC in md_do_sync.
         The body of this if takes one of two branches depending on whether
         MD_RECOVERY_SYNC is set, so testing it in the clause of the if
         is wrong.
      3/ Flag superblock for updating after a resync/recovery finishes.
      4/ If we find the need to restart a recovery in the middle (version-1
         metadata only) make sure a full recovery (not just as guided by
         bitmaps) does get done.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      17571284
    • N
      [PATCH] md: return a non-zero error to bi_end_io as appropriate in raid5 · c2b00852
      NeilBrown committed
      Currently raid5 depends on clearing the BIO_UPTODATE flag to signal an error
      to higher levels.  While this should be sufficient, it is safer to explicitly
      set the error code as well - less room for confusion.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c2b00852
    • N
      [PATCH] md: remove some old ifdefed-out code from raid5.c · b8c6b645
      NeilBrown committed
      There are some vestiges of old code that was used for bypassing the stripe
      cache on reads in raid5.c.  This was never updated after the change from
      buffer_heads to bios, but was left as a reminder.
      
      That functionality has now been implemented in a completely different way, so
      the old code can go.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b8c6b645
    • J
      [PATCH] MD: conditionalize some code · fdee8ae4
      Jeff Garzik committed
      The autorun code is only used if this module is built into the static
      kernel image.  Adjust #ifdefs accordingly.
      Signed-off-by: Jeff Garzik <jeff@garzik.org>
      Acked-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fdee8ae4
    • N
      [PATCH] md: fix innocuous bug in raid6 stripe_to_pdidx · b875e531
      NeilBrown committed
      stripe_to_pdidx finds the index of the parity disk for a given stripe.  It
      assumes raid5 in that it uses "disks-1" to determine the number of data disks.
      
      This is incorrect for raid6 but fortunately the two usages cancel each other
      out.  The only way that 'data_disks' affects the calculation of pd_idx in
      raid5_compute_sector is when it is divided into the sector number.  But as
      that sector number is calculated by multiplying in the wrong value of
      'data_disks' the division produces the right value.
      
      So it is innocuous but needs to be fixed.
      
      Also change the calculation of raid_disks in compute_blocknr to make it
      more obviously correct (it seems at first to always use disks-1 too).
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b875e531
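The cancellation described in the stripe_to_pdidx commit above is plain integer arithmetic: a sector number built by multiplying a stripe number by some data-disk count yields the same stripe number when divided by that same count, whatever the count is. The sketch below is a simplified model, not the raid5.c code: `pd_idx_via_sector()` is a hypothetical helper, chunk sizes are ignored, and the parity rotation formula is an assumption chosen only to give the result a concrete shape.

```c
#include <stdint.h>

/*
 * Model of the round trip: multiply the stripe number by an assumed
 * data-disk count to form a sector number, divide it back out by the
 * same count, then derive a parity-disk index from the recovered stripe.
 */
static unsigned int pd_idx_via_sector(uint64_t stripe,
                                      unsigned int assumed_data_disks,
                                      unsigned int raid_disks)
{
    uint64_t sector = stripe * assumed_data_disks;        /* caller multiplies */
    uint64_t stripe_again = sector / assumed_data_disks;  /* callee divides    */

    /* Assumed rotation: parity walks backwards across the disks. */
    return (unsigned int)(raid_disks - 1 - (stripe_again % raid_disks));
}
```

For a six-disk raid6 array the correct data-disk count is 4, but passing 5 ("disks-1") gives the same parity index for every stripe, which is why the bug was harmless while both sides used the same wrong value.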
    • R
      [PATCH] md: enable bypassing cache for reads · 52488615
      Raz Ben-Jehuda(caro) committed
      Call the chunk_aligned_read where appropriate.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      52488615
    • R
      [PATCH] md: allow reads that have bypassed the cache to be retried on failure · 46031f9a
      Raz Ben-Jehuda(caro) committed
      If a bypass-the-cache read fails, we simply try again through the cache.  If
      it fails again it will trigger normal recovery procedures.
      
      update 1:
      
      From: NeilBrown <neilb@suse.de>
      
      1/
        chunk_aligned_read and retry_aligned_read assume that
            data_disks == raid_disks - 1
        which is not true for raid6.
        So when an aligned read request bypasses the cache, we can get the wrong data.
      
      2/ The cloned bio is being used-after-free in raid5_align_endio
         (to test BIO_UPTODATE).
      
      3/ We forgot to add rdev->data_offset when submitting
         a bio for aligned-read
      
      4/ clone_bio calls blk_recount_segments and then we change bi_bdev,
         so we need to invalidate the segment counts.
      
      5/ We don't de-reference the rdev when the read completes.
         This means we need to record the rdev so it is still
         available in the end_io routine.  Fortunately
         bi_next in the original bio is unused at this point so
         we can stuff it in there.
      
      6/ We leak a cloned bio if the target rdev is not usable.
      
      From: NeilBrown <neilb@suse.de>
      
      update 2:
      
      1/ When aligned requests fail (read error) they need to be retried
         via the normal method (stripe cache).  As we cannot be sure that
         we can process a single read in one go (we may not be able to
         allocate all the stripes needed) we store a bio-being-retried
         and a list of bios-that-still-need-to-be-retried.
         When we find a bio that needs to be retried, we should add it to
         the list, not to single-bio...
      
      2/ We were never incrementing 'scnt' when resubmitting failed
         aligned requests.
      
      [akpm@osdl.org: build fix]
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      46031f9a
    • R
      [PATCH] md: define raid5_mergeable_bvec · 23032a0e
      Raz Ben-Jehuda(caro) committed
      This will encourage read requests to be on only one device, so we will often be
      able to bypass the cache for read requests.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      23032a0e
    • N
      [PATCH] md: tidy up device-change notification when an md array is stopped · 0d4ca600
      NeilBrown committed
      An md array can be stopped leaving all the settings still in place, or it can
      be torn down and destroyed.  set_capacity and other change notifications only
      happen in the latter case, but should happen in both.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0d4ca600
    • P
      [PATCH] Fbdev driver for IBM GXT4500P videocards · a3d89983
      Paul Mackerras committed
      This is an fbdev driver for the IBM GXT4500P display card found in some IBM
      System P (pSeries) machines.  These cards have hardware 2D and 3D
      capabilities, but the driver does not use them; it just exports a dumb
      framebuffer.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Acked-by: James Simmons <jsimmons@infradead.org>
      Cc: "Antonino A. Daplas" <adaplas@pol.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a3d89983
    • A
      [PATCH] ide-cd: Handle strange interrupt on the Intel ESB2 · ee2f344b
      Alan Cox committed
      The ESB2 appears to emit spurious DMA interrupts when configured for native
      mode and handling ATAPI devices.  Stratus were able to pin this bug down and
      produce a patch.  This is a rework which applies the fixup only to the ESB2
      (for now).  We can apply it to other chips later if the same problem is found.
      
      This code has been tested and confirmed to fix the problem on the tested
      systems.
      Signed-off-by: Alan Cox <alan@redhat.com>
      (Most of the hard work done by Stratus however)
      Cc: Jens Axboe <axboe@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ee2f344b
    • M
      [PATCH] kernel/sched.c: whitespace cleanups · 33859f7f
      Miguel Ojeda Sandonis committed
      [akpm@osdl.org: additional cleanups]
      Signed-off-by: Miguel Ojeda Sandonis <maxextreme@gmail.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      33859f7f
    • C
      [PATCH] sched: optimize activate_task for RT task · 62ab616d
      Chen, Kenneth W committed
      RT tasks do not participate in interactiveness priority and thus shouldn't
      be bothered with timestamp and p->sleep_type manipulation when the task is
      being put on the run queue.  Bypass all of them with a single if (rt_task)
      test.
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      62ab616d
    • C
      [PATCH] sched: remove lb_stopbalance counter · 06066714
      Chen, Kenneth W committed
      Remove the scheduler stats lb_stopbalance counter.  This counter can be
      calculated by: lb_balanced - lb_nobusyg - lb_nobusyq.  There is no need to
      create a gazillion counters when we can derive the value.
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      06066714
    • S
      [PATCH] sched: decrease number of load balances · 783609c6
      Siddha, Suresh B committed
      Currently at a particular domain, each cpu in the sched group will do a
      load balance at the frequency of balance_interval.  The more cores and
      threads there are, the more cpus will be in each sched group at the SMP and
      NUMA domains, and we end up spending quite a bit of time doing load
      balancing in those domains.
      
      Fix this by making only one cpu (the first idle cpu, or the first cpu in the
      group if all the cpus are busy) in the sched group do the load balance at
      that particular sched domain.  This load will slowly percolate down to the
      other cpus within that group (when they do load balancing at lower
      domains).
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      783609c6
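The selection rule in the load-balance commit above (the first idle cpu in the group, otherwise the first cpu in the group) can be sketched as a small helper. This is an illustration under assumptions: `pick_balance_cpu()` is a hypothetical function, the group is given as a plain array of cpu ids, and idleness is looked up in a boolean array indexed by cpu id instead of the scheduler's own data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Choose the one cpu in a sched group that should run the load balance
 * for this domain: the first idle cpu, or the first cpu if all are busy.
 * Returns -1 for an empty group.
 */
static int pick_balance_cpu(const int *group_cpus, const bool *cpu_idle, size_t n)
{
    size_t i;

    if (n == 0)
        return -1;
    for (i = 0; i < n; i++) {
        if (cpu_idle[group_cpus[i]])
            return group_cpus[i];   /* first idle cpu wins */
    }
    return group_cpus[0];           /* everyone is busy: fall back to the first cpu */
}
```

Every other cpu in the group would compare its own id against the returned value and skip balancing at this domain when they differ.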
    • M
      [PATCH] sched: improve migration accuracy · b18ec803
      Mike Galbraith committed
      Co-opt rq->timestamp_last_tick to maintain a cache_hot_time evaluation
      reference timestamp at both tick and sched times to prevent said reference,
      formerly rq->timestamp_last_tick, from being behind task->last_ran at
      evaluation time, and to move said reference closer to current time on the
      remote processor, intent being to improve cache hot evaluation and
      timestamp adjustment accuracy for task migration.
      
      Fix minor sched_time double accounting error which occurs when a task
      passing through schedule() does not schedule off, and takes the next timer
      tick.
      
      [kenneth.w.chen@intel.com: cleanup]
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Ken Chen <kenneth.w.chen@intel.com>
      Cc: Don Mullis <dwm@meer.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b18ec803
    • C
      [PATCH] sched: add option to serialize load balancing · 08c183f3
      Christoph Lameter committed
      Large sched domains can be very expensive to scan.  Add an option SD_SERIALIZE
      to the sched domain flags.  If that flag is set then we make sure that no
      other such domain is being balanced.
      
      [akpm@osdl.org: build fix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      08c183f3
    • C
      [PATCH] sched: call tasklet less frequently · 1bd77f2d
      Christoph Lameter committed
      Trigger softirq less frequently
      
      We trigger the softirq before this patch using offset of sd->interval.
      However, if the queue is busy then it is sufficient to schedule the softirq
      with sd->interval * busy_factor.
      
      So we modify the calculation of the next time to balance by taking
      the interval added to last_balance again. This is only the
      right value if the idle/busy situation continues as is.
      
      There are two potential trouble spots:
      - If the queue was idle and now gets busy then we call rebalance
        early. However, that is not a problem because we will then use
        the longer interval for the next period.
      
      - If the queue was busy and becomes idle then we potentially
        wait too long before rebalancing. However, when the task
        goes idle then idle_balance is called. We add another calculation
        of the next balance time based on sd->interval in idle_balance
        so that we will rebalance soon.
      
      V2->V3:
      - Calculate rebalance time based on current jiffies and not
        based on the jiffies at the last time we load balanced.
        We no longer rely on staggering and therefore we can
        afford to do this now.
      
      V3->V4:
      - Use functions to do jiffy comparisons.
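
      As a rough illustration of the calculation described above, the sketch
      below stretches the interval by busy_factor while the CPU is busy, counts
      from the current tick, and compares ticks with a wraparound-safe helper.
      It is a userspace approximation under stated assumptions: the struct
      layout, `tick_after()` and `next_balance_time()` are hypothetical names,
      with `tick_after()` written in the style of the kernel's time_after().

```c
#include <assert.h>
#include <stdbool.h>

/* Wraparound-safe "a is after b", in the style of the kernel's time_after(). */
bool tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

struct domain {
    unsigned long interval;     /* base balance interval, in ticks */
    unsigned int busy_factor;   /* multiplier applied while the CPU is busy */
};

/* Next tick at which this domain should be balanced, counted from "now". */
unsigned long next_balance_time(const struct domain *sd, unsigned long now,
                                bool busy)
{
    assert(sd->interval > 0);
    unsigned long interval = sd->interval;

    if (busy)
        interval *= sd->busy_factor;    /* busy queues are rescanned less often */
    return now + interval;
}
```

      Comparing through a helper rather than with a plain `>` matters because
      the tick counter eventually wraps; the signed difference still orders two
      nearby ticks correctly across the wrap.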
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1bd77f2d
    • C
      [PATCH] sched: use softirq for load balancing · c9819f45
      Christoph Lameter committed
      Call rebalance_tick (renamed to run_rebalance_domains) from a newly introduced
      softirq.
      
      We calculate the earliest time for each layer of sched domains to be rescanned
      (this is the rescan time for idle) and use the earliest of those to schedule
      the softirq via a new field "next_balance" added to struct rq.
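
      The "earliest of those" rule can be sketched as a minimum over the
      per-level due times, stored in the runqueue and checked on each tick.
      This is a simplified userspace model, not the patch itself: apart from
      the next_balance field named above, the struct layouts and the helpers
      `tick_before()`, `update_next_balance()` and `balance_due()` are
      assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct domain {
    unsigned long last_balance; /* tick at which this level was last balanced */
    unsigned long interval;     /* idle rescan interval, in ticks */
};

struct runqueue {
    unsigned long next_balance; /* earliest tick at which any level is due */
};

/* Wraparound-safe "a is before b". */
bool tick_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/* Record the earliest due time across all domain levels of this runqueue. */
void update_next_balance(struct runqueue *rq, const struct domain *levels,
                         size_t n)
{
    assert(n > 0);
    unsigned long earliest = levels[0].last_balance + levels[0].interval;

    for (size_t i = 1; i < n; i++) {
        unsigned long due = levels[i].last_balance + levels[i].interval;
        if (tick_before(due, earliest))
            earliest = due;
    }
    rq->next_balance = earliest;
}

/* Checked on every tick: the softirq is raised only once the deadline is hit. */
bool balance_due(const struct runqueue *rq, unsigned long now)
{
    return !tick_before(now, rq->next_balance);
}
```

      With this shape the per-tick cost is a single comparison; the walk over
      the domain levels happens only inside the softirq handler.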
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c9819f45
    • C
      [PATCH] sched: move idle status calculation into rebalance_tick() · e418e1c2
      Christoph Lameter committed
      Perform the idle state determination in rebalance_tick.
      
      If we separate balancing from sched_tick then we also need to determine the
      idle state in rebalance_tick.
      
      V2->V3
      	Remove useless idle != 0 check. Checking nr_running seems
      	to be sufficient. Thanks Suresh.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e418e1c2
    • C
      [PATCH] sched: extract load calculation from rebalance_tick · 7835b98b
      Christoph Lameter committed
      A load calculation is always done in rebalance_tick() in addition to the real
      load balancing activities that only take place when certain jiffy counts have
      been reached.  Move that processing into a separate function and call it
      directly from scheduler_tick().
      
      Also extract the time slice handling from scheduler_tick and put it into a
      separate function.  Then we can clean up scheduler_tick significantly.  It
      will no longer have any gotos.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      7835b98b
    • C
      [PATCH] sched: disable interrupts for locking in load_balance() · fe2eea3f
      Christoph Lameter committed
      Interrupts must be disabled for request queue locks if we want to run
      load_balance() with interrupts enabled.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fe2eea3f
    • C
      [PATCH] sched: remove staggering of load balancing · 4211a9a2
      Christoph Lameter committed
      Timer interrupts already are staggered.  We do not need an additional layer of
      time staggering for short load balancing actions that take a reasonably small
      portion of the time slice.
      
      For load balancing on large sched_domains we will add a serialization later
      that avoids concurrent load balance operations and thus has the same effect as
      load staggering.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4211a9a2
    • C
      [PATCH] sched: avoid taking rq lock in wake_priority_sleeper · 571f6d2f
      Christoph Lameter committed
      Avoid taking the request queue lock in wake_priority_sleeper if there are no
      running processes.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      571f6d2f
    • S
      [PATCH] sched domain: increase the SMT busy rebalance interval · ac7d5504
      Siddha, Suresh B committed
      With SMT, if the logical processor is busy, load balancing happens every
      8 msec (min) to 16 msec (max).  There is no need to do this so often, as it
      is only for fairness (to maintain uniform runqueue lengths) and the default
      time slice is 100 msec anyway.
      
      This patch increases the interval to 64 msec (min) to 128 msec (max) when the
      logical processor is busy.
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ac7d5504
    • K
      [PATCH] move_task_off_dead_cpu() should be called with disabled ints · 054b9108
      Kirill Korotaev committed
      move_task_off_dead_cpu() requires interrupts to be disabled, but
      migrate_dead() calls it with interrupts enabled.  Add appropriate
      comments to the functions and add BUG_ON(!irqs_disabled()) to
      double_rq_lock() and double_lock_balance(), which are the original sources
      of such bugs.
      Signed-off-by: Kirill Korotaev <dev@openvz.org>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      054b9108
    • S
      [PATCH] sched domain: move sched group allocations to percpu area · 6711cab4
      Siddha, Suresh B committed
      Move the sched group allocations to the percpu area.  This minimizes
      cross-node memory references and also cleans up the sched group allocation
      for the allnodes sched domain.
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6711cab4
    • R