1. 09 10月, 2008 26 次提交
    • K
      block: add lld busy state exporting interface · ef9e3fac
      Kiyoshi Ueda 提交于
      This patch adds an new interface, blk_lld_busy(), to check lld's
      busy state from the block layer.
      blk_lld_busy() calls down into low-level drivers for the checking
      if the drivers set q->lld_busy_fn() using blk_queue_lld_busy().
      
      This resolves a performance problem on request stacking devices below.
      
      Some drivers like scsi mid layer stop dispatching request when
      they detect busy state on its low-level device like host/target/device.
      It allows other requests to stay in the I/O scheduler's queue
      for a chance of merging.
      
      Request stacking drivers like request-based dm should follow
      the same logic.
      However, there is no generic interface for the stacked device
      to check if the underlying device(s) are busy.
      If the request stacking driver dispatches and submits requests to
      the busy underlying device, the requests will stay in
      the underlying device's queue without a chance of merging.
      This causes performance problem on burst I/O load.
      
      With this patch, busy state of the underlying device is exported
      via q->lld_busy_fn().  So the request stacking driver can check it
      and stop dispatching requests if busy.
      
      The underlying device driver must return the busy state appropriately:
          1: when the device driver can't process requests immediately.
          0: when the device driver can process requests immediately,
             including abnormal situations where the device driver needs
             to kill all requests.
      Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      ef9e3fac
    • E
      block: Fix blk_start_queueing() to not kick a stopped queue · 336c3d8c
      Elias Oltmanns 提交于
      blk_start_queueing() should act like the generic queue unplugging
      and kicking and ignore a stopped queue. Such a queue may not be
      run until after a call to blk_start_queue().
      Signed-off-by: NElias Oltmanns <eo@nebensachen.de>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      336c3d8c
    • K
      block: add a queue flag for request stacking support · 4ee5eaf4
      Kiyoshi Ueda 提交于
      This patch adds a queue flag to indicate the block device can be
      used for request stacking.
      
      Request stacking drivers need to stack their devices on top of
      only devices of which q->request_fn is functional.
      Since bio stacking drivers (e.g. md, loop) basically initialize
      their queue using blk_alloc_queue() and don't set q->request_fn,
      the check of (q->request_fn == NULL) looks enough for that purpose.
      
      However, dm will become both types of stacking driver (bio-based and
      request-based).  And dm will always set q->request_fn even if the dm
      device is bio-based of which q->request_fn is not functional actually.
      So we need something else to distinguish the type of the device.
      Adding a queue flag is a solution for that.
      
      The reason why dm always sets q->request_fn is to keep
      the compatibility of dm user-space tools.
      Currently, all dm user-space tools are using bio-based dm without
      specifying the type of the dm device they use.
      To use request-based dm without changing such tools, the kernel
      must decide the type of the dm device automatically.
      The automatic type decision can't be done at the device creation time
      and needs to be deferred until such tools load a mapping table,
      since the actual type is decided by dm target type included in
      the mapping table.
      
      So a dm device has to be initialized using blk_init_queue()
      so that we can load either type of table.
      Then, all queue stuffs are set (e.g. q->request_fn) and we have
      no element to distinguish that it is bio-based or request-based,
      even after a table is loaded and the type of the device is decided.
      
      By the way, some stuffs of the queue (e.g. request_list, elevator)
      are needless when the dm device is used as bio-based.
      But the memory size is not so large (about 20[KB] per queue on ia64),
      so I hope the memory loss can be acceptable for bio-based dm users.
      Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      4ee5eaf4
    • K
      block: add request submission interface · 82124d60
      Kiyoshi Ueda 提交于
      This patch adds blk_insert_cloned_request(), a generic request
      submission interface for request stacking drivers.
      Request-based dm will use it to submit their clones to underlying
      devices.
      
      blk_rq_check_limits() is also added because it is possible that
      the lower queue has stronger limitations than the upper queue
      if multiple drivers are stacking at request-level.
      Not only for blk_insert_cloned_request()'s internal use, the function
      will be used by request-based dm when the queue limitation is
      modified (e.g. by replacing dm's table).
      Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      82124d60
    • K
      block: add request update interface · 32fab448
      Kiyoshi Ueda 提交于
      This patch adds blk_update_request(), which updates struct request
      with completing its data part, but doesn't complete the struct
      request itself.
      Though it looks like end_that_request_first() of older kernels,
      blk_update_request() should be used only by request stacking drivers.
      
      Request-based dm will use it in bio->bi_end_io callback to update
      the original request when a data part of a cloned request completes.
      Followings are additional background information of why request-based
      dm needs this interface.
      
        - Request stacking drivers can't use blk_end_request() directly from
          the lower driver's completion context (bio->bi_end_io or rq->end_io),
          because some device drivers (e.g. ide) may try to complete
          their request with queue lock held, and it may cause deadlock.
          See below for detailed description of possible deadlock:
          <http://marc.info/?l=linux-kernel&m=120311479108569&w=2>
      
        - To solve that, request-based dm offloads the completion of
          cloned struct request to softirq context (i.e. using
          blk_complete_request() from rq->end_io).
      
        - Though it is possible to use the same solution from bio->bi_end_io,
          it will delay the notification of bio completion to the original
          submitter.  Also, it will cause inefficient partial completion,
          because the lower driver can't perform the cloned request anymore
          and request-based dm needs to requeue and redispatch it to
          the lower driver again later.  That's not good.
      
        - So request-based dm needs blk_update_request() to perform the bio
          completion in the lower driver's completion context, which is more
          efficient.
      Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      32fab448
    • J
      block: blk_cleanup_queue() should call blk_sync_queue() · e3335de9
      Jens Axboe 提交于
      When a driver calls blk_cleanup_queue(), the device should be fully idle.
      However, the block layer may have pending plugging timers and the IO
      schedulers may have pending work in the work queues. So quisce the device
      by waiting for the timer and flushing the work queues.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      e3335de9
    • J
      block: unify request timeout handling · 242f9dcb
      Jens Axboe 提交于
      Right now SCSI and others do their own command timeout handling.
      Move those bits to the block layer.
      
      Instead of having a timer per command, we try to be a bit more clever
      and simply have one per-queue. This avoids the overhead of having to
      tear down and setup a timer for each command, so it will result in a lot
      less timer fiddling.
      Signed-off-by: NMike Anderson <andmike@linux.vnet.ibm.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      242f9dcb
    • J
      block: update comment on end_request() · 839e96af
      Jens Axboe 提交于
      It refers to functions that no longer exist after the IO completion
      changes.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      839e96af
    • J
      block: don't use bio_has_data() in the completion path · 60540161
      Jens Axboe 提交于
      We should just check for rq->bio, as that is really the information
      we are looking for. Even if the bio attached doesn't carry data,
      we still need to do IO post processing on it.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      60540161
    • J
      block: inherit CPU completion on bio->rq and rq->rq merges · ab780f1e
      Jens Axboe 提交于
      Somewhat incomplete, as we do allow merges of requests and bios
      that have different completion CPUs given. This is done on the
      assumption that a larger IO is still more beneficial than CPU
      locality.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      ab780f1e
    • J
      block: add support for IO CPU affinity · c7c22e4d
      Jens Axboe 提交于
      This patch adds support for controlling the IO completion CPU of
      either all requests on a queue, or on a per-request basis. We export
      a sysfs variable (rq_affinity) which, if set, migrates completions
      of requests to the CPU that originally submitted it. A bio helper
      (bio_set_completion_cpu()) is also added, so that queuers can ask
      for completion on that specific CPU.
      
      In testing, this has been show to cut the system time by as much
      as 20-40% on synthetic workloads where CPU affinity is desired.
      
      This requires a little help from the architecture, so it'll only
      work as designed for archs that are using the new generic smp
      helper infrastructure.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      c7c22e4d
    • J
      block: make kblockd_schedule_work() take the queue as parameter · 18887ad9
      Jens Axboe 提交于
      Preparatory patch for checking queuing affinity.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      18887ad9
    • J
      block: split softirq handling into blk-softirq.c · b646fc59
      Jens Axboe 提交于
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      b646fc59
    • T
      block: move stats from disk to part0 · 074a7aca
      Tejun Heo 提交于
      Move stats related fields - stamp, in_flight, dkstats - from disk to
      part0 and unify stat handling such that...
      
      * part_stat_*() now updates part0 together if the specified partition
        is not part0.  ie. part_stat_*() are now essentially all_stat_*().
      
      * {disk|all}_stat_*() are gone.
      
      * part_round_stats() is updated similary.  It handles part0 stats
        automatically and disk_round_stats() is killed.
      
      * part_{inc|dec}_in_fligh() is implemented which automatically updates
        part0 stats for parts other than part0.
      
      * disk_map_sector_rcu() is updated to return part0 if no part matches.
        Combined with the above changes, this makes NULL special case
        handling in callers unnecessary.
      
      * Separate stats show code paths for disk are collapsed into part
        stats show code paths.
      
      * Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
      
      While at it, reposition stat handling macros a bit and add missing
      parentheses around macro parameters.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      074a7aca
    • T
      block: kill GENHD_FL_FAIL and use part0->make_it_fail · eddb2e26
      Tejun Heo 提交于
      GENHD_FL_FAIL for disk is what make_it_fail is for parts.  Kill it and
      use part0->make_it_fail.  Sysfs node handling is unified too.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      eddb2e26
    • T
      block: always set bdev->bd_part · 0762b8bd
      Tejun Heo 提交于
      Till now, bdev->bd_part is set only if the bdev was for parts other
      than part0.  This patch makes bdev->bd_part always set so that code
      paths don't have to differenciate common handling.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      0762b8bd
    • T
      block: fix diskstats access · c9959059
      Tejun Heo 提交于
      There are two variants of stat functions - ones prefixed with double
      underbars which don't care about preemption and ones without which
      disable preemption before manipulating per-cpu counters.  It's unclear
      whether the underbarred ones assume that preemtion is disabled on
      entry as some callers don't do that.
      
      This patch unifies diskstats access by implementing disk_stat_lock()
      and disk_stat_unlock() which take care of both RCU (for partition
      access) and preemption (for per-cpu counter access).  diskstats access
      should always be enclosed between the two functions.  As such, there's
      no need for the versions which disables preemption.  They're removed
      and double underbars ones are renamed to drop the underbars.  As an
      extra argument is added, there's no danger of using the old version
      unconverted.
      
      disk_stat_lock() uses get_cpu() and returns the cpu index and all
      diskstat functions which access per-cpu counters now has @cpu
      argument to help RT.
      
      This change adds RCU or preemption operations at some places but also
      collapses several preemption ops into one at others.  Overall, the
      performance difference should be negligible as all involved ops are
      very lightweight per-cpu ones.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      c9959059
    • T
      block: fix disk->part[] dereferencing race · e71bf0d0
      Tejun Heo 提交于
      disk->part[] is protected by its matching bdev's lock.  However,
      non-critical accesses like collecting stats and printing out sysfs and
      proc information used to be performed without any locking.  As
      partitions can come and go dynamically, partitions can go away
      underneath those non-critical accesses.  As some of those accesses are
      writes, this theoretically can lead to silent corruption.
      
      This patch fixes the race by using RCU for the partition array and dev
      reference counter to hold partitions.
      
      * Rename disk->part[] to disk->__part[] to make sure no one outside
        genhd layer proper accesses it directly.
      
      * Use RCU for disk->__part[] dereferencing.
      
      * Implement disk_{get|put}_part() which can be used to get and put
        partitions from gendisk respectively.
      
      * Iterators are implemented to help iterate through all partitions
        safely.
      
      * Functions which require RCU readlock are marked with _rcu suffix.
      
      * Use disk_put_part() in __blkdev_put() instead of directly putting
        the contained kobject.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      e71bf0d0
    • T
      block: misc updates · 310a2c10
      Tejun Heo 提交于
      This patch makes the following misc updates in preparation for
      disk->part dereference fix and extended block devt support.
      
      * implment part_to_disk()
      
      * fix comment about gendisk->part indexing
      
      * rename get_part() to disk_map_sector()
      
      * don't use n which is always zero while printing disk information in
        diskstats_show()
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      310a2c10
    • R
      Add some block/ source files to the kernel-api docbook. Fix kernel-doc... · 710027a4
      Randy Dunlap 提交于
      Add some block/ source files to the kernel-api docbook. Fix kernel-doc notation in them as needed. Fix changed function parameter names. Fix typos/spellos. In comments, change REQ_SPECIAL to REQ_TYPE_SPECIAL and REQ_BLOCK_PC to REQ_TYPE_BLOCK_PC.
      Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      710027a4
    • M
      drop vmerge accounting · 5df97b91
      Mikulas Patocka 提交于
      Remove hw_segments field from struct bio and struct request. Without virtual
      merge accounting they have no purpose.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      5df97b91
    • D
      Allow elevators to sort/merge discard requests · e17fc0a1
      David Woodhouse 提交于
      But blkdev_issue_discard() still emits requests which are interpreted as
      soft barriers, because naïve callers might otherwise issue subsequent
      writes to those same sectors, which might cross on the queue (if they're
      reallocated quickly enough).
      
      Callers still _can_ issue non-barrier discard requests, but they have to
      take care of queue ordering for themselves.
      Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      e17fc0a1
    • D
      Add 'discard' request handling · fb2dce86
      David Woodhouse 提交于
      Some block devices benefit from a hint that they can forget the contents
      of certain sectors. Add basic support for this to the block core, along
      with a 'blkdev_issue_discard()' helper function which issues such
      requests.
      
      The caller doesn't get to provide an end_io functio, since
      blkdev_issue_discard() will automatically split the request up into
      multiple bios if appropriate. Neither does the function wait for
      completion -- it's expected that callers won't care about when, or even
      _if_, the request completes. It's only a hint to the device anyway. By
      definition, the file system doesn't _care_ about these sectors any more.
      
      [With feedback from OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> and
      Jens Axboe <jens.axboe@oracle.com]
      Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      fb2dce86
    • D
    • J
      block: use bio_has_data() in the IO completion path · 051cc395
      Jens Axboe 提交于
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      051cc395
    • J
      a9c701e5
  2. 27 8月, 2008 1 次提交
    • F
      block: move cmdfilter from gendisk to request_queue · abf54393
      FUJITA Tomonori 提交于
      cmd_filter works only for the block layer SG_IO with SCSI block
      devices. It breaks scsi/sg.c, bsg, and the block layer SG_IO with SCSI
      character devices (such as st). We hit a kernel crash with them.
      
      The problem is that cmd_filter code accesses to gendisk (having struct
      blk_scsi_cmd_filter) via inode->i_bdev->bd_disk. It works for only
      SCSI block device files. With character device files, inode->i_bdev
      leads you to struct cdev. inode->i_bdev->bd_disk->blk_scsi_cmd_filter
      isn't safe.
      
      SCSI ULDs don't expose gendisk; they keep it private. bsg needs to be
      independent on any protocols. We shouldn't change ULDs to expose their
      gendisk.
      
      This patch moves struct blk_scsi_cmd_filter from gendisk to
      request_queue, a common object, which eveyone can access to.
      
      The user interface doesn't change; users can change the filters via
      /sys/block/. gendisk has a pointer to request_queue so the cmd_filter
      code accesses to struct blk_scsi_cmd_filter.
      Signed-off-by: NFUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      abf54393
  3. 02 8月, 2008 1 次提交
  4. 16 7月, 2008 1 次提交
  5. 03 7月, 2008 2 次提交
  6. 28 5月, 2008 1 次提交
  7. 25 5月, 2008 1 次提交
    • C
      Remove argument from open_softirq which is always NULL · 962cf36c
      Carlos R. Mafra 提交于
      As git-grep shows, open_softirq() is always called with the last argument
      being NULL
      
      block/blk-core.c:       open_softirq(BLOCK_SOFTIRQ, blk_done_softirq, NULL);
      kernel/hrtimer.c:       open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq, NULL);
      kernel/rcuclassic.c:    open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
      kernel/rcupreempt.c:    open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
      kernel/sched.c: open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
      kernel/softirq.c:       open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
      kernel/softirq.c:       open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
      kernel/timer.c: open_softirq(TIMER_SOFTIRQ, run_timer_softirq, NULL);
      net/core/dev.c: open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);
      net/core/dev.c: open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);
      
      This observation has already been made by Matthew Wilcox in June 2002
      (http://www.cs.helsinki.fi/linux/linux-kernel/2002-25/0687.html)
      
      "I notice that none of the current softirq routines use the data element
      passed to them."
      
      and the situation hasn't changed since them. So it appears we can safely
      remove that extra argument to save 128 (54) bytes of kernel data (text).
      Signed-off-by: NCarlos R. Mafra <crmafra@ift.unesp.br>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      962cf36c
  8. 15 5月, 2008 1 次提交
    • N
      Remove blkdev warning triggered by using md · e7e72bf6
      Neil Brown 提交于
      As setting and clearing queue flags now requires that we hold a spinlock
      on the queue, and as blk_queue_stack_limits is called without that lock,
      get the lock inside blk_queue_stack_limits.
      
      For blk_queue_stack_limits to be able to find the right lock, each md
      personality needs to set q->queue_lock to point to the appropriate lock.
      Those personalities which didn't previously use a spin_lock, us
      q->__queue_lock.  So always initialise that lock when allocated.
      
      With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
      longer cause warnings as it will be clear that the proper lock is held.
      
      Thanks to Dan Williams for review and fixing the silly bugs.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Alistair John Strachan <alistair@devzero.co.uk>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Jacek Luczak <difrost.kernel@gmail.com>
      Cc: Prakash Punnoor <prakash@punnoor.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7e72bf6
  9. 07 5月, 2008 2 次提交
    • J
      block: avoid duplicate calls to get_part() in disk stat code · 28f13702
      Jens Axboe 提交于
      get_part() is fairly expensive, as it O(N) loops over partitions
      to find the right one. In lots of normal IO paths we end up looking
      up the partition twice, to make matters even worse. Change the
      stat add code to accept a passed in partition instead.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      28f13702
    • J
      block: optimize generic_unplug_device() · dbaf2c00
      Jens Axboe 提交于
      Original patch from Mikulas Patocka <mpatocka@redhat.com>
      
      Mike Anderson was doing an OLTP benchmark on a computer with 48 physical
      disks mapped to one logical device via device mapper.
      
      He found that there was a slowdown on request_queue->lock in function
      generic_unplug_device. The slowdown is caused by the fact that when some
      code calls unplug on the device mapper, device mapper calls unplug on all
      physical disks. These unplug calls take the lock, find that the queue is
      already unplugged, release the lock and exit.
      
      With the below patch, performance of the benchmark was increased by 18%
      (the whole OLTP application, not just block layer microbenchmarks).
      
      So I'm submitting this patch for upstream. I think the patch is correct,
      because when more threads call simultaneously plug and unplug, it is
      unspecified, if the queue is or isn't plugged (so the patch can't make
      this worse). And the caller that plugged the queue should unplug it
      anyway. (if it doesn't, there's 3ms timeout).
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      dbaf2c00
  10. 01 5月, 2008 1 次提交
  11. 29 4月, 2008 3 次提交