1. 24 7月, 2011 1 次提交
    • D
      block: strict rq_affinity · 5757a6d7
      Dan Williams 提交于
      Some systems benefit from completions always being steered to the strict
      requester cpu rather than the looser "per-socket" steering that
      blk_cpu_to_group() attempts by default. This is because the first
      CPU in the group mask ends up being completely overloaded with work,
      while the others (including the original submitter) has power left
      to spare.
      
      Allow the strict mode to be set by writing '2' to the sysfs control
      file. This is identical to the scheme used for the nomerges file,
      where '2' is a more aggressive setting than just being turned on.
      
      echo 2 > /sys/block/<bdev>/queue/rq_affinity
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Roland Dreier <roland@purestorage.com>
      Tested-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      5757a6d7
  2. 14 7月, 2011 1 次提交
  3. 08 7月, 2011 2 次提交
    • S
      block: document blk_plug list access · 316cc67d
      Shaohua Li 提交于
      I'm often confused why not disable preempt when changing blk_plug list. It
      would be better to add comments here in case others have the similar concerns.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      316cc67d
    • S
      block: avoid building too big plug list · 55c022bb
      Shaohua Li 提交于
      When I test fio script with big I/O depth, I found the total throughput drops
      compared to some relative small I/O depth. The reason is the thread accumulates
      big requests in its plug list and causes some delays (surely this depends
      on CPU speed).
      I thought we'd better have a threshold for requests. When a threshold reaches,
      this means there is no request merge and queue lock contention isn't severe
      when pushing per-task requests to queue, so the main advantages of blk plug
      don't exist. We can force a plug list flush in this case.
      With this, my test throughput actually increases and almost equals to small
      I/O depth. Another side effect is irq off time decreases in blk_flush_plug_list()
      for big I/O depth.
      The BLK_MAX_REQUEST_COUNT is choosen arbitarily, but 16 is efficiently to
      reduce lock contention to me. But I'm open here, 32 is ok in my test too.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      55c022bb
  4. 13 6月, 2011 1 次提交
  5. 31 5月, 2011 1 次提交
  6. 18 5月, 2011 1 次提交
  7. 07 5月, 2011 2 次提交
    • S
      block: hold queue if flush is running for non-queueable flush drive · 3ac0cc45
      shaohua.li@intel.com 提交于
      In some drives, flush requests are non-queueable. When flush request is
      running, normal read/write requests can't run. If block layer dispatches
      such request, driver can't handle it and requeue it.  Tejun suggested we
      can hold the queue when flush is running. This can avoid unnecessary
      requeue.  Also this can improve performance. For example, we have
      request flush1, write1, flush 2. flush1 is dispatched, then queue is
      hold, write1 isn't inserted to queue. After flush1 is finished, flush2
      will be dispatched. Since disk cache is already clean, flush2 will be
      finished very soon, so looks like flush2 is folded to flush1.
      
      In my test, the queue holding completely solves a regression introduced by
      commit 53d63e6b:
      
          block: make the flush insertion use the tail of the dispatch list
      
          It's not a preempt type request, in fact we have to insert it
          behind requests that do specify INSERT_FRONT.
      
      which causes about 20% regression running a sysbench fileio
      workload.
      
      Stable: 2.6.39 only
      
      Cc: stable@kernel.org
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      3ac0cc45
    • S
      block: add a non-queueable flush flag · f3876930
      shaohua.li@intel.com 提交于
      flush request isn't queueable in some drives. Add a flag to let driver
      notify block layer about this. We can optimize flush performance with the
      knowledge.
      
      Stable: 2.6.39 only
      
      Cc: stable@kernel.org
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      f3876930
  8. 19 4月, 2011 1 次提交
    • J
      block: get rid of QUEUE_FLAG_REENTER · c21e6beb
      Jens Axboe 提交于
      We are currently using this flag to check whether it's safe
      to call into ->request_fn(). If it is set, we punt to kblockd.
      But we get a lot of false positives and excessive punts to
      kblockd, which hurts performance.
      
      The only real abuser of this infrastructure is SCSI. So export
      the async queue run and convert SCSI over to use that. There's
      room for improvement in that SCSI need not always use the async
      call, but this fixes our performance issue and they can fix that
      up in due time.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      c21e6beb
  9. 18 4月, 2011 3 次提交
  10. 16 4月, 2011 1 次提交
    • J
      block: let io_schedule() flush the plug inline · a237c1c5
      Jens Axboe 提交于
      Linus correctly observes that the most important dispatch cases
      are now done from kblockd, this isn't ideal for latency reasons.
      The original reason for switching dispatches out-of-line was to
      avoid too deep a stack, so by _only_ letting the "accidental"
      flush directly in schedule() be guarded by offload to kblockd,
      we should be able to get the best of both worlds.
      
      So add a blk_schedule_flush_plug() that offloads to kblockd,
      and only use that from the schedule() path.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      a237c1c5
  11. 15 4月, 2011 2 次提交
  12. 12 4月, 2011 1 次提交
  13. 06 4月, 2011 1 次提交
    • M
      dm: improve block integrity support · a63a5cf8
      Mike Snitzer 提交于
      The current block integrity (DIF/DIX) support in DM is verifying that
      all devices' integrity profiles match during DM device resume (which
      is past the point of no return).  To some degree that is unavoidable
      (stacked DM devices force this late checking).  But for most DM
      devices (which aren't stacking on other DM devices) the ideal time to
      verify all integrity profiles match is during table load.
      
      Introduce the notion of an "initialized" integrity profile: a profile
      that was blk_integrity_register()'d with a non-NULL 'blk_integrity'
      template.  Add blk_integrity_is_initialized() to allow checking if a
      profile was initialized.
      
      Update DM integrity support to:
      - check all devices with _initialized_ integrity profiles match
        during table load; uninitialized profiles (e.g. for underlying DM
        device(s) of a stacked DM device) are ignored.
      - disallow a table load that would result in an integrity profile that
        conflicts with a DM device's existing (in-use) integrity profile
      - avoid clearing an existing integrity profile
      - validate all integrity profiles match during resume; but if they
        don't all we can do is report the mismatch (during resume we're past
        the point of no return)
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      a63a5cf8
  14. 12 3月, 2011 1 次提交
  15. 10 3月, 2011 3 次提交
    • J
      block: remove per-queue plugging · 7eaceacc
      Jens Axboe 提交于
      Code has been converted over to the new explicit on-stack plugging,
      and delay users have been converted to use the new API for that.
      So lets kill off the old plugging along with aops->sync_page().
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      7eaceacc
    • J
      block: initial patch for on-stack per-task plugging · 73c10101
      Jens Axboe 提交于
      This patch adds support for creating a queuing context outside
      of the queue itself. This enables us to batch up pieces of IO
      before grabbing the block device queue lock and submitting them to
      the IO scheduler.
      
      The context is created on the stack of the process and assigned in
      the task structure, so that we can auto-unplug it if we hit a schedule
      event.
      
      The current queue plugging happens implicitly if IO is submitted to
      an empty device, yet callers have to remember to unplug that IO when
      they are going to wait for it. This is an ugly API and has caused bugs
      in the past. Additionally, it requires hacks in the vm (->sync_page()
      callback) to handle that logic. By switching to an explicit plugging
      scheme we make the API a lot nicer and can get rid of the ->sync_page()
      hack in the vm.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      73c10101
    • J
      block: add API for delaying work/request_fn a little bit · 3cca6dc1
      Jens Axboe 提交于
      Currently we use plugging for that, but as plugging is going away,
      we need an alternative mechanism.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      3cca6dc1
  16. 03 3月, 2011 1 次提交
    • V
      block: Move blk_throtl_exit() call to blk_cleanup_queue() · da527770
      Vivek Goyal 提交于
      Move blk_throtl_exit() in blk_cleanup_queue() as blk_throtl_exit() is
      written in such a way that it needs queue lock. In blk_release_queue()
      there is no gurantee that ->queue_lock is still around.
      
      Initially blk_throtl_exit() was in blk_cleanup_queue() but Ingo reported
      one problem.
      
        https://lkml.org/lkml/2010/10/23/86
      
        And a quick fix moved blk_throtl_exit() to blk_release_queue().
      
              commit 7ad58c02
              Author: Jens Axboe <jaxboe@fusionio.com>
              Date:   Sat Oct 23 20:40:26 2010 +0200
      
              block: fix use-after-free bug in blk throttle code
      
      This patch reverts above change and does not try to shutdown the
      throtl work in blk_sync_queue(). By avoiding call to
      throtl_shutdown_timer_wq() from blk_sync_queue(), we should also avoid
      the problem reported by Ingo.
      
      blk_sync_queue() seems to be used only by md driver and it seems to be
      using it to make sure q->unplug_fn is not called as md registers its
      own unplug functions and it is about to free up the data structures
      used by unplug_fn(). Block throttle does not call back into unplug_fn()
      or into md. So there is no need to cancel blk throttle work.
      
      In fact I think cancelling block throttle work is bad because it might
      happen that some bios are throttled and scheduled to be dispatched later
      with the help of pending work and if work is cancelled, these bios might
      never be dispatched.
      
      Block layer also uses blk_sync_queue() during blk_cleanup_queue() and
      blk_release_queue() time. That should be safe as we are also calling
      blk_throtl_exit() which should make sure all the throttling related
      data structures are cleaned up.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      da527770
  17. 02 3月, 2011 2 次提交
    • T
      block: add @force_kblockd to __blk_run_queue() · 1654e741
      Tejun Heo 提交于
      __blk_run_queue() automatically either calls q->request_fn() directly
      or schedules kblockd depending on whether the function is recursed.
      blk-flush implementation needs to be able to explicitly choose
      kblockd.  Add @force_kblockd.
      
      All the current users are converted to specify %false for the
      parameter and this patch doesn't introduce any behavior change.
      
      stable: This is prerequisite for fixing ide oops caused by the new
              blk-flush implementation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      1654e741
    • V
      blk-throttle: Do not use kblockd workqueue for throtl work · 450adcbe
      Vivek Goyal 提交于
      o Dominik Klein reported a system hang issue while doing some blkio
        throttling testing.
      
        https://lkml.org/lkml/2011/2/24/173
      
      o Some tracing revealed that CFQ was not dispatching any more jobs as
        queue unplug was not happening. And queue unplug was not happening
        because unplug work was not being called as there was one throttling
        work on same cpu which as not finished yet. And throttling work had not
        finished as it was tyring to dispatch a bio to CFQ but all the request
        descriptors were consume to it was put to sleep.
      
      o So basically it is a cyclic dependecny between CFQ unplug work and
        throtl dispatch work. Tejun suggested that use separate workqueue for
        such cases.
      
      o This patch uses a separate workqueue for throttle related work and
        does not rely on kblockd workqueue anymore.
      
      Cc: stable@kernel.org
      Reported-by: NDominik Klein <dk@in-telegence.net>
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      450adcbe
  18. 11 2月, 2011 1 次提交
  19. 25 1月, 2011 1 次提交
    • T
      block: reimplement FLUSH/FUA to support merge · ae1b1539
      Tejun Heo 提交于
      The current FLUSH/FUA support has evolved from the implementation
      which had to perform queue draining.  As such, sequencing is done
      queue-wide one flush request after another.  However, with the
      draining requirement gone, there's no reason to keep the queue-wide
      sequential approach.
      
      This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
      request is sequenced individually.  The actual FLUSH execution is
      double buffered and whenever a request wants to execute one for either
      PRE or POSTFLUSH, it queues on the pending queue.  Once certain
      conditions are met, a flush request is issued and on its completion
      all pending requests proceed to the next sequence.
      
      This allows arbitrary merging of different type of flushes.  How they
      are merged can be primarily controlled and tuned by adjusting the
      above said 'conditions' used to determine when to issue the next
      flush.
      
      This is inspired by Darrick's patches to merge multiple zero-data
      flushes which helps workloads with highly concurrent fsync requests.
      
      * As flush requests are never put on the IO scheduler, request fields
        used for flush share space with rq->rb_node.  rq->completion_data is
        moved out of the union.  This increases the request size by one
        pointer.
      
        As rq->elevator_private* are used only by the iosched too, it is
        possible to reduce the request size further.  However, to do that,
        we need to modify request allocation path such that iosched data is
        not allocated for flush requests.
      
      * FLUSH/FUA processing happens on insertion now instead of dispatch.
      
      - Comments updated as per Vivek and Mike.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: "Darrick J. Wong" <djwong@us.ibm.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      ae1b1539
  20. 05 1月, 2011 1 次提交
    • J
      block: fix accounting bug on cross partition merges · 09e099d4
      Jerome Marchand 提交于
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
      merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
      
      The detailed root cause is as follows.
      
      Assuming that there are two partition, sda1 and sda2.
      
      1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
         is 0 and sda2's one is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belongs to sda1 is issued and is merged into the request mentioned on
         step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
         from sda2 region to sda1 region. However the two partition's
         hd_struct->in_flight are not changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. In this case,
         sda2's hd_struct->in_flight, not a sda1's one, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      Also add a refcount to struct hd_struct to keep the partition in
      memory as long as users exist. We use kref_test_and_get() to ensure
      we don't add a reference to a partition which is going away.
      Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      09e099d4
  21. 17 12月, 2010 4 次提交
    • M
      block: max hardware sectors limit wrapper · 72d4cd9f
      Mike Snitzer 提交于
      Implement blk_limits_max_hw_sectors() and make
      blk_queue_max_hw_sectors() a wrapper around it.
      
      DM needs this to avoid setting queue_limits' max_hw_sectors and
      max_sectors directly.  dm_set_device_limits() now leverages
      blk_limits_max_hw_sectors() logic to establish the appropriate
      max_hw_sectors minimum (PAGE_SIZE).  Fixes issue where DM was
      incorrectly setting max_sectors rather than max_hw_sectors (which
      caused dm_merge_bvec()'s max_hw_sectors check to be ineffective).
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@kernel.org
      Acked-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      72d4cd9f
    • M
      block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead · e692cb66
      Martin K. Petersen 提交于
      When stacking devices, a request_queue is not always available. This
      forced us to have a no_cluster flag in the queue_limits that could be
      used as a carrier until the request_queue had been set up for a
      metadevice.
      
      There were several problems with that approach. First of all it was up
      to the stacking device to remember to set queue flag after stacking had
      completed. Also, the queue flag and the queue limits had to be kept in
      sync at all times. We got that wrong, which could lead to us issuing
      commands that went beyond the max scatterlist limit set by the driver.
      
      The proper fix is to avoid having two flags for tracking the same thing.
      We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
      block layer merging functions. The queue_limit 'no_cluster' is turned
      into 'cluster' to avoid double negatives and to ease stacking.
      Clustering defaults to being enabled as before. The queue flag logic is
      removed from the stacking function, and explicitly setting the cluster
      flag is no longer necessary in DM and MD.
      Reported-by: NEd Lin <ed.lin@promise.com>
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      e692cb66
    • T
      implement in-kernel gendisk events handling · 77ea887e
      Tejun Heo 提交于
      Currently, media presence polling for removeable block devices is done
      from userland.  There are several issues with this.
      
      * Polling is done by periodically opening the device.  For SCSI
        devices, the command sequence generated by such action involves a
        few different commands including TEST_UNIT_READY.  This behavior,
        while perfectly legal, is different from Windows which only issues
        single command, GET_EVENT_STATUS_NOTIFICATION.  Unfortunately, some
        ATAPI devices lock up after being periodically queried such command
        sequences.
      
      * There is no reliable and unintrusive way for a userland program to
        tell whether the target device is safe for media presence polling.
        For example, polling for media presence during an on-going burning
        session can make it fail.  The polling program can avoid this by
        opening the device with O_EXCL but then it risks making a valid
        exclusive user of the device fail w/ -EBUSY.
      
      * Userland polling is unnecessarily heavy and in-kernel implementation
        is lighter and better coordinated (workqueue, timer slack).
      
      This patch implements framework for in-kernel disk event handling,
      which includes media presence polling.
      
      * bdops->check_events() is added, which supercedes ->media_changed().
        It should check whether there's any pending event and return if so.
        Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
        DISK_EVENT_EJECT_REQUEST.  ->check_events() is guaranteed not to be
        called parallelly.
      
      * gendisk->events and ->async_events are added.  These should be
        initialized by block driver before passing the device to add_disk().
        The former contains the mask of all supported events and the latter
        the mask of all events which the device can report without polling.
        /sys/block/*/events[_async] export these to userland.
      
      * Kernel parameter block.events_dfl_poll_msecs controls the system
        polling interval (default is 0 which means disable) and
        /sys/block/*/events_poll_msecs control polling intervals for
        individual devices (default is -1 meaning use system setting).  Note
        that if a device can report all supported events asynchronously and
        its polling interval isn't explicitly set, the device won't be
        polled regardless of the system polling interval.
      
      * If a device is opened exclusively with write access, event checking
        is automatically disabled until all write exclusive accesses are
        released.
      
      * There are event 'clearing' events.  For example, both of currently
        defined events are cleared after the device has been successfully
        opened.  This information is passed to ->check_events() callback
        using @clearing argument as a hint.
      
      * Event checking is always performed from system_nrt_wq and timer
        slack is set to 25% for polling.
      
      * Nothing changes for drivers which implement ->media_changed() but
        not ->check_events().  Going forward, all drivers will be converted
        to ->check_events() and ->media_change() will be dropped.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      77ea887e
    • T
      block: move register_disk() and del_gendisk() to block/genhd.c · d2bf1b67
      Tejun Heo 提交于
      There's no reason for register_disk() and del_gendisk() to be in
      fs/partitions/check.c.  Move both to genhd.c.  While at it, collapse
      unlink_gendisk(), which was artificially in a separate function due to
      genhd.c / check.c split, into del_gendisk().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      d2bf1b67
  22. 10 11月, 2010 1 次提交
    • C
      block: remove REQ_HARDBARRIER · 02e031cb
      Christoph Hellwig 提交于
      REQ_HARDBARRIER is dead now, so remove the leftovers.  What's left
      at this point is:
      
       - various checks inside the block layer.
       - sanity checks in bio based drivers.
       - now unused bio_empty_barrier helper.
       - Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it's dead for a while,
         but Xen really needs to sort out it's barrier situaton.
       - setting of ordered tags in uas - dead code copied from old scsi
         drivers.
       - scsi different retry for barriers - it's dead and should have been
         removed when flushes were converted to FS requests.
       - blktrace handling of barriers - removed.  Someone who knows blktrace
         better should add support for REQ_FLUSH and REQ_FUA, though.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      02e031cb
  23. 28 10月, 2010 1 次提交
  24. 25 10月, 2010 1 次提交
  25. 19 10月, 2010 1 次提交
    • Y
      block: fix accounting bug on cross partition merges · 7681bfee
      Yasuaki Ishimatsu 提交于
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
      merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
      
      The detailed root cause is as follows.
      
      Assuming that there are two partition, sda1 and sda2.
      
      1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
         is 0 and sda2's one is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belongs to sda1 is issued and is merged into the request mentioned on
         step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
         from sda2 region to sda1 region. However the two partition's
         hd_struct->in_flight are not changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. In this case,
         sda2's hd_struct->in_flight, not a sda1's one, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      When reloading partition tables, quiesce IO to ensure that no
      request references to the partition struct exists. When it is safe
      to free the partition table, the IO for that device is restarted
      again.
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      7681bfee
  26. 14 10月, 2010 1 次提交
  27. 17 9月, 2010 1 次提交
    • C
      block: remove BLKDEV_IFL_WAIT · dd3932ed
      Christoph Hellwig 提交于
      All the blkdev_issue_* helpers can only sanely be used for synchronous
      caller.  To issue cache flushes or barriers asynchronously the caller needs
      to set up a bio by itself with a completion callback to move the asynchronous
      state machine ahead.  So drop the BLKDEV_IFL_WAIT flag that is always
      specified when calling blkdev_issue_* and also remove the now unused flags
      argument to blkdev_issue_flush and blkdev_issue_zeroout.  For
      blkdev_issue_discard we need to keep it for the secure discard flag, which
      gains a more descriptive name and loses the bitops vs flag confusion.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      dd3932ed
  28. 16 9月, 2010 1 次提交
  29. 15 9月, 2010 1 次提交
    • N
      block: fix an address space warning in blk-map.c · 14417799
      Namhyung Kim 提交于
      Change type of 2nd parameter of blk_rq_aligned() into unsigned long
      and remove unnecessary casting. Now we can call it with 'uaddr'
      instead of 'ubuf' in __blk_rq_map_user() so that it can remove
      following warnings from sparse:
      
       block/blk-map.c:57:31: warning: incorrect type in argument 2 (different address spaces)
       block/blk-map.c:57:31:    expected void *addr
       block/blk-map.c:57:31:    got void [noderef] <asn:1>*ubuf
      
      However blk_rq_map_kern() needs one more local variable to handle it.
      Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      14417799