1. 14 12月, 2011 4 次提交
    • T
      block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq · a612fddf
      Tejun Heo 提交于
      Most of icq management is about to be moved out of cfq into blk-ioc.
      This patch prepares for it.
      
      * Move cfqd->icq_list to request_queue->icq_list
      
      * Make request explicitly point to icq instead of through elevator
        private data.  ->elevator_private[3] is replaced with sub struct elv
        which contains icq pointer and priv[2].  cfq is updated accordingly.
      
      * Meaningless clearing of ->elevator_private[0] removed from
        elv_set_request().  At that point in code, the field was guaranteed
        to be %NULL anyway.
      
      This patch doesn't introduce any functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a612fddf
    • T
      block: remove elevator_queue->ops · 22f746e2
      Tejun Heo 提交于
      elevator_queue->ops points to the same ops struct ->elevator_type.ops
      is pointing to.  The only effect of caching it in elevator_queue is
      shorter notation - it doesn't save any indirect derefence.
      
      Relocate elevator_type->list which used only during module init/exit
      to the end of the structure, rename elevator_queue->elevator_type to
      ->type, and replace elevator_queue->ops with elevator_queue->type.ops.
      
      This doesn't introduce any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      22f746e2
    • T
      block: reorder elevator switch sequence · f8fc877d
      Tejun Heo 提交于
      Elevator switch sequence first attached the new elevator, then tried
      registering it (sysfs) and if that failed attached back the old
      elevator.  However, sysfs registration doesn't require the elevator to
      be attached, so there is no reason to do the "detach, attach new,
      register, maybe re-attach old" sequence.  It can just do "register,
      detach, attach".
      
      * elevator_init_queue() is updated to set ->elevator_data directly and
        return 0 / -errno.  This allows elevator_exit() on an unattached
        elevator.
      
      * __elv_unregister_queue() which was necessary to unregister
        unattached q is removed in favor of __elv_register_queue() which can
        register unattached q.
      
      * elevator_attach() becomes a single assignment and obscures more then
        it helps.  Dropped.
      
      This will help cleaning up io_context handling across elevator switch.
      
      This patch doesn't introduce visible behavior change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f8fc877d
    • T
      block, cfq: remove delayed unlink · b9a19208
      Tejun Heo 提交于
      Now that all cic's are immediately unlinked from both ioc and queue,
      lazy dropping from lookup path and trimming on elevator unregister are
      unnecessary.  Kill them and remove now unused elevator_ops->trim().
      
      This also leaves call_for_each_cic() without any user.  Removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b9a19208
  2. 19 10月, 2011 2 次提交
    • T
      block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown · c9a929dd
      Tejun Heo 提交于
      request_queue is refcounted but actually depdends on lifetime
      management from the queue owner - on blk_cleanup_queue(), block layer
      expects that there's no request passing through request_queue and no
      new one will.
      
      This is fundamentally broken.  The queue owner (e.g. SCSI layer)
      doesn't have a way to know whether there are other active users before
      calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
      guarantee that the queue is and would stay valid while it's holding a
      reference.
      
      With delay added in blk_queue_bio() before queue_lock is grabbed, the
      following oops can be easily triggered when a device is removed with
      in-flight IOs.
      
       sd 0:0:1:0: [sdb] Stopping disk
       ata1.01: disabled
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
      
       Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
       RIP: 0010:[<ffffffff8137d651>]  [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
       ...
       Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
       ...
       Call Trace:
        [<ffffffff8137d774>] elv_merge+0x84/0xe0
        [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
        [<ffffffff813838ea>] generic_make_request+0xca/0x100
        [<ffffffff81383994>] submit_bio+0x74/0x100
        [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
        [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
        [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
        [<ffffffff8118c1ca>] do_sync_read+0xda/0x120
        [<ffffffff8118ce55>] vfs_read+0xc5/0x180
        [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
        [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
      
      This happens because blk_queue_cleanup() destroys the queue and
      elevator whether IOs are in progress or not and DEAD tests are
      sprinkled in the request processing path without proper
      synchronization.
      
      Similar problem exists for blk-throtl.  On queue cleanup, blk-throtl
      is shutdown whether it has requests in it or not.  Depending on
      timing, it either oopses or throttled bios are lost putting tasks
      which are waiting for bio completion into eternal D state.
      
      The way it should work is having the usual clear distinction between
      shutdown and release.  Shutdown drains all currently pending requests,
      marks the queue dead, and performs partial teardown of the now
      unnecessary part of the queue.  Even after shutdown is complete,
      reference holders are still allowed to issue requests to the queue
      although they will be immmediately failed.  The rest of teardown
      happens on release.
      
      This patch makes the following changes to make blk_queue_cleanup()
      behave as proper shutdown.
      
      * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
        queue_lock.
      
      * Unsynchronized DEAD check in generic_make_request_checks() removed.
        This couldn't make any meaningful difference as the queue could die
        after the check.
      
      * blk_drain_queue() updated such that it can drain all requests and is
        now called during cleanup.
      
      * blk_throtl updated such that it checks DEAD on grabbing queue_lock,
        drains all throttled bios during cleanup and free td when queue is
        released.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c9a929dd
    • T
      block: reorganize queue draining · e3c78ca5
      Tejun Heo 提交于
      Reorganize queue draining related code in preparation of queue exit
      changes.
      
      * Factor out actual draining from elv_quiesce_start() to
        blk_drain_queue().
      
      * Make elv_quiesce_start/end() responsible for their own locking.
      
      * Replace open-coded ELVSWITCH clearing in elevator_switch() with
        elv_quiesce_end().
      
      This patch doesn't cause any visible functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e3c78ca5
  3. 12 9月, 2011 1 次提交
  4. 03 6月, 2011 1 次提交
    • J
      iosched: prevent aliased requests from starving other I/O · 796d5116
      Jeff Moyer 提交于
      Hi, Jens,
      
      If you recall, I posted an RFC patch for this back in July of last year:
      http://lkml.org/lkml/2010/7/13/279
      
      The basic problem is that a process can issue a never-ending stream of
      async direct I/Os to the same sector on a device, thus starving out
      other I/O in the system (due to the way the alias handling works in both
      cfq and deadline).  The solution I proposed back then was to start
      dispatching from the fifo after a certain number of aliases had been
      dispatched.  Vivek asked why we had to treat aliases differently at all,
      and I never had a good answer.  So, I put together a simple patch which
      allows aliases to be added to the rb tree (it adds them to the right,
      though that doesn't matter as the order isn't guaranteed anyway).  I
      think this is the preferred solution, as it doesn't break up time slices
      in CFQ or batches in deadline.  I've tested it, and it does solve the
      starvation issue.  Let me know what you think.
      
      Cheers,
      Jeff
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      796d5116
  5. 21 5月, 2011 1 次提交
  6. 06 5月, 2011 1 次提交
  7. 22 4月, 2011 1 次提交
  8. 18 4月, 2011 1 次提交
  9. 06 4月, 2011 1 次提交
  10. 21 3月, 2011 1 次提交
    • J
      block: attempt to merge with existing requests on plug flush · 5e84ea3a
      Jens Axboe 提交于
      One of the disadvantages of on-stack plugging is that we potentially
      lose out on merging since all pending IO isn't always visible to
      everybody. When we flush the on-stack plugs, right now we don't do
      any checks to see if potential merge candidates could be utilized.
      
      Correct this by adding a new insert variant, ELEVATOR_INSERT_SORT_MERGE.
      It works just ELEVATOR_INSERT_SORT, but first checks whether we can
      merge with an existing request before doing the insertion (if we fail
      merging).
      
      This fixes a regression with multiple processes issuing IO that
      can be merged.
      
      Thanks to Shaohua Li <shaohua.li@intel.com> for testing and fixing
      an accounting bug.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      5e84ea3a
  11. 10 3月, 2011 2 次提交
    • J
      block: remove per-queue plugging · 7eaceacc
      Jens Axboe 提交于
      Code has been converted over to the new explicit on-stack plugging,
      and delay users have been converted to use the new API for that.
      So lets kill off the old plugging along with aops->sync_page().
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      7eaceacc
    • J
      block: initial patch for on-stack per-task plugging · 73c10101
      Jens Axboe 提交于
      This patch adds support for creating a queuing context outside
      of the queue itself. This enables us to batch up pieces of IO
      before grabbing the block device queue lock and submitting them to
      the IO scheduler.
      
      The context is created on the stack of the process and assigned in
      the task structure, so that we can auto-unplug it if we hit a schedule
      event.
      
      The current queue plugging happens implicitly if IO is submitted to
      an empty device, yet callers have to remember to unplug that IO when
      they are going to wait for it. This is an ugly API and has caused bugs
      in the past. Additionally, it requires hacks in the vm (->sync_page()
      callback) to handle that logic. By switching to an explicit plugging
      scheme we make the API a lot nicer and can get rid of the ->sync_page()
      hack in the vm.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      73c10101
  12. 02 3月, 2011 1 次提交
    • T
      block: add @force_kblockd to __blk_run_queue() · 1654e741
      Tejun Heo 提交于
      __blk_run_queue() automatically either calls q->request_fn() directly
      or schedules kblockd depending on whether the function is recursed.
      blk-flush implementation needs to be able to explicitly choose
      kblockd.  Add @force_kblockd.
      
      All the current users are converted to specify %false for the
      parameter and this patch doesn't introduce any behavior change.
      
      stable: This is prerequisite for fixing ide oops caused by the new
              blk-flush implementation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      1654e741
  13. 11 2月, 2011 1 次提交
  14. 25 1月, 2011 1 次提交
    • T
      block: reimplement FLUSH/FUA to support merge · ae1b1539
      Tejun Heo 提交于
      The current FLUSH/FUA support has evolved from the implementation
      which had to perform queue draining.  As such, sequencing is done
      queue-wide one flush request after another.  However, with the
      draining requirement gone, there's no reason to keep the queue-wide
      sequential approach.
      
      This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
      request is sequenced individually.  The actual FLUSH execution is
      double buffered and whenever a request wants to execute one for either
      PRE or POSTFLUSH, it queues on the pending queue.  Once certain
      conditions are met, a flush request is issued and on its completion
      all pending requests proceed to the next sequence.
      
      This allows arbitrary merging of different type of flushes.  How they
      are merged can be primarily controlled and tuned by adjusting the
      above said 'conditions' used to determine when to issue the next
      flush.
      
      This is inspired by Darrick's patches to merge multiple zero-data
      flushes which helps workloads with highly concurrent fsync requests.
      
      * As flush requests are never put on the IO scheduler, request fields
        used for flush share space with rq->rb_node.  rq->completion_data is
        moved out of the union.  This increases the request size by one
        pointer.
      
        As rq->elevator_private* are used only by the iosched too, it is
        possible to reduce the request size further.  However, to do that,
        we need to modify request allocation path such that iosched data is
        not allocated for flush requests.
      
      * FLUSH/FUA processing happens on insertion now instead of dispatch.
      
      - Comments updated as per Vivek and Mike.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: "Darrick J. Wong" <djwong@us.ibm.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      ae1b1539
  15. 10 11月, 2010 1 次提交
    • C
      block: remove REQ_HARDBARRIER · 02e031cb
      Christoph Hellwig 提交于
      REQ_HARDBARRIER is dead now, so remove the leftovers.  What's left
      at this point is:
      
       - various checks inside the block layer.
       - sanity checks in bio based drivers.
       - now unused bio_empty_barrier helper.
       - Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it's dead for a while,
         but Xen really needs to sort out it's barrier situaton.
       - setting of ordered tags in uas - dead code copied from old scsi
         drivers.
       - scsi different retry for barriers - it's dead and should have been
         removed when flushes were converted to FS requests.
       - blktrace handling of barriers - removed.  Someone who knows blktrace
         better should add support for REQ_FLUSH and REQ_FUA, though.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      02e031cb
  16. 07 10月, 2010 1 次提交
    • J
      elevator: fix oops on early call to elevator_change() · 430c62fb
      Jens Axboe 提交于
      2.6.36 introduces an API for drivers to switch the IO scheduler
      instead of manually calling the elevator exit and init functions.
      This API was added since q->elevator must be cleared in between
      those two calls. And since we already have this functionality
      directly from use by the sysfs interface to switch schedulers
      online, it was prudent to reuse it internally too.
      
      But this API needs the queue to be in a fully initialized state
      before it is called, or it will attempt to unregister elevator
      kobjects before they have been added. This results in an oops
      like this:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000051
      IP: [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
      PGD 47ddfc067 PUD 47c6a1067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP
      last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.1/irq
      CPU 2
      Modules linked in: t(+) loop hid_apple usbhid ahci ehci_hcd uhci_hcd libahci usbcore nls_base igb
      
      Pid: 7319, comm: modprobe Not tainted 2.6.36-rc6+ #132 QSSC-S4R/QSSC-S4R
      RIP: 0010:[<ffffffff8116f15e>]  [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
      RSP: 0018:ffff88027da25d08  EFLAGS: 00010246
      RAX: ffff88047c68c528 RBX: 00000000fffffffe RCX: 0000000000000000
      RDX: 000000000000002f RSI: 000000000000002f RDI: ffff88047e196c88
      RBP: ffff88027da25d38 R08: 0000000000000000 R09: d84156c5635688c0
      R10: d84156c5635688c0 R11: 0000000000000000 R12: ffff88047e196c88
      R13: 0000000000000000 R14: 0000000000000000 R15: ffff88047c68c528
      FS:  00007fcb0b26f6e0(0000) GS:ffff880287400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000051 CR3: 000000047e76e000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process modprobe (pid: 7319, threadinfo ffff88027da24000, task ffff88027d377090)
      Stack:
       ffff88027da25d58 ffff88047c68c528 00000000fffffffe ffff88047e196c88
      <0> ffff88047c68c528 ffff88047e05bd90 ffff88027da25d78 ffffffff8123fb77
      <0> ffff88047e05bd90 0000000000000000 ffff88047e196c88 ffff88047c68c528
      Call Trace:
       [<ffffffff8123fb77>] kobject_add_internal+0xe7/0x1f0
       [<ffffffff8123fd98>] kobject_add_varg+0x38/0x60
       [<ffffffff8123feb9>] kobject_add+0x69/0x90
       [<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0
       [<ffffffff8103d48d>] ? sub_preempt_count+0x9d/0xe0
       [<ffffffff8143de20>] ? _raw_spin_unlock+0x30/0x50
       [<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0
       [<ffffffff8116eff4>] ? sysfs_remove_dir+0x34/0xa0
       [<ffffffff81224204>] elv_register_queue+0x34/0xa0
       [<ffffffff81224aad>] elevator_change+0xfd/0x250
       [<ffffffffa007e000>] ? t_init+0x0/0x361 [t]
       [<ffffffffa007e000>] ? t_init+0x0/0x361 [t]
       [<ffffffffa007e0a8>] t_init+0xa8/0x361 [t]
       [<ffffffff810001de>] do_one_initcall+0x3e/0x170
       [<ffffffff8108c3fd>] sys_init_module+0xbd/0x220
       [<ffffffff81002f2b>] system_call_fastpath+0x16/0x1b
      Code: e5 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 85 ff 74 52 48 8b 47 18 49 c7 c5 00 46 61 81 48 85 c0 74 04 4c 8b 68 30 45 31 f6 <41> 80 7d 51 00 74 0e 49 8b 44 24 28 4c 89 e7 ff 50 20 49 89 c6
      RIP  [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
       RSP <ffff88027da25d08>
      CR2: 0000000000000051
      ---[ end trace a6541d3bf07945df ]---
      
      Fix this by adding a registered bit to the elevator queue, which is
      set when the sysfs kobjects have been registered.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      430c62fb
  17. 10 9月, 2010 1 次提交
    • T
      block: drop barrier ordering by queue draining · 28e7d184
      Tejun Heo 提交于
      Filesystems will take all the responsibilities for ordering requests
      around commit writes and will only indicate how the commit writes
      themselves should be handled by block layers.  This patch drops
      barrier ordering by queue draining from block layer.  Ordering by
      draining implementation was somewhat invasive to request handling.
      List of notable changes follow.
      
      * Each queue has 1 bit color which is flipped on each barrier issue.
        This is used to track whether a given request is issued before the
        current barrier or not.  REQ_ORDERED_COLOR flag and coloring
        implementation in __elv_add_request() are removed.
      
      * Requests which shouldn't be processed yet for draining were stalled
        by returning -EAGAIN from blk_do_ordered() according to the test
        result between blk_ordered_req_seq() and blk_blk_ordered_cur_seq().
        This logic is removed.
      
      * Draining completion logic in elv_completed_request() removed.
      
      * All barrier sequence requests were queued to request queue and then
        trckled to lower layer according to progress and thus maintaining
        request orders during requeue was necessary.  This is replaced by
        queueing the next request in the barrier sequence only after the
        current one is complete from blk_ordered_complete_seq(), which
        removes the need for multiple proxy requests in struct request_queue
        and the request sorting logic in the ELEVATOR_INSERT_REQUEUE path of
        elv_insert().
      
      * As barriers no longer have ordering constraints, there's no need to
        dump the whole elevator onto the dispatch queue on each barrier.
        Insert barriers at the front instead.
      
      * If other barrier requests come to the front of the dispatch queue
        while one is already in progress, they are stored in
        q->pending_barriers and restored to dispatch queue one-by-one after
        each barrier completion from blk_ordered_complete_seq().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      28e7d184
  18. 23 8月, 2010 1 次提交
  19. 12 8月, 2010 1 次提交
  20. 08 8月, 2010 2 次提交
  21. 04 6月, 2010 1 次提交
  22. 24 5月, 2010 1 次提交
  23. 11 5月, 2010 1 次提交
    • M
      block: allow initialization of previously allocated request_queue · 01effb0d
      Mike Snitzer 提交于
      blk_init_queue() allocates the request_queue structure and then
      initializes it as needed (request_fn, elevator, etc).
      
      Split initialization out to blk_init_allocated_queue_node.
      Introduce blk_init_allocated_queue wrapper function to model existing
      blk_init_queue and blk_init_queue_node interfaces.
      
      Export elv_register_queue to allow a newly added elevator to be
      registered with sysfs.  Export elv_unregister_queue for symmetry.
      
      These changes allow DM to initialize a device's request_queue with more
      precision.  In particular, DM no longer unconditionally initializes a
      full request_queue (elevator et al).  It only does so for a
      request-based DM device.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      01effb0d
  24. 09 4月, 2010 1 次提交
    • D
      blkio: Add io_merged stat · 812d4026
      Divyesh Shah 提交于
      This includes both the number of bios merged into requests belonging to this
      cgroup as well as the number of requests merged together.
      In the past, we've observed different merging behavior across upstream kernels,
      some by design some actual bugs. This stat helps a lot in debugging such
      problems when applications report decreased throughput with a new kernel
      version.
      
      This needed adding an extra elevator function to capture bios being merged as I
      did not want to pollute elevator code with blkiocg knowledge and hence needed
      the accounting invocation to come from CFQ.
      
      Signed-off-by: Divyesh Shah<dpshah@google.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      812d4026
  25. 02 4月, 2010 1 次提交
  26. 08 3月, 2010 1 次提交
  27. 29 1月, 2010 1 次提交
    • A
      block: Added in stricter no merge semantics for block I/O · 488991e2
      Alan D. Brunelle 提交于
      Updated 'nomerges' tunable to accept a value of '2' - indicating that _no_
      merges at all are to be attempted (not even the simple one-hit cache).
      
      The following table illustrates the additional benefit - 5 minute runs of
      a random I/O load were applied to a dozen devices on a 16-way x86_64 system.
      
      nomerges        Throughput      %System         Improvement (tput / %sys)
      --------        ------------    -----------     -------------------------
      0               12.45 MB/sec    0.669365609
      1               12.50 MB/sec    0.641519199     0.40% / 2.71%
      2               12.52 MB/sec    0.639849750     0.56% / 2.96%
      Signed-off-by: NAlan D. Brunelle <alan.brunelle@hp.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      488991e2
  28. 09 10月, 2009 1 次提交
    • K
      elv_iosched_store(): fix strstrip() misuse · 8c279598
      KOSAKI Motohiro 提交于
      elv_iosched_store() ignore the return value of strstrip().  It makes small
      inconsistent behavior.
      
      This patch fixes it.
      
       <before>
       ====================================
       # cd /sys/block/{blockdev}/queue
      
       case1:
       # echo "anticipatory" > scheduler
       # cat scheduler
       noop [anticipatory] deadline cfq
      
       case2:
       # echo "anticipatory " > scheduler
       # cat scheduler
       noop [anticipatory] deadline cfq
      
       case3:
       # echo " anticipatory" > scheduler
       bash: echo: write error: Invalid argument
      
       <after>
       ====================================
       # cd /sys/block/{blockdev}/queue
      
       case1:
       # echo "anticipatory" > scheduler
       # cat scheduler
       noop [anticipatory] deadline cfq
      
       case2:
       # echo "anticipatory " > scheduler
       # cat scheduler
       noop [anticipatory] deadline cfq
      
       case3:
       # echo " anticipatory" > scheduler
       noop [anticipatory] deadline cfq
      
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      8c279598
  29. 03 10月, 2009 1 次提交
  30. 11 9月, 2009 2 次提交
  31. 17 7月, 2009 1 次提交
    • T
      block: fix failfast merge testing in elv_rq_merge_ok() · 0a09f431
      Tejun Heo 提交于
      Commit ab0fd1de tries to prevent merge
      of requests with different failfast settings.  In elv_rq_merge_ok(),
      it compares new bio's failfast flags against the merge target
      request's.  However, the flag testing accessors for bio and blk don't
      return boolean but the tested bit value directly and FAILFAST on bio
      and blk don't match, so directly comparing them with == results in
      false negative unnecessary preventing merge of readahead requests.
      
      This patch convert the results to boolean by negating them before
      comparison.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jeff Garzik <jeff@garzik.org>
      0a09f431
  32. 04 7月, 2009 1 次提交
    • T
      block: don't merge requests of different failfast settings · ab0fd1de
      Tejun Heo 提交于
      Block layer used to merge requests and bios with different failfast
      settings.  This caused regular IOs to fail prematurely when they were
      merged into failfast requests for readahead.
      
      Niel Lambrechts could trigger the problem semi-reliably on ext4 when
      resuming from STR.  ext4 uses readahead when reading inodes and
      combined with the deterministic extra SATA PHY exception cycle during
      resume on the specific configuration, non-readahead inode read would
      fail causing ext4 errors.  Please read the following thread for
      details.
      
        http://lkml.org/lkml/2009/5/23/21
      
      This patch makes block layer reject merging if the failfast settings
      don't match.  This is correct but likely to lower IO performance by
      preventing regular IOs from mingling into surrounding readahead
      requests.  Changes to allow such mixed merges and handle errors
      correctly will be added later.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NNiel Lambrechts <niel.lambrechts@gmail.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Signed-off-by: NJens Axboe <axboe@carl.(none)>
      ab0fd1de
  33. 10 6月, 2009 1 次提交
    • L
      tracing/events: convert block trace points to TRACE_EVENT() · 55782138
      Li Zefan 提交于
      TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
      these new capabilities to this tracepoint:
      
        - zero-copy and per-cpu splice() tracing
        - binary tracing without printf overhead
        - structured logging records exposed under /debug/tracing/events
        - trace events embedded in function tracer output and other plugins
        - user-defined, per tracepoint filter expressions
        ...
      
      Cons:
      
        - no dev_t info for the output of plug, unplug_timer and unplug_io events.
          no dev_t info for getrq and sleeprq events if bio == NULL.
          no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
      
          This is mainly because we can't get the deivce from a request queue.
          But this may change in the future.
      
        - A packet command is converted to a string in TP_assign, not TP_print.
          While blktrace do the convertion just before output.
      
          Since pc requests should be rather rare, this is not a big issue.
      
        - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
          has a unique format, which means we have some unused data in a trace entry.
      
          The overhead is minimized by using __dynamic_array() instead of __array().
      
      I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
      
            dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
      1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
      2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
      3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s
      
      So the overhead of tracing is very small, and no regression when using
      those trace events vs blktrace.
      
      And the binary output of TRACE_EVENT is much smaller than blktrace:
      
       # ls -l -h
       -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
       -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
       -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
      
      Following are some comparisons between TRACE_EVENT and blktrace:
      
      plug:
        kjournald-480   [000]   303.084981: block_plug: [kjournald]
        kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]
      
      unplug_io:
        kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
        kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1
      
      remap:
        kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
        kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384
      
      bio_backmerge:
        kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
        kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]
      
      getrq:
        kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]
      
        bash-2066  [001]  1072.953770:   8,0    G   N [bash]
        bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
      
      rq_complete:
        konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
        konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]
      
        ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
        ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
      
      rq_insert:
        kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]
      
      Changelog from v2 -> v3:
      
      - use the newly introduced __dynamic_array().
      
      Changelog from v1 -> v2:
      
      - use __string() instead of __array() to minimize the memory required
        to store hex dump of rq->cmd().
      
      - support large pc requests.
      
      - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
      
      - some cleanups.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      55782138