1. 10 March 2011, 3 commits
    • block: remove per-queue plugging · 7eaceacc
      Committed by Jens Axboe
      Code has been converted over to the new explicit on-stack plugging,
      and delay users have been converted to use the new API for that.
      So let's kill off the old plugging along with aops->sync_page().
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: initial patch for on-stack per-task plugging · 73c10101
      Committed by Jens Axboe
      This patch adds support for creating a queuing context outside
      of the queue itself. This enables us to batch up pieces of IO
      before grabbing the block device queue lock and submitting them to
      the IO scheduler.
      
      The context is created on the stack of the process and assigned in
      the task structure, so that we can auto-unplug it if we hit a schedule
      event.
      
      The current queue plugging happens implicitly if IO is submitted to
      an empty device, yet callers have to remember to unplug that IO when
      they are going to wait for it. This is an ugly API and has caused bugs
      in the past. Additionally, it requires hacks in the vm (->sync_page()
      callback) to handle that logic. By switching to an explicit plugging
      scheme we make the API a lot nicer and can get rid of the ->sync_page()
      hack in the vm.
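      
      A minimal usage sketch of the API this patch introduces (blk_start_plug(),
      blk_finish_plug() and struct blk_plug are from this series; the submission
      loop itself is illustrative):
      
          struct blk_plug plug;
          int i;
      
          blk_start_plug(&plug);          /* plug lives on our stack and is
                                             tracked via current->plug */
          for (i = 0; i < nr; i++)
              submit_bio(READ, bios[i]);  /* IO is batched per task instead of
                                             taking the queue lock each time */
          blk_finish_plug(&plug);         /* hand the whole batch to the IO
                                             scheduler in one go */
      
      If the task schedules while plugged, the batch is flushed automatically,
      which is the auto-unplug behavior described above.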
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: add API for delaying work/request_fn a little bit · 3cca6dc1
      Committed by Jens Axboe
      Currently we use plugging for that, but as plugging is going away,
      we need an alternative mechanism.
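      
      A hedged sketch of how a driver might use the new call (blk_delay_queue()
      is the API added here; the hw_busy() condition is hypothetical):
      
          static void my_request_fn(struct request_queue *q)
          {
              if (hw_busy(q)) {
                  /* hypothetical: hardware can't take commands right now;
                     re-run the queue in 3 ms instead of plugging it */
                  blk_delay_queue(q, 3);
                  return;
              }
              /* ... dispatch requests ... */
          }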
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  2. 03 March 2011, 1 commit
  3. 02 March 2011, 1 commit
  4. 11 February 2011, 2 commits
  5. 09 February 2011, 1 commit
    • cfq-iosched: Don't wait if queue already has requests. · 02a8f01b
      Committed by Justin TerAvest
      Commit 7667aa06 added logic to wait for
      the last queue of the group to become busy (have at least one request),
      so that the group does not lose out for not being continuously
      backlogged. The commit did not check for the condition that the last
      queue already has some requests. As a result, if the queue already has
      requests, wait_busy is set. Later on, cfq_select_queue() checks the
      flag, and decides that since the queue has a request now and wait_busy
      is set, the queue is expired.  This results in early expiration of the
      queue.
      
      This patch fixes the problem by adding a check to see if queue already
      has requests. If it does, wait_busy is not set. As a result, time slices
      do not expire early.
      
      The queues with more than one request are usually buffered writers.
      Testing shows improvement in isolation between buffered writers.
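      
      A sketch of the fix described above, paraphrased rather than quoted from
      the patch (cfq_should_wait_busy() is the decision point; the surrounding
      logic is condensed):
      
          static bool cfq_should_wait_busy(struct cfq_data *cfqd,
                                           struct cfq_queue *cfqq)
          {
              /* the fix: a queue that already has requests queued is busy
                 by definition, so don't set wait_busy for it */
              if (!RB_EMPTY_ROOT(&cfqq->sort_list))
                  return false;
      
              /* ... existing last-queue-in-group checks ... */
              return true;
          }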
      
      Cc: stable@kernel.org
      Signed-off-by: Justin TerAvest <teravest@google.com>
      Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  6. 25 January 2011, 3 commits
    • block: reimplement FLUSH/FUA to support merge · ae1b1539
      Committed by Tejun Heo
      The current FLUSH/FUA support has evolved from the implementation
      which had to perform queue draining.  As such, sequencing is done
      queue-wide one flush request after another.  However, with the
      draining requirement gone, there's no reason to keep the queue-wide
      sequential approach.
      
      This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
      request is sequenced individually.  The actual FLUSH execution is
      double buffered and whenever a request wants to execute one for either
      PRE or POSTFLUSH, it queues on the pending queue.  Once certain
      conditions are met, a flush request is issued and on its completion
      all pending requests proceed to the next sequence.
      
      This allows arbitrary merging of different types of flushes.  How they
      are merged can be controlled and tuned primarily by adjusting the
      aforementioned 'conditions' used to determine when to issue the next
      flush.
      
      This is inspired by Darrick's patches to merge multiple zero-data
      flushes which helps workloads with highly concurrent fsync requests.
      
      * As flush requests are never put on the IO scheduler, request fields
        used for flush share space with rq->rb_node.  rq->completion_data is
        moved out of the union.  This increases the request size by one
        pointer.
      
        As rq->elevator_private* are used only by the iosched too, it is
        possible to reduce the request size further.  However, to do that,
        we need to modify request allocation path such that iosched data is
        not allocated for flush requests.
      
      * FLUSH/FUA processing happens on insertion now instead of dispatch.
      
      - Comments updated as per Vivek and Mike.
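      
      A rough sketch of the double-buffered sequencing described above (the
      queue names follow blk-flush.c; flush_conditions_met() is a hypothetical
      stand-in for the tunable 'conditions'):
      
          /* while one flush is in flight, new PRE/POSTFLUSH waiters park on
             the other pending queue */
          list_add_tail(&rq->flush.list,
                        &q->flush_queue[q->flush_pending_idx]);
      
          /* once the conditions are met, one flush covers the whole batch */
          if (flush_conditions_met(q))
              blk_kick_flush(q);
      
          /* on completion of that flush, every parked request advances to
             its next sequence step (PREFLUSH -> DATA -> POSTFLUSH -> done) */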
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "Darrick J. Wong" <djwong@us.ibm.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: improve flush bio completion · 143a87f4
      Committed by Tejun Heo
      Bios for a flush are completed twice - once during the data phase and
      once more after the whole sequence is complete.  The first
      completion shouldn't notify the issuer.
      
      This was achieved by skipping all bio completion steps in
      req_bio_endio() for the first completion; however, this has two
      drawbacks.
      
      * Error is not recorded in bio and must be tracked somewhere else.
      
      * Partial completion is not supported.
      
      Neither causes problems for the current users; however, they make
      further improvements difficult.  Change req_bio_endio() such that it
      only skips the actual notification part for the first completion.  bio
      completion is implemented with partial completions in mind anyway, so
      this is as simple as moving the REQ_FLUSH_SEQ conditional such that
      only the call to bio_endio() is skipped.
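      
      Schematically, the change keeps error recording and partial-completion
      bookkeeping for flush-sequence bios and skips only the notification (a
      paraphrase of the described behavior, not the literal hunk):
      
          static void req_bio_endio(struct request *rq, struct bio *bio,
                                    unsigned int nbytes, int error)
          {
              if (error)
                  clear_bit(BIO_UPTODATE, &bio->bi_flags);
      
              /* partial-completion accounting now runs unconditionally */
              bio->bi_size -= nbytes;
              bio->bi_sector += (nbytes >> 9);
      
              /* only the issuer notification is skipped the first time */
              if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
                  bio_endio(bio, error);
          }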
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: add REQ_FLUSH_SEQ · 414b4ff5
      Committed by Tejun Heo
      rq == &q->flush_rq was used to determine whether a rq is part of a
      flush sequence, which worked because all requests in a flush sequence
      were sequenced using the single dedicated request.  This is about to
      change, so introduce REQ_FLUSH_SEQ flag to distinguish flush sequence
      requests.
      
      This patch doesn't cause any behavior change.
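      
      The identity test becomes a flag test - schematically:
      
          /* before: only the single dedicated request could be part of a
             flush sequence */
          if (rq == &q->flush_rq)
              ...
      
          /* after: any request can carry the marker */
          if (rq->cmd_flags & REQ_FLUSH_SEQ)
              ...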
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  7. 21 January 2011, 1 commit
    • kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT · 6a108a14
      Committed by David Rientjes
      The meaning of CONFIG_EMBEDDED has long since been obsoleted; the option
      is used to configure any non-standard kernel with a much larger scope than
      only small devices.
      
      This patch renames the option to CONFIG_EXPERT in init/Kconfig and fixes
      references to the option throughout the kernel.  A new CONFIG_EMBEDDED
      option is added that automatically selects CONFIG_EXPERT when enabled and
      can be used in the future to isolate options that should only be
      considered for embedded systems (RISC architectures, SLOB, etc).
      
      Calling the option "EXPERT" more accurately represents its intention: only
      expert users who understand the impact of the configuration changes they
      are making should enable it.
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: David Woodhouse <david.woodhouse@intel.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Greg KH <gregkh@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Robin Holt <holt@sgi.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 19 January 2011, 2 commits
  9. 14 January 2011, 2 commits
  10. 07 January 2011, 3 commits
  11. 05 January 2011, 1 commit
    • block: fix accounting bug on cross partition merges · 09e099d4
      Committed by Jerome Marchand
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      The reason is incorrect accounting of hd_struct->in_flight when a bio
      is merged into a request that belongs to a different partition, e.g. by
      ELEVATOR_FRONT_MERGE.
      
      The detailed root cause is as follows.
      
      Assuming that there are two partitions, sda1 and sda2.
      
      1. A request for sda2 is in the request_queue. Hence sda1's hd_struct->in_flight
         is 0 and sda2's is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belonging to sda1 is issued and is merged into the request mentioned
         in step 1 by ELEVATOR_FRONT_MERGE. The first sector of the request changes
         from the sda2 region to the sda1 region. However, neither partition's
         hd_struct->in_flight is changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. In this case,
         sda1's hd_struct->in_flight, not sda2's, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      Also add a refcount to struct hd_struct to keep the partition in
      memory as long as users exist. We use kref_test_and_get() to ensure
      we don't add a reference to a partition which is going away.
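      
      A sketch of the cached-lookup idea (the field and helper names follow
      the description above and should be read as illustrative, not as the
      exact hunk):
      
          /* when accounting starts: look the partition up once and pin it */
          if (!rq->part) {
              rq->part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
              hd_struct_get(rq->part);        /* refcount keeps it alive */
          }
          part_inc_in_flight(rq->part, rw);
      
          /* at completion: same struct, so the counters stay balanced even
             if the request was merged across a partition boundary */
          part_dec_in_flight(rq->part, rw);
          hd_struct_put(rq->part);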
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  12. 03 January 2011, 1 commit
  13. 21 December 2010, 1 commit
  14. 17 December 2010, 8 commits
    • fs/block: type signature of major_to_index(int) to major_to_index(unsigned) · e61eb2e9
      Committed by Yang Zhang
      The major/minor device numbers are always defined and used as `unsigned'.
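      
      For reference, the helper in block/genhd.c after the change looks like
      this (a sketch from the description):
      
          static inline int major_to_index(unsigned major)
          {
              return major % BLKDEV_MAJOR_HASH_SIZE;
          }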
      Signed-off-by: Yang Zhang <kthreadd@gmail.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: convert !IS_ERR(p) && p to !IS_ERR_OR_NULL(p) · b9f985b6
      Committed by Yang Zhang
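      
      The conversion in a nutshell (p is an illustrative pointer):
      
          /* before */
          if (!IS_ERR(p) && p)
              do_something(p);
      
          /* after */
          if (!IS_ERR_OR_NULL(p))
              do_something(p);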
      Signed-off-by: Yang Zhang <kthreadd@gmail.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • cfq-iosched: don't check cfqg in choose_service_tree() · 7278c9c1
      Committed by Gui Jianfeng
      When cfq_choose_cfqg() is called in select_queue(), there must be at least one
      backlogged CFQ queue waiting to be dispatched, hence there must be at least one
      backlogged CFQ group on the service tree. So we never call choose_service_tree()
      with cfqg == NULL.
      Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: max hardware sectors limit wrapper · 72d4cd9f
      Committed by Mike Snitzer
      Implement blk_limits_max_hw_sectors() and make
      blk_queue_max_hw_sectors() a wrapper around it.
      
      DM needs this to avoid setting queue_limits' max_hw_sectors and
      max_sectors directly.  dm_set_device_limits() now leverages
      blk_limits_max_hw_sectors() logic to establish the appropriate
      max_hw_sectors minimum (PAGE_SIZE).  Fixes issue where DM was
      incorrectly setting max_sectors rather than max_hw_sectors (which
      caused dm_merge_bvec()'s max_hw_sectors check to be ineffective).
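      
      The wrapper relationship, condensed (this mirrors the shape of the
      change; the bodies are abbreviated):
      
          void blk_limits_max_hw_sectors(struct queue_limits *limits,
                                         unsigned int max_hw_sectors)
          {
              /* the PAGE_SIZE floor is now enforced in one shared place */
              if ((max_hw_sectors << 9) < PAGE_CACHE_SIZE) {
                  max_hw_sectors = 1 << (PAGE_CACHE_SHIFT - 9);
                  printk(KERN_INFO "%s: set to minimum %d\n",
                         __func__, max_hw_sectors);
              }
              limits->max_hw_sectors = limits->max_sectors = max_hw_sectors;
          }
      
          void blk_queue_max_hw_sectors(struct request_queue *q,
                                        unsigned int max_hw_sectors)
          {
              blk_limits_max_hw_sectors(&q->limits, max_hw_sectors);
          }
      
      DM can then call blk_limits_max_hw_sectors() against a bare queue_limits
      without touching max_sectors directly.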
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@kernel.org
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead · e692cb66
      Committed by Martin K. Petersen
      When stacking devices, a request_queue is not always available. This
      forced us to have a no_cluster flag in the queue_limits that could be
      used as a carrier until the request_queue had been set up for a
      metadevice.
      
      There were several problems with that approach. First of all it was up
      to the stacking device to remember to set queue flag after stacking had
      completed. Also, the queue flag and the queue limits had to be kept in
      sync at all times. We got that wrong, which could lead to us issuing
      commands that went beyond the max scatterlist limit set by the driver.
      
      The proper fix is to avoid having two flags for tracking the same thing.
      We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
      block layer merging functions. The queue_limit 'no_cluster' is turned
      into 'cluster' to avoid double negatives and to ease stacking.
      Clustering defaults to being enabled as before. The queue flag logic is
      removed from the stacking function, and explicitly setting the cluster
      flag is no longer necessary in DM and MD.
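      
      After this change the limit itself is the single source of truth, so the
      block layer merging code can use an accessor of this shape (sketch):
      
          static inline unsigned int blk_queue_cluster(struct request_queue *q)
          {
              return q->limits.cluster;
          }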
      Reported-by: Ed Lin <ed.lin@promise.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • implement in-kernel gendisk events handling · 77ea887e
      Committed by Tejun Heo
      Currently, media presence polling for removable block devices is done
      from userland.  There are several issues with this.
      
      * Polling is done by periodically opening the device.  For SCSI
        devices, the command sequence generated by such an action involves a
        few different commands including TEST_UNIT_READY.  This behavior,
        while perfectly legal, is different from Windows, which only issues a
        single command, GET_EVENT_STATUS_NOTIFICATION.  Unfortunately, some
        ATAPI devices lock up after being periodically queried with such
        command sequences.
      
      * There is no reliable and unintrusive way for a userland program to
        tell whether the target device is safe for media presence polling.
        For example, polling for media presence during an on-going burning
        session can make it fail.  The polling program can avoid this by
        opening the device with O_EXCL, but then it risks making a valid
        exclusive user of the device fail with -EBUSY.
      
      * Userland polling is unnecessarily heavy and in-kernel implementation
        is lighter and better coordinated (workqueue, timer slack).
      
      This patch implements a framework for in-kernel disk event handling,
      which includes media presence polling.
      
      * bdops->check_events() is added, which supersedes ->media_changed().
        It should check whether there are any pending events and return them
        if so.  Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
        DISK_EVENT_EJECT_REQUEST.  ->check_events() is guaranteed not to be
        called in parallel.
      
      * gendisk->events and ->async_events are added.  These should be
        initialized by block driver before passing the device to add_disk().
        The former contains the mask of all supported events and the latter
        the mask of all events which the device can report without polling.
        /sys/block/*/events[_async] export these to userland.
      
      * Kernel parameter block.events_dfl_poll_msecs controls the system
        polling interval (default is 0 which means disable) and
        /sys/block/*/events_poll_msecs control polling intervals for
        individual devices (default is -1 meaning use system setting).  Note
        that if a device can report all supported events asynchronously and
        its polling interval isn't explicitly set, the device won't be
        polled regardless of the system polling interval.
      
      * If a device is opened exclusively with write access, event checking
        is automatically disabled until all write exclusive accesses are
        released.
      
      * There are event-'clearing' events.  For example, both of the currently
        defined events are cleared after the device has been successfully
        opened.  This information is passed to the ->check_events() callback
        using the @clearing argument as a hint.
      
      * Event checking is always performed from system_nrt_wq and timer
        slack is set to 25% for polling.
      
      * Nothing changes for drivers which implement ->media_changed() but
        not ->check_events().  Going forward, all drivers will be converted
        to ->check_events() and ->media_changed() will be dropped.
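      
      A hedged sketch of a driver hooking into the framework (the callback
      signature follows the description in this log; mydisk_media_changed()
      is a hypothetical hardware query):
      
          static unsigned int mydisk_check_events(struct gendisk *disk,
                                                  unsigned int clearing)
          {
              unsigned int events = 0;
      
              if (mydisk_media_changed(disk))
                  events |= DISK_EVENT_MEDIA_CHANGE;
              return events;
          }
      
          /* before add_disk(): declare what the device can report */
          disk->events = DISK_EVENT_MEDIA_CHANGE;  /* supported, via polling */
          disk->async_events = 0;                  /* nothing reported async */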
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: move register_disk() and del_gendisk() to block/genhd.c · d2bf1b67
      Committed by Tejun Heo
      There's no reason for register_disk() and del_gendisk() to be in
      fs/partitions/check.c.  Move both to genhd.c.  While at it, collapse
      unlink_gendisk(), which was artificially in a separate function due to
      genhd.c / check.c split, into del_gendisk().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: kill genhd_media_change_notify() · dddd9dc3
      Committed by Tejun Heo
      There's no user of the facility.  Kill it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  15. 13 December 2010, 1 commit
  16. 09 December 2010, 1 commit
  17. 02 December 2010, 2 commits
    • blk-throttle: Correct the placement of smp_rmb() · 04a6b516
      Committed by Vivek Goyal
      o I was discussing which variables are updated without a spin lock and
        why we need barriers, and Oleg pointed out that the smp_rmb() should sit
        between the read of td->limits_changed and the read of tg->limits_changed.
        This patch fixes it.
      
      o Following is one possible sequence of events. Say cpu0 is executing
        throtl_update_blkio_group_read_bps() and cpu1 is executing
        throtl_process_limit_change().
      
       cpu0                                                cpu1
      
       tg->limits_changed = true;
       smp_mb__before_atomic_inc();
       atomic_inc(&td->limits_changed);
      
                                           if (!atomic_read(&td->limits_changed))
                                                   return;
      
                                           if (tg->limits_changed)
                                                   do_something;
      
       If cpu0 has updated tg->limits_changed and td->limits_changed, we want to
       make sure that if the update to td->limits_changed is visible on cpu1, then
       the update to tg->limits_changed is visible as well.
      
       Oleg pointed out that to ensure this, we need to insert an smp_rmb() between
       the read of td->limits_changed and the read of tg->limits_changed.
      
      o I had erroneously put the smp_rmb() before atomic_read(&td->limits_changed).
        This patch fixes it.
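      
      The corrected reader-side ordering, schematically:
      
          if (!atomic_read(&td->limits_changed))
              return;
      
          /* pairs with smp_mb__before_atomic_inc() on the update side: if we
             saw td->limits_changed, we must also see the earlier store to
             tg->limits_changed */
          smp_rmb();
      
          if (tg->limits_changed)
              /* do_something */;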
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • blk-throttle: Trim/adjust slice_end once a bio has been dispatched · d1ae8ffd
      Committed by Vivek Goyal
      o During some testing I did the following and noticed that throttling stops working.
      
              - Put a very low limit on a cgroup, say 1 byte per second.
              - Start some reads, this will set slice_end to a very high value.
              - Change the limit to higher value say 1MB/s
              - Now IO unthrottles and finishes as expected.
              - Try to do the read again but IO is not limited to 1MB/s as expected.
      
      o What is happening:
              - The initial low value of the limit sets slice_end to a very high value.
              - When the limit is updated, slice_end is not truncated.
              - The very high value of slice_end keeps the existing slice valid
                for a very long time, and a new slice does not start.
              - tg_may_dispatch() is called in blk_throtl_bio(), and trim_slice()
                is not called in this path. So slice_start is some old value and
                practically we are able to do a huge amount of IO.
      
      o There are many ways it could be fixed. I have fixed it by adjusting/cleaning
        up slice_end in trim_slice(). Generally we extend slices if a bio is big and
        can't be dispatched in one slice. After a bio is dispatched, readjust
        slice_end to make sure we don't end up with huge values.
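      
      The shape of the trim (this condenses the logic of trim_slice() as
      described above; it is a sketch, not the literal patch):
      
          /* round the elapsed time down to whole slices and pull slice_end
             back, so a stale far-future value left over from an old low
             limit can't keep a slice alive indefinitely */
          time_elapsed = jiffies - tg->slice_start[rw];
          nr_slices = time_elapsed / throtl_slice;
          if (nr_slices) {
              tg->slice_start[rw] += nr_slices * throtl_slice;
              if (time_after(tg->slice_end[rw], jiffies + throtl_slice))
                  tg->slice_end[rw] = jiffies + throtl_slice;
          }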
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  18. 01 December 2010, 2 commits
  19. 29 November 2010, 1 commit
  20. 18 November 2010, 1 commit
  21. 16 November 2010, 2 commits
    • blk-cgroup: Allow creation of hierarchical cgroups · bdc85df7
      Committed by Vivek Goyal
      o Allow hierarchical cgroup creation for blkio controller
      
      o Currently we disallow it, as neither of the io controller policies
        (throttling as well as proportional bandwidth) supports hierarchical
        accounting and control. But the flip side is that the blkio controller
        cannot be used with libvirt, as libvirt creates a cgroup hierarchy deeper
        than 1 level.
      
        <top-level-cgroup-dir>/<controller>/libvirt/qemu/<virtual-machine-groups>
      
      o So this patch allows creation of a cgroup hierarchy, but at the backend
        everything will be treated as flat. So if somebody creates a hierarchy
        like the following:
      
      			root
      			/  \
      		     test1 test2
      			|
      		     test3
      
        CFQ and throttling will practically treat all groups as being at the same level.
      
      				pivot
      			     /  |   \  \
      			root  test1 test2  test3
      
      o Once we have actual support for hierarchical accounting and control,
        we can introduce another cgroup tunable file, "blkio.use_hierarchy",
        which will be 0 by default; if the user wants to enforce hierarchical
        control, it can be set to 1. This way there should not be any
        ABI problems down the line.
      
      o The only not-so-pretty part is the introduction of the extra file
        "use_hierarchy" down the line. Kame-san had mentioned that hierarchical
        accounting is expensive in the memory controller, hence they keep it off
        by default. I suspect the same will be the case for the IO controller, as
        for each IO completion we shall have to account the IO through the
        hierarchy up to the root. If so, then it is probably not a bad idea to
        introduce this extra file, so that it will be used only when somebody
        needs it, and some people might enable hierarchy only in part of the
        hierarchy.
      
      o This is basically how the memory controller also handles "use_hierarchy";
        they likewise allowed creation of hierarchies when actual backend support
        was not available.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Reviewed-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
      Tested-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>