1. 11 Sep 2009, 1 commit
    • block: use the same failfast bits for bio and request · a82afdfc
      Tejun Heo authored
      bio and request use the same set of failfast bits.  This patch makes
      the following changes to simplify things.
      
      * enumify BIO_RW* bits and reorder bits such that BIO_RW_FAILFAST_*
        bits coincide with __REQ_FAILFAST_* bits.
      
      * The above pushes BIO_RW_AHEAD out of sync with __REQ_FAILFAST_DEV
        but the matching is useless anyway.  init_request_from_bio() is
        responsible for setting FAILFAST bits on FS requests and non-FS
        requests never use BIO_RW_AHEAD.  Drop the code and comment from
        blk_rq_bio_prep().
      
      * Define REQ_FAILFAST_MASK, which is the OR of all FAILFAST bits, and
        simplify FAILFAST flag handling in init_request_from_bio() (see the
        sketch below).
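      
      A minimal sketch of the resulting layout (bit positions and the copy in
      init_request_from_bio() are shown for illustration, not quoted from the
      patch):
      
      	enum bio_rw_flags {
      		/* ... other BIO_RW_* bits ... */
      		BIO_RW_FAILFAST_DEV,		/* made to coincide with __REQ_FAILFAST_DEV */
      		BIO_RW_FAILFAST_TRANSPORT,
      		BIO_RW_FAILFAST_DRIVER,
      	};
      
      	#define REQ_FAILFAST_DEV	(1 << BIO_RW_FAILFAST_DEV)
      	#define REQ_FAILFAST_TRANSPORT	(1 << BIO_RW_FAILFAST_TRANSPORT)
      	#define REQ_FAILFAST_DRIVER	(1 << BIO_RW_FAILFAST_DRIVER)
      	#define REQ_FAILFAST_MASK \
      		(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
      
      	/* init_request_from_bio() can then copy all failfast bits at once: */
      	req->cmd_flags |= bio->bi_rw & REQ_FAILFAST_MASK;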
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  2. 02 Sep 2009, 1 commit
    • block: Allow changing max_sectors_kb above the default 512 · c295fc05
      Nikanth Karthikesan authored
      The patch "block: Use accessor functions for queue limits"
      (ae03bf63) changed queue_max_sectors_store()
      to use blk_queue_max_sectors() instead of directly assigning the value.
      
      But blk_queue_max_sectors() differs a bit:
      1. It sets both max_sectors_kb and max_hw_sectors_kb.
      2. It never allows one to raise max_sectors_kb above BLK_DEF_MAX_SECTORS.
         If one specifies a greater value, max_hw_sectors is set to that value
         but max_sectors is capped at BLK_DEF_MAX_SECTORS.
      
      I am not sure whether blk_queue_max_sectors() should be changed, as it
      seems to have behaved that way for a long time, and there may be callers
      dependent on that behaviour.
      
      This patch simply reverts to the older behaviour of directly assigning
      the value to max_sectors (sketched below).
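      
      A sketch of the reverted store path (names follow blk-sysfs.c; the bounds
      checks are abbreviated):
      
      	static ssize_t queue_max_sectors_store(struct request_queue *q,
      					       const char *page, size_t count)
      	{
      		unsigned long max_sectors_kb;
      		unsigned long max_hw_sectors_kb = queue_max_hw_sectors(q) >> 1;
      		ssize_t ret = queue_var_store(&max_sectors_kb, page, count);
      
      		if (max_sectors_kb > max_hw_sectors_kb)
      			return -EINVAL;
      
      		/* direct assignment again, instead of blk_queue_max_sectors() */
      		spin_lock_irq(q->queue_lock);
      		q->limits.max_sectors = max_sectors_kb << 1;
      		spin_unlock_irq(q->queue_lock);
      
      		return ret;
      	}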
      Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  3. 05 Aug 2009, 1 commit
  4. 01 Aug 2009, 4 commits
  5. 29 Jul 2009, 1 commit
    • block: make the end_io functions be non-GPL exports · 56ad1740
      Jens Axboe authored
      Prior to the change for more sane end_io functions, we exported
      the helpers with the normal EXPORT_SYMBOL(). That got changed
      to _GPL() for the new interface. Revert that particular change,
      on the basis that this is basic functionality and doesn't dip
      into internal structures. If these exports can't be non-GPL,
      then we may as well make EXPORT_SYMBOL() imply GPL for
      everything.
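      
      For illustration, the revert amounts to switching the export macro on the
      end_io helpers back, e.g. (symbol names assumed from the end_io interface,
      not quoted from the patch):
      
      	/* basic functionality: export without the GPL-only restriction */
      	EXPORT_SYMBOL(blk_end_request);
      	EXPORT_SYMBOL(__blk_end_request);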
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  6. 28 Jul 2009, 2 commits
  7. 17 Jul 2009, 2 commits
    • block: sysfs fix mismatched queue_var_{store,show} in 64bit kernel · 9cb308ce
      Xiaotian Feng authored
      In blk-sysfs.c, queue_var_store uses an unsigned long to store data,
      but queue_var_show uses an unsigned int to show it.  This causes:
      
      	# echo 70000000000 > /sys/block/<dev>/queue/read_ahead_kb
      	# cat /sys/block/<dev>/queue/read_ahead_kb  => reads back a wrong value
      
      Fix it by using unsigned long.
      
      While at it, convert queue_rq_affinity_show() to use a bool variable
      instead of explicit != 0 testing (see the sketch below).
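      
      A sketch of both fixes (close to the blk-sysfs.c shapes, abbreviated here):
      
      	static ssize_t queue_var_show(unsigned long var, char *page)
      	{
      		return sprintf(page, "%lu\n", var);	/* was: unsigned int */
      	}
      
      	static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
      	{
      		bool set = test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags);
      
      		return queue_var_show(set, page);
      	}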
      Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • block: fix failfast merge testing in elv_rq_merge_ok() · 0a09f431
      Tejun Heo authored
      Commit ab0fd1de tries to prevent merging
      of requests with different failfast settings.  In elv_rq_merge_ok(),
      it compares the new bio's failfast flags against the merge target
      request's.  However, the flag-testing accessors for bio and blk don't
      return booleans but the tested bit values directly, and since the FAILFAST
      bit positions on bio and blk don't match, directly comparing them with ==
      yields false negatives, unnecessarily preventing merges of readahead
      requests.
      
      This patch converts the results to booleans by negating them before
      comparison (sketched below).
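      
      A sketch of the corrected test in elv_rq_merge_ok() (accessor names as in
      the failfast merge check; shown for illustration):
      
      	/* normalize the bit-test results to 0/1 before comparing */
      	if (!!bio_failfast_dev(bio)	  != !!blk_failfast_dev(rq) ||
      	    !!bio_failfast_transport(bio) != !!blk_failfast_transport(rq) ||
      	    !!bio_failfast_driver(bio)	  != !!blk_failfast_driver(rq))
      		return 0;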
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jeff Garzik <jeff@garzik.org>
  8. 11 Jul 2009, 2 commits
  9. 04 Jul 2009, 1 commit
    • block: don't merge requests of different failfast settings · ab0fd1de
      Tejun Heo authored
      The block layer used to merge requests and bios with different failfast
      settings.  This caused regular IOs to fail prematurely when they were
      merged into failfast requests for readahead.
      
      Niel Lambrechts could trigger the problem semi-reliably on ext4 when
      resuming from STR (suspend-to-RAM).  ext4 uses readahead when reading
      inodes, and combined with the deterministic extra SATA PHY exception
      cycle during resume on that specific configuration, a non-readahead
      inode read would fail, causing ext4 errors.  Please read the following
      thread for details.
      
        http://lkml.org/lkml/2009/5/23/21
      
      This patch makes the block layer reject merging if the failfast settings
      don't match (see the sketch below).  This is correct but likely to lower
      IO performance by preventing regular IOs from mingling into surrounding
      readahead requests.  Changes to allow such mixed merges and handle errors
      correctly will be added later.
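      
      A sketch of the rejection added to the merge path (accessor names assumed;
      the comparison was later normalized to booleans by 0a09f431 above):
      
      	/* don't allow a merge if the FAILFAST settings differ */
      	if (bio_failfast_dev(bio)	!= blk_failfast_dev(rq) ||
      	    bio_failfast_transport(bio) != blk_failfast_transport(rq) ||
      	    bio_failfast_driver(bio)	!= blk_failfast_driver(rq))
      		return 0;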
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Niel Lambrechts <niel.lambrechts@gmail.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Signed-off-by: Jens Axboe <axboe@carl.(none)>
  10. 01 Jul 2009, 6 commits
  11. 21 Jun 2009, 1 commit
  12. 19 Jun 2009, 2 commits
  13. 18 Jun 2009, 1 commit
  14. 16 Jun 2009, 7 commits
  15. 12 Jun 2009, 1 commit
  16. 11 Jun 2009, 2 commits
    • block: add request clone interface (v2) · b0fd271d
      Kiyoshi Ueda authored
      This patch adds the following 2 interfaces for request-stacking drivers:
      
        - blk_rq_prep_clone(struct request *clone, struct request *orig,
      		      struct bio_set *bs, gfp_t gfp_mask,
      		      int (*bio_ctr)(struct bio *, struct bio*, void *),
      		      void *data)
            * Clones bios in the original request to the clone request
              (bio_ctr is called for each cloned bio.)
            * Copies attributes of the original request to the clone request.
              The actual data parts (e.g. ->cmd, ->buffer, ->sense) are not
              copied.
      
        - blk_rq_unprep_clone(struct request *clone)
            * Frees cloned bios from the clone request.
      
      Request stacking drivers (e.g. request-based dm) need to make a clone
      request for a submitted request and dispatch it to other devices.
      
      To allocate a request for the clone, request stacking drivers may not
      be able to use blk_get_request() because the allocation may be done
      in an irq-disabled context.  So blk_rq_prep_clone() takes a request
      allocated by the caller as an argument.
      
      For each clone bio in the clone request, request stacking drivers
      should be able to set up their own completion handler.
      So blk_rq_prep_clone() takes a callback function which is called
      for each clone bio, and a pointer for private data which is passed
      to the callback.
      
      NOTE:
      blk_rq_prep_clone() doesn't copy any actual data of the original
      request.  Pages are shared between original bios and cloned bios.
      So the caller must not complete the original request before the clone
      request.  (A usage sketch follows.)
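      
      A hedged usage sketch (the my_* names, the mempool, and the completion
      handler are hypothetical; only the two interfaces above are from the patch):
      
      	static int my_bio_ctr(struct bio *clone, struct bio *orig, void *data)
      	{
      		/* per-clone-bio setup: install our own completion handler */
      		clone->bi_end_io  = my_clone_end_io;
      		clone->bi_private = data;
      		return 0;
      	}
      
      	/* 'clone' was allocated by the caller (e.g. from a mempool), since
      	 * blk_get_request() may not be usable in irq-disabled context */
      	if (blk_rq_prep_clone(clone, orig_rq, my_bio_set, GFP_ATOMIC,
      			      my_bio_ctr, my_ctx))
      		return -ENOMEM;	/* hypothetical error path */
      
      	/* ... dispatch 'clone' to the underlying device ... */
      
      	/* on teardown, free the cloned bios again */
      	blk_rq_unprep_clone(clone);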
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: prevent possible io_context->refcount overflow · d9c7d394
      Nikanth Karthikesan authored
      Currently io_context has an atomic_t (32-bit) as its refcount.  In the
      case of cfq, a reference to the io_context is taken for each device
      against which a task does I/O.  Multiple processes sharing an io_context
      (CLONE_IO) also each hold a reference to the same io_context.
      
      Theoretically, the maximum number of processes sharing the same io_context
      plus the number of disks/cfq_data referring to the same io_context could
      overflow the 32-bit counter on a very high-end machine.
      
      Even though it is an improbable case, let us make it an atomic_long_t
      (sketched below).
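      
      A minimal sketch of the change (surrounding fields and the release path are
      abbreviated; atomic_long_* are the standard kernel helpers):
      
      	struct io_context {
      		atomic_long_t refcount;		/* was: atomic_t */
      		/* ... remaining fields unchanged ... */
      	};
      
      	/* take and drop references with the long variants */
      	atomic_long_inc(&ioc->refcount);
      	if (atomic_long_dec_and_test(&ioc->refcount))
      		kfree(ioc);	/* illustrative release */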
      Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  17. 10 Jun 2009, 1 commit
    • tracing/events: convert block trace points to TRACE_EVENT() · 55782138
      Li Zefan authored
      TRACE_EVENT is a more generic way to define tracepoints.  Converting the
      block tracepoints to it adds these new capabilities (a minimal
      TRACE_EVENT sketch appears at the end of this message):
      
        - zero-copy and per-cpu splice() tracing
        - binary tracing without printf overhead
        - structured logging records exposed under /debug/tracing/events
        - trace events embedded in function tracer output and other plugins
        - user-defined, per tracepoint filter expressions
        ...
      
      Cons:
      
        - no dev_t info for the output of plug, unplug_timer and unplug_io events;
          no dev_t info for getrq and sleeprq events if bio == NULL;
          no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
      
          This is mainly because we can't get the device from a request queue.
          But this may change in the future.
      
        - A packet command is converted to a string in TP_assign, not TP_print,
          while blktrace does the conversion just before output.
      
          Since pc requests should be rather rare, this is not a big issue.
      
        - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
          has a single format, which means we have some unused data in a trace entry.
      
          The overhead is minimized by using __dynamic_array() instead of __array().
      
      I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
      
            dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
      1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
      2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
      3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s
      
      So the overhead of tracing is very small, and there is no regression when
      using these trace events vs blktrace.
      
      And the binary output of TRACE_EVENT is much smaller than blktrace:
      
       # ls -l -h
       -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
       -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
       -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
      
      Following are some comparisons between TRACE_EVENT and blktrace:
      
      plug:
        kjournald-480   [000]   303.084981: block_plug: [kjournald]
        kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]
      
      unplug_io:
        kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
        kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1
      
      remap:
        kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
        kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384
      
      bio_backmerge:
        kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
        kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]
      
      getrq:
        kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]
      
        bash-2066  [001]  1072.953770:   8,0    G   N [bash]
        bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
      
      rq_complete:
        konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
        konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]
      
        ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
        ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
      
      rq_insert:
        kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]
      
      Changelog from v2 -> v3:
      
      - use the newly introduced __dynamic_array().
      
      Changelog from v1 -> v2:
      
        - use __string() instead of __array() to minimize the memory required
          to store the hex dump of rq->cmd.
      
      - support large pc requests.
      
      - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
      
      - some cleanups.
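      
      For reference, a minimal TRACE_EVENT in the shape this conversion uses (a
      simplified block_plug-style event, not the exact definition from the patch):
      
      	TRACE_EVENT(block_plug,
      
      		TP_PROTO(struct request_queue *q),
      
      		TP_ARGS(q),
      
      		TP_STRUCT__entry(
      			__array(char, comm, TASK_COMM_LEN)
      		),
      
      		TP_fast_assign(
      			memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
      		),
      
      		TP_printk("[%s]", __entry->comm)
      	);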
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  18. 09 Jun 2009, 4 commits