1. 24 3月, 2009 1 次提交
  2. 06 3月, 2009 1 次提交
  3. 26 2月, 2009 2 次提交
    • J
      block: reduce stack footprint of blk_recount_segments() · 1e428079
      Jens Axboe 提交于
      blk_recalc_rq_segments() requires a request structure passed in, which
      we don't have from blk_recount_segments(). So the latter allocates one on
      the stack, using > 400 bytes of stack for that. This can cause us to spill
      over one page of stack from ext4 at least:
      
       0)     4560     400   blk_recount_segments+0x43/0x62
       1)     4160      32   bio_phys_segments+0x1c/0x24
       2)     4128      32   blk_rq_bio_prep+0x2a/0xf9
       3)     4096      32   init_request_from_bio+0xf9/0xfe
       4)     4064     112   __make_request+0x33c/0x3f6
       5)     3952     144   generic_make_request+0x2d1/0x321
       6)     3808      64   submit_bio+0xb9/0xc3
       7)     3744      48   submit_bh+0xea/0x10e
       8)     3696     368   ext4_mb_init_cache+0x257/0xa6a [ext4]
       9)     3328     288   ext4_mb_regular_allocator+0x421/0xcd9 [ext4]
      10)     3040     160   ext4_mb_new_blocks+0x211/0x4b4 [ext4]
      11)     2880     336   ext4_ext_get_blocks+0xb61/0xd45 [ext4]
      12)     2544      96   ext4_get_blocks_wrap+0xf2/0x200 [ext4]
      13)     2448      80   ext4_da_get_block_write+0x6e/0x16b [ext4]
      14)     2368     352   mpage_da_map_blocks+0x7e/0x4b3 [ext4]
      15)     2016     352   ext4_da_writepages+0x2ce/0x43c [ext4]
      16)     1664      32   do_writepages+0x2d/0x3c
      17)     1632     144   __writeback_single_inode+0x162/0x2cd
      18)     1488      96   generic_sync_sb_inodes+0x1e3/0x32b
      19)     1392      16   sync_sb_inodes+0xe/0x10
      20)     1376      48   writeback_inodes+0x69/0xb3
      21)     1328     208   balance_dirty_pages_ratelimited_nr+0x187/0x2f9
      22)     1120     224   generic_file_buffered_write+0x1d4/0x2c4
      23)      896     176   __generic_file_aio_write_nolock+0x35f/0x393
      24)      720      80   generic_file_aio_write+0x6c/0xc8
      25)      640      80   ext4_file_write+0xa9/0x137 [ext4]
      26)      560     320   do_sync_write+0xf0/0x137
      27)      240      48   vfs_write+0xb3/0x13c
      28)      192      64   sys_write+0x4c/0x74
      29)      128     128   system_call_fastpath+0x16/0x1b
      
      Split the segment counting out into a __blk_recalc_rq_segments() helper
      to avoid allocating an onstack request just for checking the physical
      segment count.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      1e428079
    • M
      block: add documentation for register_blkdev() · 9e8c0bcc
      Márton Németh 提交于
      Add documentation for register_blkdev() function and for the parameters.
      Signed-off-by: NMárton Németh <nm127@freemail.hu>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      9e8c0bcc
  4. 18 2月, 2009 4 次提交
    • H
      block: fix deadlock in blk_abort_queue() for drivers that readd to timeout list · be987fdb
      Hannes Reinecke 提交于
      blk_abort_queue() iterates the timeout list and aborts each request on the
      list, but if the driver error handling readds a request to the timeout list
      during this processing, we could be looping forever. Fix this by splicing
      current entries to a local list and run over that list instead.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      be987fdb
    • N
      block: fix booting from partitioned md array · 41b8c853
      Neil Brown 提交于
      Hi Tejun,
      
       it looks like your commit:
      
         block: don't depend on consecutive minor space
         f331c029
      
       broke a particular case for booting from partitioned md/raid devices.
       That is the second time this has been broken recently.  The previous
       time was fixed by
      
         block: do_mounts - accept root=<non-existant partition>
         30f2f0eb
      
       Because the data isn't available when an md device is first created
       (we add disks and set it up after creation), the initial partition
       scan finds nothing.  It is not until the device is opened that
       another partition scan happens and finds something.
      
       So at the point where the kernel parameter "root=/dev/md_d0p1" is
       being parsed, md_d0 exists, but md_d0p1 does not.
       However if we let blk_lookup_devt return the correct device number
       even though the device doesn't exist, then the attempt to mount it
       will successfully find the partition.
      
       I have tried in the past to find a way to get the partition table to
       be read as soon as the array is assembled but that proved impossible
       (at the time).  I don't remember the details, and could possibly
       revisit it.  However it would be really nice if blk_lookup_devt
       could be adjusted to again accept non existant partitions.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      41b8c853
    • J
      block: fix bad definition of BIO_RW_SYNC · 93dbb393
      Jens Axboe 提交于
      We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
      and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
      213d9417.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      93dbb393
    • B
      bsg: Fix sense buffer bug in SG_IO · c1c20120
      Boaz Harrosh 提交于
      When submitting requests via SG_IO, which does a sync io, a
      bsg_command is not allocated. So an in-Kernel sense_buffer was not
      set. However when calling blk_execute_rq() with no sense buffer
      one is provided from the stack. Now bsg at blk_complete_sgv4_hdr_rq()
      would check if rq->sense_len and a sense was requested by sg_io_v4
      the rq->sense was copy_user() back, but by now it is already mangled
      stack memory.
      
      I have fixed that by forcing a sense_buffer when calling bsg_map_hdr().
      The bsg_command->sense is provided in the write/read path like before,
      and on-the-stack buffer is provided when doing SG_IO.
      
      I have also fixed a dprintk message to print rq->errors in hex because
      of the scsi bit-field use of this member. For other block devices it
      does not matter anyway.
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      Acked-by: NFUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      c1c20120
  5. 02 2月, 2009 1 次提交
  6. 30 1月, 2009 8 次提交
  7. 07 1月, 2009 2 次提交
  8. 03 1月, 2009 2 次提交
  9. 29 12月, 2008 19 次提交
    • J
      cfq-iosched: fix race between exiting queue and exiting task · 62c1fe9d
      Jens Axboe 提交于
      Original patch from Nikanth Karthikesan <knikanth@suse.de>
      
      When a queue exits the queue lock is taken and cfq_exit_queue() would free all
      the cic's associated with the queue.
      
      But when a task exits, cfq_exit_io_context() gets cic one by one and then
      locks the associated queue to call __cfq_exit_single_io_context. It looks like
      between getting a cic from the ioc and locking the queue, the queue might have
      exited on another cpu.
      
      Fix this by rechecking the cfq_io_context queue key inside the queue lock
      again, and not calling into __cfq_exit_single_io_context() if somebody
      beat us to it.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      62c1fe9d
    • J
      Get rid of CONFIG_LSF · b3a6ffe1
      Jens Axboe 提交于
      We have two seperate config entries for large devices/files. One
      is CONFIG_LBD that guards just the devices, the other is CONFIG_LSF
      that handles large files. This doesn't make a lot of sense, you typically
      want both or none. So get rid of CONFIG_LSF and change CONFIG_LBD wording
      to indicate that it covers both.
      Acked-by: NJean Delvare <khali@linux-fr.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      b3a6ffe1
    • R
      block: make blk_softirq_init() static · 3c18ce71
      Roel Kluin 提交于
      Sparse asked whether these could be static.
      Signed-off-by: NRoel Kluin <roel.kluin@gmail.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      3c18ce71
    • F
      block: use min_not_zero in blk_queue_stack_limits · 18af8b2c
      FUJITA Tomonori 提交于
      zero is invalid for max_phys_segments, max_hw_segments, and
      max_segment_size. It's better to use use min_not_zero instead of
      min. min() works though (because the commit 0e435ac2 makes sure that
      these values are set to the default values, non zero, if a queue is
      initialized properly).
      
      With this patch, blk_queue_stack_limits does the almost same thing
      that dm's combine_restrictions_low() does. I think that it's easy to
      remove dm's combine_restrictions_low.
      Signed-off-by: NFUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      18af8b2c
    • J
      block: add one-hit cache for disk partition lookup · a6f23657
      Jens Axboe 提交于
      disk_map_sector_rcu() returns a partition from a sector offset,
      which we use for IO statistics on a per-partition basis. The
      lookup itself is an O(N) list lookup, where N is the number of
      partitions. This actually hurts performance quite a bit, even
      on the lower end partitions. On higher numbered partitions,
      it can get pretty bad.
      
      Solve this by adding a one-hit cache for partition lookup.
      This makes the lookup O(1) for the case where we do most IO to
      one partition. Even for mixed partition workloads, amortized cost
      is pretty close to O(1) since the natural IO batching makes the
      one-hit cache last for lots of IOs.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      a6f23657
    • J
      cfq-iosched: remove limit of dispatch depth of max 4 times quantum · 30e0dc28
      Jens Axboe 提交于
      This basically limits the hardware queue depth to 4*quantum at any
      point in time, which is 16 with the default settings. As CFQ uses
      other means to shrink the hardware queue when necessary in the first
      place, there's really no need for this extra heuristic. Additionally,
      it ends up hurting performance in some cases.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      30e0dc28
    • J
      block: get rid of elevator_t typedef · b374d18a
      Jens Axboe 提交于
      Just use struct elevator_queue everywhere instead.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      b374d18a
    • J
      block: don't use plugging on SSD devices · a31a9738
      Jens Axboe 提交于
      We just want to hand the first bits of IO to the device as fast
      as possible. Gains a few percent on the IOPS rate.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      a31a9738
    • T
      block: fix empty barrier on write-through w/ ordered tag · a185eb4b
      Tejun Heo 提交于
      Empty barrier on write-through (or no cache) w/ ordered tag has no
      command to execute and without any command to execute ordered tag is
      never issued to the device and the ordering is never achieved.  Force
      draining for such cases.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      a185eb4b
    • T
      block: simplify empty barrier implementation · 58eea927
      Tejun Heo 提交于
      Empty barrier required special handling in __elv_next_request() to
      complete it without letting the low level driver see it.
      
      With previous changes, barrier code is now flexible enough to skip the
      BAR step using the same barrier sequence selection mechanism.  Drop
      the special handling and mask off q->ordered from start_ordered().
      
      Remove blk_empty_barrier() test which now has no user.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      58eea927
    • T
      block: make barrier completion more robust · 8f11b3e9
      Tejun Heo 提交于
      Barrier completion had the following assumptions.
      
      * start_ordered() couldn't finish the whole sequence properly.  If all
        actions are to be skipped, q->ordseq is set correctly but the actual
        completion was never triggered thus hanging the barrier request.
      
      * Drain completion in elv_complete_request() assumed that there's
        always at least one request in the queue when drain completes.
      
      Both assumptions are true but these assumptions need to be removed to
      improve empty barrier implementation.  This patch makes the following
      changes.
      
      * Make start_ordered() use blk_ordered_complete_seq() to mark skipped
        steps complete and notify __elv_next_request() that it should fetch
        the next request if the whole barrier has completed inside
        start_ordered().
      
      * Make drain completion path in elv_complete_request() check whether
        the queue is empty.  Empty queue also indicates drain completion.
      
      * While at it, convert 0/1 return from blk_do_ordered() to false/true.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      8f11b3e9
    • T
      block: make every barrier action optional · f671620e
      Tejun Heo 提交于
      In all barrier sequences, the barrier write itself was always assumed
      to be issued and thus didn't have corresponding control flag.  This
      patch adds QUEUE_ORDERED_DO_BAR and unify action mask handling in
      start_ordered() such that any barrier action can be skipped.
      
      This patch doesn't introduce any visible behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      f671620e
    • T
      block: remove duplicate or unused barrier/discard error paths · a7384677
      Tejun Heo 提交于
      * Because barrier mode can be changed dynamically, whether barrier is
        supported or not can be determined only when actually issuing the
        barrier and there is no point in checking it earlier.  Drop barrier
        support check in generic_make_request() and __make_request(), and
        update comment around the support check in blk_do_ordered().
      
      * There is no reason to check discard support in both
        generic_make_request() and __make_request().  Drop the check in
        __make_request().  While at it, move error action block to the end
        of the function and add unlikely() to q existence test.
      
      * Barrier request, be it empty or not, is never passed to low level
        driver and thus it's meaningless to try to copy back req->sector to
        bio->bi_sector on error.  In addition, the notion of failed sector
        doesn't make any sense for empty barrier to begin with.  Drop the
        code block from __end_that_request_first().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      a7384677
    • T
      block: reorganize QUEUE_ORDERED_* constants · 313e4299
      Tejun Heo 提交于
      Separate out ordering type (drain,) and action masks (preflush,
      postflush, fua) from visible ordering mode selectors
      (QUEUE_ORDERED_*).  Ordering types are now named QUEUE_ORDERED_BY_*
      while action masks are named QUEUE_ORDERED_DO_*.
      
      This change is necessary to add QUEUE_ORDERED_DO_BAR and make it
      optional to improve empty barrier implementation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      313e4299
    • C
      block: use cancel_work_sync() instead of kblockd_flush_work() · 64d01dc9
      Cheng Renquan 提交于
      After many improvements on kblockd_flush_work, it is now identical to
      cancel_work_sync, so a direct call to cancel_work_sync is suggested.
      
      The only difference is that cancel_work_sync is a GPL symbol,
      so no non-GPL modules anymore.
      Signed-off-by: NCheng Renquan <crquan@gmail.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      64d01dc9
    • K
      block: Supress Buffer I/O errors when SCSI REQ_QUIET flag set · 08bafc03
      Keith Mannthey 提交于
      Allow the scsi request REQ_QUIET flag to be propagated to the buffer
      file system layer. The basic ideas is to pass the flag from the scsi
      request to the bio (block IO) and then to the buffer layer.  The buffer
      layer can then suppress needless printks.
      
      This patch declutters the kernel log by removed the 40-50 (per lun)
      buffer io error messages seen during a boot in my multipath setup . It
      is a good chance any real errors will be missed in the "noise" it the
      logs without this patch.
      
      During boot I see blocks of messages like
      "
      __ratelimit: 211 callbacks suppressed
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242847
      Buffer I/O error on device sdm, logical block 1
      Buffer I/O error on device sdm, logical block 5242878
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242872
      "
      in my logs.
      
      My disk environment is multipath fiber channel using the SCSI_DH_RDAC
      code and multipathd.  This topology includes an "active" and "ghost"
      path for each lun. IO's to the "ghost" path will never complete and the
      SCSI layer, via the scsi device handler rdac code, quick returns the IOs
      to theses paths and sets the REQ_QUIET scsi flag to suppress the scsi
      layer messages.
      
       I am wanting to extend the QUIET behavior to include the buffer file
      system layer to deal with these errors as well. I have been running this
      patch for a while now on several boxes without issue.  A few runs of
      bonnie++ show no noticeable difference in performance in my setup.
      
      Thanks for John Stultz for the quiet_error finalization.
      Submitted-by: NKeith Mannthey <kmannth@us.ibm.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      08bafc03
    • W
      block: don't take lock on changing ra_pages · 7c239517
      Wu Fengguang 提交于
      There's no need to take queue_lock or kernel_lock when modifying
      bdi->ra_pages. So remove them. Also remove out of date comment for
      queue_max_sectors_store().
      Signed-off-by: NWu Fengguang <wfg@linux.intel.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      7c239517
    • Q
      block/blk-tag.c: cleanup kernel-doc · c6a06f70
      Qinghuang Feng 提交于
      There is no argument named @tags in blk_init_tags,
      remove its' comment.
      Signed-off-by: NQinghuang Feng <qhfeng.kernel@gmail.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      c6a06f70
    • M
      scsi-ioctl: use clock_t <> jiffies · 2b91bafc
      Milton Miller 提交于
      Convert the timeout ioctl scalling to use the clock_t functions
      which are much more accurate with some USER_HZ vs HZ combinations.
      Signed-off-by: NMilton Miller <miltonm@bga.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      2b91bafc