1. 09 12月, 2016 1 次提交
    • C
      block: improve handling of the magic discard payload · f9d03f96
      Christoph Hellwig 提交于
      Instead of allocating a single unused biovec for discard requests, send
      them down without any payload.  Instead we allow the driver to add a
      "special" payload using a biovec embedded into struct request (unioned
      over other fields never used while in the driver), and overloading
      the number of segments for this case.
      
      This has a couple of advantages:
      
       - we don't have to allocate the bio_vec
       - the amount of special casing for discard requests in the block
         layer is significantly reduced
       - using this same scheme for other request types is trivial,
         which will be important for implementing the new WRITE_ZEROES
         op on devices where it actually requires a payload (e.g. SCSI)
       - we can get rid of playing games with the request length, as
         we'll never touch it and completions will work just fine
       - it will allow us to support ranged discard operations in the
         future by merging non-contiguous discard bios into a single
         request
       - last but not least it removes a lot of code
      
      This patch is the common base for my WIP series for ranges discards and to
      remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
      so it would be good to get it in quickly.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f9d03f96
  2. 01 12月, 2016 2 次提交
  3. 28 10月, 2016 1 次提交
    • C
      block: better op and flags encoding · ef295ecf
      Christoph Hellwig 提交于
      Now that we don't need the common flags to overflow outside the range
      of a 32-bit type we can encode them the same way for both the bio and
      request fields.  This in addition allows us to place the operation
      first (and make some room for more ops while we're at it) and to
      stop having to shift around the operation values.
      
      In addition this allows passing around only one value in the block layer
      instead of two (and eventuall also in the file systems, but we can do
      that later) and thus clean up a lot of code.
      
      Last but not least this allows decreasing the size of the cmd_flags
      field in struct request to 32-bits.  Various functions passing this
      value could also be updated, but I'd like to avoid the churn for now.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      ef295ecf
  4. 12 10月, 2016 1 次提交
  5. 21 7月, 2016 2 次提交
  6. 09 6月, 2016 1 次提交
  7. 08 6月, 2016 4 次提交
  8. 06 5月, 2016 1 次提交
    • M
      block: reinstate early return of -EOPNOTSUPP from blkdev_issue_discard · bbd848e0
      Mike Snitzer 提交于
      Commit 38f25255 ("block: add __blkdev_issue_discard") incorrectly
      disallowed the early return of -EOPNOTSUPP if the device doesn't support
      discard (or secure discard).  This early return of -EOPNOTSUPP has
      always been part of blkdev_issue_discard() interface so there isn't a
      good reason to break that behaviour -- especially when it can be easily
      reinstated.
      
      The nuance of allowing early return of -EOPNOTSUPP vs disallowing late
      return of -EOPNOTSUPP is: if the overall device never advertised support
      for discards and one is issued to the device it is beneficial to inform
      the caller that discards are not supported via -EOPNOTSUPP.  But if a
      device advertises discard support it means that at least a subset of the
      device does have discard support -- but it could be that discards issued
      to some regions of a stacked device will not be supported.  In that case
      the late return of -EOPNOTSUPP must be disallowed.
      
      Fixes: 38f25255 ("block: add __blkdev_issue_discard")
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      bbd848e0
  9. 02 5月, 2016 2 次提交
  10. 28 10月, 2015 1 次提交
  11. 14 8月, 2015 1 次提交
  12. 29 7月, 2015 1 次提交
    • C
      block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig 提交于
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not beeing persistent
      when bios are queued up, and are not passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4246a0b6
  13. 06 2月, 2015 1 次提交
  14. 22 1月, 2015 1 次提交
    • M
      block: Add discard flag to blkdev_issue_zeroout() function · d93ba7a5
      Martin K. Petersen 提交于
      blkdev_issue_discard() will zero a given block range. This is done by
      way of explicit writing, thus provisioning or allocating the blocks on
      disk.
      
      There are use cases where the desired behavior is to zero the blocks but
      unprovision them if possible. The blocks must deterministically contain
      zeroes when they are subsequently read back.
      
      This patch adds a flag to blkdev_issue_zeroout() that provides this
      variant. If the discard flag is set and a block device guarantees
      discard_zeroes_data we will use REQ_DISCARD to clear the block range. If
      the device does not support discard_zeroes_data or if the discard
      request fails we will fall back to first REQ_WRITE_SAME and then a
      regular REQ_WRITE.
      
      Also update the callers of blkdev_issue_zero() to reflect the new flag
      and make sb_issue_zeroout() prefer the discard approach.
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      d93ba7a5
  15. 27 5月, 2014 1 次提交
  16. 13 2月, 2014 1 次提交
  17. 24 11月, 2013 1 次提交
    • K
      block: Abstract out bvec iterator · 4f024f37
      Kent Overstreet 提交于
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>6
      4f024f37
  18. 09 11月, 2013 1 次提交
    • G
      block: Do not call sector_div() with a 64-bit divisor · 97597dc0
      Geert Uytterhoeven 提交于
      do_div() (called by sector_div() if CONFIG_LBDAF=y) is meant for divisions
      of 64-bit number by 32-bit numbers.  Passing 64-bit divisor types caused
      issues in the past on 32-bit platforms, cfr. commit
      ea077b1b ("m68k: Truncate base in
      do_div()").
      
      As queue_limits.max_discard_sectors and .discard_granularity are unsigned
      int, max_discard_sectors and granularity should be unsigned int.
      As bdev_discard_alignment() returns int, alignment should be int.
      Now 2 calls to sector_div() can be replaced by 32-bit arithmetic:
        - The 64-bit modulo operation can become a 32-bit modulo operation,
        - The 64-bit division and multiplication can be replaced by a 32-bit
          modulo operation and a subtraction.
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      97597dc0
  19. 15 2月, 2013 1 次提交
    • V
      block: account iowait time when waiting for completion of IO request · 5577022f
      Vladimir Davydov 提交于
      Using wait_for_completion() for waiting for a IO request to be executed
      results in wrong iowait time accounting. For example, a system having
      the only task doing write() and fdatasync() on a block device can be
      reported being idle instead of iowaiting as it should because
      blkdev_issue_flush() calls wait_for_completion() which in turn calls
      schedule() that does not increment the iowait proc counter and thus does
      not turn on iowait time accounting.
      
      The patch makes block layer use wait_for_completion_io() instead of
      wait_for_completion() where appropriate to account iowait time
      correctly.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5577022f
  20. 15 12月, 2012 2 次提交
  21. 20 9月, 2012 2 次提交
  22. 02 8月, 2012 2 次提交
    • P
      block: split discard into aligned requests · c6e66634
      Paolo Bonzini 提交于
      When a disk has large discard_granularity and small max_discard_sectors,
      discards are not split with optimal alignment.  In the limit case of
      discard_granularity == max_discard_sectors, no request could be aligned
      correctly, so in fact you might end up with no discarded logical blocks
      at all.
      
      Another example that helps showing the condition in the patch is with
      discard_granularity == 64, max_discard_sectors == 128.  A request that is
      submitted for 256 sectors 2..257 will be split in two: 2..129, 130..257.
      However, only 2 aligned blocks out of 3 are included in the request;
      128..191 may be left intact and not discarded.  With this patch, the
      first request will be truncated to ensure good alignment of what's left,
      and the split will be 2..127, 128..255, 256..257.  The patch will also
      take into account the discard_alignment.
      
      At most one extra request will be introduced, because the first request
      will be reduced by at most granularity-1 sectors, and granularity
      must be less than max_discard_sectors.  Subsequent requests will run
      on round_down(max_discard_sectors, granularity) sectors, as in the
      current code.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Tested-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c6e66634
    • P
      block: reorganize rounding of max_discard_sectors · f6ff53d3
      Paolo Bonzini 提交于
      Mostly a preparation for the next patch.
      
      In principle this fixes an infinite loop if max_discard_sectors < granularity,
      but that really shouldn't happen.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Tested-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f6ff53d3
  23. 24 7月, 2011 1 次提交
  24. 07 7月, 2011 1 次提交
    • M
      block: eliminate potential for infinite loop in blkdev_issue_discard · 0f799603
      Mike Snitzer 提交于
      Due to the recently identified overflow in read_capacity_16() it was
      possible for max_discard_sectors to be zero but still have discards
      enabled on the associated device's queue.
      
      Eliminate the possibility for blkdev_issue_discard to infinitely loop.
      
      Interestingly this issue wasn't identified until a device, whose
      discard_granularity was 0 due to read_capacity_16 overflow, was consumed
      by blk_stack_limits() to construct limits for a higher-level DM
      multipath device.  The multipath device's resulting limits never had the
      discard limits stacked because blk_stack_limits() will only do so if
      the bottom device's discard_granularity != 0.  This resulted in the
      multipath device's limits.max_discard_sectors being 0.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      0f799603
  25. 07 5月, 2011 3 次提交
    • L
      blkdev: Do not return -EOPNOTSUPP if discard is supported · 8af1954d
      Lukas Czerner 提交于
      Currently we return -EOPNOTSUPP in blkdev_issue_discard() if any of the
      bio fails due to underlying device not supporting discard request.
      However, if the device is for example dm device composed of devices
      which some of them support discard and some of them does not, it is ok
      for some bios to fail with EOPNOTSUPP, but it does not mean that discard
      is not supported at all.
      
      This commit removes the check for bios failed with EOPNOTSUPP and change
      blkdev_issue_discard() to return operation not supported if and only if
      the device does not actually supports it, not just part of the device as
      some bios might indicate.
      
      This change also fixes problem with BLKDISCARD ioctl() which now works
      correctly on such dm devices.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      CC: Jens Axboe <jaxboe@fusionio.com>
      CC: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      8af1954d
    • L
      blkdev: Simple cleanup in blkdev_issue_zeroout() · 5baebe5c
      Lukas Czerner 提交于
      In blkdev_issue_zeroout() we are submitting regular WRITE bios, so we do
      not need to check for -EOPNOTSUPP specifically in case of error. Also
      there is no need to have label submit: because there is no way to jump
      out from the while cycle without an error and we really want to exit,
      rather than try again. And also remove the check for (sz == 0) since at
      that point sz can never be zero.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      CC: Dmitry Monakhov <dmonakhov@openvz.org>
      CC: Jens Axboe <jaxboe@fusionio.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      5baebe5c
    • L
      blkdev: Submit discard bio in batches in blkdev_issue_discard() · 5dba3089
      Lukas Czerner 提交于
      Currently we are waiting for every submitted REQ_DISCARD bio separately,
      but it can have unwanted consequences of repeatedly flushing the queue,
      so we rather submit bios in batches and wait for the entire batch, hence
      narrowing the window of other ios going in.
      
      Use bio_batch_end_io() and struct bio_batch for that purpose, the same
      is used by blkdev_issue_zeroout(). Also change bio_batch_end_io() so we
      always set !BIO_UPTODATE in the case of error and remove the check for
      bb, since we are the only user of this function and we always set this.
      
      Remove bio_get()/bio_put() from the blkdev_issue_discard() since
      bio_alloc() and bio_batch_end_io() is doing the same thing, hence it is
      not needed anymore.
      
      I have done simple dd testing with surprising results. The script I have
      used is:
      
      for i in $(seq 10); do
              echo $i
              dd if=/dev/sdb1 of=/dev/sdc1 bs=4k &
              sleep 5
      done
      /usr/bin/time -f %e ./blkdiscard /dev/sdc1
      
      Running time of BLKDISCARD on the whole device:
      with patch              without patch
      0.95                    15.58
      
      So we can see that in this artificial test the kernel with the patch
      applied is approx 16x faster in discarding the device.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      CC: Dmitry Monakhov <dmonakhov@openvz.org>
      CC: Jens Axboe <jaxboe@fusionio.com>
      CC: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      5dba3089
  26. 12 3月, 2011 1 次提交
  27. 11 3月, 2011 1 次提交
    • L
      block: fix mis-synchronisation in blkdev_issue_zeroout() · 0aeea189
      Lukas Czerner 提交于
      BZ29402
      https://bugzilla.kernel.org/show_bug.cgi?id=29402
      
      We can hit serious mis-synchronization in bio completion path of
      blkdev_issue_zeroout() leading to a panic.
      
      The problem is that when we are going to wait_for_completion() in
      blkdev_issue_zeroout() we check if the bb.done equals issued (number of
      submitted bios). If it does, we can skip the wait_for_completition()
      and just out of the function since there is nothing to wait for.
      However, there is a ordering problem because bio_batch_end_io() is
      calling atomic_inc(&bb->done) before complete(), hence it might seem to
      blkdev_issue_zeroout() that all bios has been completed and exit. At
      this point when bio_batch_end_io() is going to call complete(bb->wait),
      bb and wait does not longer exist since it was allocated on stack in
      blkdev_issue_zeroout() ==> panic!
      
      (thread 1)                      (thread 2)
      bio_batch_end_io()              blkdev_issue_zeroout()
        if(bb) {                      ...
          if (bb->end_io)             ...
            bb->end_io(bio, err);     ...
          atomic_inc(&bb->done);      ...
          ...                         while (issued != atomic_read(&bb.done))
          ...                         (let issued == bb.done)
          ...                         (do the rest of the function)
          ...                         return ret;
          complete(bb->wait);
          ^^^^^^^^
          panic
      
      We can fix this easily by simplifying bio_batch and completion counting.
      
      Also remove bio_end_io_t *end_io since it is not used.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Reported-by: NEric Whitney <eric.whitney@hp.com>
      Tested-by: NEric Whitney <eric.whitney@hp.com>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      CC: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      0aeea189
  28. 02 3月, 2011 1 次提交
  29. 17 9月, 2010 1 次提交
    • C
      block: remove BLKDEV_IFL_WAIT · dd3932ed
      Christoph Hellwig 提交于
      All the blkdev_issue_* helpers can only sanely be used for synchronous
      caller.  To issue cache flushes or barriers asynchronously the caller needs
      to set up a bio by itself with a completion callback to move the asynchronous
      state machine ahead.  So drop the BLKDEV_IFL_WAIT flag that is always
      specified when calling blkdev_issue_* and also remove the now unused flags
      argument to blkdev_issue_flush and blkdev_issue_zeroout.  For
      blkdev_issue_discard we need to keep it for the secure discard flag, which
      gains a more descriptive name and loses the bitops vs flag confusion.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      dd3932ed