1. 18 1月, 2017 1 次提交
  2. 09 12月, 2016 1 次提交
    • C
      block: improve handling of the magic discard payload · f9d03f96
      Christoph Hellwig 提交于
      Instead of allocating a single unused biovec for discard requests, send
      them down without any payload.  Instead we allow the driver to add a
      "special" payload using a biovec embedded into struct request (unioned
      over other fields never used while in the driver), and overloading
      the number of segments for this case.
      
      This has a couple of advantages:
      
       - we don't have to allocate the bio_vec
       - the amount of special casing for discard requests in the block
         layer is significantly reduced
       - using this same scheme for other request types is trivial,
         which will be important for implementing the new WRITE_ZEROES
         op on devices where it actually requires a payload (e.g. SCSI)
       - we can get rid of playing games with the request length, as
         we'll never touch it and completions will work just fine
       - it will allow us to support ranged discard operations in the
         future by merging non-contiguous discard bios into a single
         request
       - last but not least it removes a lot of code
      
      This patch is the common base for my WIP series for ranges discards and to
      remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
      so it would be good to get it in quickly.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f9d03f96
  3. 01 12月, 2016 2 次提交
  4. 28 10月, 2016 1 次提交
  5. 24 8月, 2016 1 次提交
    • M
      block: make sure a big bio is split into at most 256 bvecs · 4d70dca4
      Ming Lei 提交于
      After arbitrary bio size was introduced, the incoming bio may
      be very big. We have to split the bio into small bios so that
      each holds at most BIO_MAX_PAGES bvecs for safety reason, such
      as bio_clone().
      
      This patch fixes the following kernel crash:
      
      > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
      > [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
      > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
      > [  172.660399] Oops: 0000 [#1] SMP
      > [...]
      > [  172.664780] Call Trace:
      > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
      > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
      > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
      > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
      > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
      > [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
      > [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
      
      The issue can be reproduced by the following steps:
      	- create one raid1 over two virtio-blk
      	- build bcache device over the above raid1 and another cache device
      	and bucket size is set as 2Mbytes
      	- set cache mode as writeback
      	- run random write over ext4 on the bcache device
      
      Fixes: 54efd50b(block: make generic_make_request handle arbitrarily sized bios)
      Reported-by: NSebastian Roesner <sroesner-kernelorg@roesner-online.de>
      Reported-by: NEric Wheeler <bcache@lists.ewheeler.net>
      Cc: stable@vger.kernel.org (4.3+)
      Cc: Shaohua Li <shli@fb.com>
      Acked-by: NKent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4d70dca4
  6. 16 8月, 2016 1 次提交
  7. 08 8月, 2016 1 次提交
    • J
      block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe 提交于
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokeness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      1eff9d32
  8. 21 7月, 2016 2 次提交
    • D
      block: Fix front merge check · 17007f39
      Damien Le Moal 提交于
      For a front merge, the maximum number of sectors of the
      request must be checked against the front merge BIO sector,
      not the current sector of the request.
      Signed-off-by: NDamien Le Moal <damien.lemoal@hgst.com>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      17007f39
    • T
      block: do not merge requests without consulting with io scheduler · 72ef799b
      Tahsin Erdogan 提交于
      Before merging a bio into an existing request, io scheduler is called to
      get its approval first. However, the requests that come from a plug
      flush may get merged by block layer without consulting with io
      scheduler.
      
      In case of CFQ, this can cause fairness problems. For instance, if a
      request gets merged into a low weight cgroup's request, high weight cgroup
      now will depend on low weight cgroup to get scheduled. If high weigt cgroup
      needs that io request to complete before submitting more requests, then it
      will also lose its timeslice.
      
      Following script demonstrates the problem. Group g1 has a low weight, g2
      and g3 have equal high weights but g2's requests are adjacent to g1's
      requests so they are subject to merging. Due to these merges, g2 gets
      poor disk time allocation.
      
      cat > cfq-merge-repro.sh << "EOF"
      #!/bin/bash
      set -e
      
      IO_ROOT=/mnt-cgroup/io
      
      mkdir -p $IO_ROOT
      
      if ! mount | grep -qw $IO_ROOT; then
        mount -t cgroup none -oblkio $IO_ROOT
      fi
      
      cd $IO_ROOT
      
      for i in g1 g2 g3; do
        if [ -d $i ]; then
          rmdir $i
        fi
      done
      
      mkdir g1 && echo 10 > g1/blkio.weight
      mkdir g2 && echo 495 > g2/blkio.weight
      mkdir g3 && echo 495 > g3/blkio.weight
      
      RUNTIME=10
      
      (echo $BASHPID > g1/cgroup.procs &&
       fio --readonly --name name1 --filename /dev/sdb \
           --rw read --size 64k --bs 64k --time_based \
           --runtime=$RUNTIME --offset=0k &> /dev/null)&
      
      (echo $BASHPID > g2/cgroup.procs &&
       fio --readonly --name name1 --filename /dev/sdb \
           --rw read --size 64k --bs 64k --time_based \
           --runtime=$RUNTIME --offset=64k &> /dev/null)&
      
      (echo $BASHPID > g3/cgroup.procs &&
       fio --readonly --name name1 --filename /dev/sdb \
           --rw read --size 64k --bs 64k --time_based \
           --runtime=$RUNTIME --offset=256k &> /dev/null)&
      
      sleep $((RUNTIME+1))
      
      for i in g1 g2 g3; do
        echo ---- $i ----
        cat $i/blkio.time
      done
      
      EOF
      # ./cfq-merge-repro.sh
      ---- g1 ----
      8:16 162
      ---- g2 ----
      8:16 165
      ---- g3 ----
      8:16 686
      
      After applying the patch:
      
      # ./cfq-merge-repro.sh
      ---- g1 ----
      8:16 90
      ---- g2 ----
      8:16 445
      ---- g3 ----
      8:16 471
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      72ef799b
  9. 09 6月, 2016 1 次提交
  10. 08 6月, 2016 3 次提交
  11. 04 3月, 2016 1 次提交
  12. 23 1月, 2016 1 次提交
  13. 13 1月, 2016 1 次提交
    • K
      block: split bios to max possible length · e36f6204
      Keith Busch 提交于
      This splits bio in the middle of a vector to form the largest possible
      bio at the h/w's desired alignment, and guarantees the bio being split
      will have some data.
      
      The criteria for splitting is changed from the max sectors to the h/w's
      optimal sector alignment if it is provided. For h/w that advertise their
      block storage's underlying chunk size, it's a big performance win to not
      submit commands that cross them. If sector alignment is not provided,
      this patch uses the max sectors as before.
      
      This addresses the performance issue commit d3805611 attempted to
      fix, but was reverted due to splitting logic error.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: <stable@vger.kernel.org> # 4.4.x-
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e36f6204
  14. 09 1月, 2016 1 次提交
    • J
      Revert "block: Split bios on chunk boundaries" · 6126eb24
      Jens Axboe 提交于
      This reverts commit d3805611.
      
      If we end up splitting on the first segment, we don't adjust
      the sector count. That results in hitting a BUG() with attempting
      to split 0 sectors.
      
      As this is just a performance issue and not a regression since
      4.3 release, let's just rever this change. That gives us more
      time to test a real fix for 4.5, which would be marked for
      stable anyway.
      6126eb24
  15. 23 12月, 2015 1 次提交
  16. 04 12月, 2015 1 次提交
  17. 01 12月, 2015 1 次提交
  18. 24 11月, 2015 3 次提交
  19. 22 10月, 2015 2 次提交
  20. 17 9月, 2015 1 次提交
    • M
      block: blk-merge: fast-clone bio when splitting rw bios · 52cc6eea
      Ming Lei 提交于
      biovecs has become immutable since v3.13, so it isn't necessary
      to allocate biovecs for the new cloned bios, then we can save
      one extra biovecs allocation/copy, and the allocation is often
      not fixed-length and a bit more expensive.
      
      For example, if the 'max_sectors_kb' of null blk's queue is set
      as 16(32 sectors) via sysfs just for making more splits, this patch
      can increase throught about ~70% in the sequential read test over
      null_blk(direct io, bs: 1M).
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Ming Lin <ming.l@ssi.samsung.com>
      Cc: Dongsu Park <dpark@posteo.net>
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      
      This fixes a performance regression introduced by commit 54efd50b,
      and allows us to take full advantage of the fact that we have immutable
      bio_vecs. Hand applied, as it rejected violently with commit
      5014c311.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      52cc6eea
  21. 11 9月, 2015 1 次提交
  22. 04 9月, 2015 1 次提交
  23. 03 9月, 2015 1 次提交
  24. 02 9月, 2015 1 次提交
  25. 20 8月, 2015 1 次提交
  26. 17 8月, 2015 1 次提交
  27. 14 8月, 2015 2 次提交
    • K
      block: kill merge_bvec_fn() completely · 8ae12666
      Kent Overstreet 提交于
      As generic_make_request() is now able to handle arbitrarily sized bios,
      it's no longer necessary for each individual block driver to define its
      own ->merge_bvec_fn() callback. Remove every invocation completely.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: ceph-devel@vger.kernel.org
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
       dm-era-target, and resolve merge conflicts]
      Signed-off-by: NDongsu Park <dpark@posteo.net>
      Signed-off-by: NMing Lin <ming.l@ssi.samsung.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8ae12666
    • K
      block: make generic_make_request handle arbitrarily sized bios · 54efd50b
      Kent Overstreet 提交于
      The way the block layer is currently written, it goes to great lengths
      to avoid having to split bios; upper layer code (such as bio_add_page())
      checks what the underlying device can handle and tries to always create
      bios that don't need to be split.
      
      But this approach becomes unwieldy and eventually breaks down with
      stacked devices and devices with dynamic limits, and it adds a lot of
      complexity. If the block layer could split bios as needed, we could
      eliminate a lot of complexity elsewhere - particularly in stacked
      drivers. Code that creates bios can then create whatever size bios are
      convenient, and more importantly stacked drivers don't have to deal with
      both their own bio size limitations and the limitations of the
      (potentially multiple) devices underneath them.  In the future this will
      let us delete merge_bvec_fn and a bunch of other code.
      
      We do this by adding calls to blk_queue_split() to the various
      make_request functions that need it - a few can already handle arbitrary
      size bios. Note that we add the call _after_ any call to
      blk_queue_bounce(); this means that blk_queue_split() and
      blk_recalc_rq_segments() don't need to be concerned with bouncing
      affecting segment merging.
      
      Some make_request_fn() callbacks were simple enough to audit and verify
      they don't need blk_queue_split() calls. The skipped ones are:
      
       * nfhd_make_request (arch/m68k/emu/nfblock.c)
       * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
       * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
       * brd_make_request (ramdisk - drivers/block/brd.c)
       * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
       * loop_make_request
       * null_queue_bio
       * bcache's make_request fns
      
      Some others are almost certainly safe to remove now, but will be left
      for future patches.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Jim Paris <jim@jtan.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      [dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
      Signed-off-by: NDongsu Park <dpark@posteo.net>
      Signed-off-by: NMing Lin <ming.l@ssi.samsung.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      54efd50b
  28. 29 7月, 2015 1 次提交
    • J
      block: manipulate bio->bi_flags through helpers · b7c44ed9
      Jens Axboe 提交于
      Some places use helpers now, others don't. We only have the 'is set'
      helper, add helpers for setting and clearing flags too.
      
      It was a bit of a mess of atomic vs non-atomic access. With
      BIO_UPTODATE gone, we don't have any risk of concurrent access to the
      flags. So relax the restriction and don't make any of them atomic. The
      flags that do have serialization issues (reffed and chained), we
      already handle those separately.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b7c44ed9
  29. 30 5月, 2015 1 次提交
  30. 20 3月, 2015 1 次提交
  31. 12 2月, 2015 2 次提交