1. 30 May 2015, 10 commits
  2. 29 May 2015, 1 commit
    • dm: fix false warning in free_rq_clone() for unmapped requests · e5d8de32
      Committed by Mike Snitzer
      When stacking a request-based DM device on a non-blk-mq device and the
      device-mapper target cannot map the request (the error target is used,
      the multipath target has all paths down, etc.), the WARN_ON_ONCE() in
      free_rq_clone() triggers when it shouldn't.
      
      The warning was added by commit aa6df8dd ("dm: fix free_rq_clone() NULL
      pointer when requeueing unmapped request").  But free_rq_clone() with
      clone->q == NULL is valid usage for the case where
      dm_kill_unmapped_request() initiates request cleanup.
      
      Fix this false warning by simply removing the WARN_ON -- it only
      generated false positives and was never useful in catching the intended
      case (a completing clone request that was never mapped, i.e. clone->q
      being NULL).
      
      Fixes: aa6df8dd ("dm: fix free_rq_clone() NULL pointer when requeueing unmapped request")
      Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  3. 28 May 2015, 2 commits
    • dm: requeue from blk-mq dm_mq_queue_rq() using BLK_MQ_RQ_QUEUE_BUSY · 45714fbe
      Committed by Mike Snitzer
      Use BLK_MQ_RQ_QUEUE_BUSY to requeue a blk-mq request directly from the
      DM blk-mq device's .queue_rq.  This cleans up the previous convoluted
      handling of request requeueing, which returned BLK_MQ_RQ_QUEUE_OK (even
      though the request had not actually been queued) and then ran
      blk_mq_requeue_request() followed by blk_mq_kick_requeue_list().

      Also, document that DM blk-mq on top of old request_fn devices cannot
      fail in clone_rq(), since the clone request is preallocated as part of
      the pdu.
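
      A hedged sketch of that pattern (illustrative names such as
      my_map_request(), not the actual dm.c code):

      /*
       * Returning BLK_MQ_RQ_QUEUE_BUSY tells blk-mq to requeue the request
       * and retry it later, instead of the driver requeueing it by hand.
       */
      static int my_queue_rq(struct blk_mq_hw_ctx *hctx,
                             const struct blk_mq_queue_data *bd)
      {
              struct request *rq = bd->rq;

              if (my_map_request(rq) < 0)
                      /* cannot map right now (e.g. no usable path) */
                      return BLK_MQ_RQ_QUEUE_BUSY;

              return BLK_MQ_RQ_QUEUE_OK;
      }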
      Reported-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm mpath: fix leak of dm_mpath_io structure in blk-mq .queue_rq error path · 4c6dd53d
      Committed by Mike Snitzer
      Otherwise kmemleak reported:
      
      unreferenced object 0xffff88009b14e2b0 (size 16):
        comm "fio", pid 4274, jiffies 4294978034 (age 1253.210s)
        hex dump (first 16 bytes):
          40 12 f3 99 01 88 ff ff 00 10 00 00 00 00 00 00  @...............
        backtrace:
          [<ffffffff81600029>] kmemleak_alloc+0x49/0xb0
          [<ffffffff811679a8>] kmem_cache_alloc+0xf8/0x160
          [<ffffffff8111c950>] mempool_alloc_slab+0x10/0x20
          [<ffffffff8111cb37>] mempool_alloc+0x57/0x150
          [<ffffffffa04d2b61>] __multipath_map.isra.17+0xe1/0x220 [dm_multipath]
          [<ffffffffa04d2cb5>] multipath_clone_and_map+0x15/0x20 [dm_multipath]
          [<ffffffffa02889b5>] map_request.isra.39+0xd5/0x220 [dm_mod]
          [<ffffffffa028b0e4>] dm_mq_queue_rq+0x134/0x240 [dm_mod]
          [<ffffffff812cccb5>] __blk_mq_run_hw_queue+0x1d5/0x380
          [<ffffffff812ccaa5>] blk_mq_run_hw_queue+0xc5/0x100
          [<ffffffff812ce350>] blk_sq_make_request+0x240/0x300
          [<ffffffff812c0f30>] generic_make_request+0xc0/0x110
          [<ffffffff812c0ff2>] submit_bio+0x72/0x150
          [<ffffffff811c07cb>] do_blockdev_direct_IO+0x1f3b/0x2da0
          [<ffffffff811c166e>] __blockdev_direct_IO+0x3e/0x40
          [<ffffffff8120aa1a>] ext4_direct_IO+0x1aa/0x390
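
      The leaked object is the per-request struct dm_mpath_io taken from the
      multipath mempool in __multipath_map().  A hedged sketch of the kind of
      error-path handling the subject line describes (illustrative names, not
      the actual dm-mpath code):

      static int example_clone_and_map(struct dm_target *ti, struct request *rq)
      {
              struct my_mpath_io *mpio = mempool_alloc(example_io_pool, GFP_ATOMIC);

              if (!mpio)
                      return DM_MAPIO_REQUEUE;

              if (example_setup_clone(rq, mpio) < 0) {
                      /* every error exit after the allocation must give the
                       * object back to the mempool */
                      mempool_free(mpio, example_io_pool);
                      return DM_MAPIO_REQUEUE;
              }

              return DM_MAPIO_REMAPPED;
      }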
      
      Fixes: e5863d9a ("dm: allocate requests in target when stacking on blk-mq devices")
      Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.0+
  4. 27 May 2015, 1 commit
    • dm: fix NULL pointer when clone_and_map_rq returns !DM_MAPIO_REMAPPED · 3a140755
      Committed by Junichi Nomura
      When stacking request-based DM on blk_mq device, request cloning and
      remapping are done in a single call to target's clone_and_map_rq().
      The clone is allocated and valid only if clone_and_map_rq() returns
      DM_MAPIO_REMAPPED.
      
      The "IS_ERR(clone)" check in map_request() does not cover all the
      !DM_MAPIO_REMAPPED cases that are possible (E.g. if underlying devices
      are not ready or unavailable, clone_and_map_rq() may return
      DM_MAPIO_REQUEUE without ever having established an ERR_PTR).  Fix this
      by explicitly checking for a return that is not DM_MAPIO_REMAPPED in
      map_request().
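
      A hedged sketch of that check (simplified, not the exact map_request()
      code):

      r = ti->type->clone_and_map_rq(ti, rq, &tio->info, &clone);
      if (r != DM_MAPIO_REMAPPED) {
              /* clone may be NULL or an ERR_PTR() here -- do not touch it;
               * requeue or fail the original request instead */
              return r;
      }
      setup_clone(clone, rq, tio);    /* only reached with a valid clone */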
      
      Without this fix, DM core may call setup_clone() for a NULL clone
      and oops like this:
      
         BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
         IP: [<ffffffff81227525>] blk_rq_prep_clone+0x7d/0x137
         ...
         CPU: 2 PID: 5793 Comm: kdmwork-253:3 Not tainted 4.0.0-nm #1
         ...
         Call Trace:
          [<ffffffffa01d1c09>] map_tio_request+0xa9/0x258 [dm_mod]
          [<ffffffff81071de9>] kthread_worker_fn+0xfd/0x150
          [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24
          [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24
          [<ffffffff81071fdd>] kthread+0xe6/0xee
          [<ffffffff81093a59>] ? put_lock_stats+0xe/0x20
          [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b
          [<ffffffff814c2d98>] ret_from_fork+0x58/0x90
          [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b
      
      Fixes: e5863d9a ("dm: allocate requests in target when stacking on blk-mq devices")
      Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.0+
  5. 26 May 2015, 1 commit
  6. 22 May 2015, 2 commits
    • block, dm: don't copy bios for request clones · 5f1b670d
      Committed by Christoph Hellwig
      Currently dm-multipath has to clone the bios for every request sent
      to the lower devices, which wastes cpu cycles and ties down memory.
      
      This patch instead adds a new REQ_CLONE flag that instructs
      req_bio_endio() not to complete the bios attached to a request; the
      flag is set on clone requests, similar to what is done for bios in a
      flush sequence.  With this change, I/O errors on a path failure only
      get propagated to dm-multipath, which can then either resubmit the I/O
      or complete the bios on the original request.
      
      I've done some basic testing of this on a Linux target with ALUA support,
      and it survives path failures during I/O nicely.
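
      A hedged sketch of the idea (simplified, not the exact block-layer
      code; the 4.1-era bio_endio(bio, error) signature is assumed):

      static void req_bio_endio_sketch(struct request *rq, struct bio *bio,
                                       unsigned int nbytes, int error)
      {
              bio_advance(bio, nbytes);

              /* don't finish the bio if it belongs to a flush sequence or to
               * a cloned request -- the original request owns it */
              if (bio->bi_iter.bi_size == 0 &&
                  !(rq->cmd_flags & (REQ_FLUSH_SEQ | REQ_CLONE)))
                      bio_endio(bio, error);
      }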
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: remove management of bi_remaining when restoring original bi_end_io · 326e1dbb
      Committed by Mike Snitzer
      Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
      non-chains") regressed all existing callers that followed this pattern:
       1) saving a bio's original bi_end_io
       2) wiring up an intermediate bi_end_io
       3) restoring the original bi_end_io from intermediate bi_end_io
       4) calling bio_endio() to execute the restored original bi_end_io
      
      The regression was due to BIO_CHAIN only ever getting set if
      bio_inc_remaining() is called.  For the above pattern it isn't set until
      step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
      the first bio_endio(), in step 2 above, never decremented __bi_remaining
      before calling the intermediate bi_end_io -- leaving __bi_remaining with
      the value 1 instead of 0.  When bio_inc_remaining() occurred during step
      3 it brought it to a value of 2.  When the second bio_endio() was
      called, in step 4 above, it should've called the original bi_end_io but
      it didn't because there was an extra reference that wasn't dropped (due
      to atomic operations being optimized away since BIO_CHAIN wasn't set
      upfront).
      
      Fix this issue by removing the __bi_remaining management complexity for
      all callers that use the above pattern -- bio_chain() is the only
      interface that _needs_ to be concerned with __bi_remaining.  For the
      above pattern callers just expect the bi_end_io they set to get called!
      Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
      that aren't associated with the bio_chain() interface.
      
      Also, the bio_inc_remaining() interface has been moved local to bio.c.
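
      For reference, a hedged sketch of the save/restore pattern from steps
      1-4 above (illustrative names; the 4.1-era bio_endio(bio, error)
      signature is assumed):

      struct endio_hook {
              bio_end_io_t *orig_end_io;      /* step 1: saved original */
              void         *orig_private;
      };

      static void intermediate_end_io(struct bio *bio, int error)
      {
              struct endio_hook *h = bio->bi_private;

              bio->bi_end_io  = h->orig_end_io;       /* step 3: restore */
              bio->bi_private = h->orig_private;
              kfree(h);
              bio_endio(bio, error);                  /* step 4: run the original */
      }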
      
      Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  7. 21 May 2015, 3 commits
  8. 08 May 2015, 7 commits
    • md/raid5: fix handling of degraded stripes in batches. · bb27051f
      Committed by NeilBrown
      There is no need for special handling of stripe-batches when the array
      is degraded.
      
      There may be a need for it if there is a failure in the batch, but
      STRIPE_DEGRADED does not imply an error.

      So don't set STRIPE_BATCH_ERR in ops_run_io just because the array is
      degraded.
      Doing so actually causes a bug: the STRIPE_DEGRADED flag gets cleared
      in check_break_stripe_batch_list() and so the bitmap bit gets cleared
      when it shouldn't.
      
      So in check_break_stripe_batch_list(), split the batch up completely -
      again STRIPE_DEGRADED isn't meaningful.
      
      Also don't set STRIPE_BATCH_ERR when there is a write error to a
      replacement device.  This simply removes the replacement device and
      requires no extra handling.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: fix allocation of 'scribble' array. · 738a2738
      Committed by NeilBrown
      As the new 'scribble' array is sized based on chunk size,
      we need to make sure the size matches the largest of 'old'
      and 'new' chunk sizes when the array is undergoing reshape.
      
      We also potentially need to resize it even when not resizing
      the stripe cache, as chunk size can change without changing
      number of devices.
      
      So move the 'resize' code into a separate function, and
      consider old and new sizes when allocating.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Fixes: 46d5b785 ("raid5: use flex_array for scribble data")
    • md/raid5: don't record new size if resize_stripes fails. · 6e9eac2d
      Committed by NeilBrown
      If any memory allocation in resize_stripes fails we will return
      -ENOMEM, but in some cases we update conf->pool_size anyway.
      
      This means that if we try again, the allocations will be assumed
      to be larger than they are, and badness results.
      
      So only update pool_size if there is no error.
      
      This bug was introduced in 2.6.17 and the patch is suitable for
      -stable.
      
      Fixes: ad01c9e3 ("[PATCH] md: Allow stripes to be expanded in preparation for expanding an array")
      Cc: stable@vger.kernel.org (v2.6.17+)
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: avoid reading parity blocks for full-stripe write to degraded array · 10d82c5f
      Committed by NeilBrown
      When performing a reconstruct write, we need to read all blocks that
      are not being over-written -- except the parity (P and Q) blocks.

      The code currently reads these unnecessarily (as they are not being
      over-written!).
      Signed-off-by: NeilBrown <neilb@suse.de>
      Fixes: ea664c82 ("md/raid5: need_this_block: tidy/fix last condition.")
    • md/raid5: more incorrect BUG_ON in handle_stripe_fill. · b0c783b3
      Committed by NeilBrown
      It is not incorrect to call handle_stripe_fill() when
      a batch of full-stripe writes is active.
      It is, however, a BUG if fetch_block() then decides
      it needs to actually fetch anything.
      
      So move the 'BUG_ON' to where it belongs.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Fixes: 59fc630b ("RAID5: batch adjacent full stripe write")
    • md/raid5: new alloc_stripe() to allocate and initialize a stripe. · f18c1a35
      Committed by NeilBrown
      The new batch_lock and batch_list fields are being initialized in
      grow_one_stripe() but not in resize_stripes().  This causes a crash
      on resize.
      
      So separate the core initialization into a new function and call it
      from both allocation sites.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Fixes: 59fc630b ("RAID5: batch adjacent full stripe write")
    • md-raid0: conditional mddev->queue access to suit dm-raid · b6538fe3
      Committed by Heinz Mauelshagen
      This patch is a prerequisite for dm-raid "raid0" support: it lets
      dm-raid use the MD RAID0 personality, which otherwise accesses
      mddev->queue unconditionally -- and mddev->queue is NULL when dm-raid
      is stacked on top of MD.

      Most of the conditional mddev->queue accesses made it upstream, but
      this one was missed; it prevents md raid0 from setting disk stack
      limits (dm core does that when md sits underneath dm).
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  9. 06 May 2015, 3 commits
    • bio: skip atomic inc/dec of ->bi_cnt for most use cases · dac56212
      Committed by Jens Axboe
      Struct bio has a reference count that controls when it can be freed.
      The most common use case is allocating the bio (which returns it with a
      single reference), doing IO, and then dropping that single reference.
      We can remove the atomic_dec_and_test() in the completion path if
      nobody else is holding a reference to the bio.

      If someone does call bio_get() on the bio, then we flag the bio as now
      having a valid count, and the reference count must be properly honored
      when it is put.
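
      A hedged sketch of the scheme (simplified, with an illustrative flag
      name and free helper -- not the exact bio.c code):

      static inline void sketch_bio_get(struct bio *bio)
      {
              bio->bi_flags |= (1 << BIO_HAS_REF);    /* illustrative flag */
              smp_mb__before_atomic();
              atomic_inc(&bio->bi_cnt);
      }

      static void sketch_bio_put(struct bio *bio)
      {
              if (!(bio->bi_flags & (1 << BIO_HAS_REF)))
                      sketch_bio_free(bio);           /* common case: no atomics */
              else if (atomic_dec_and_test(&bio->bi_cnt))
                      sketch_bio_free(bio);
      }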
      Tested-by: Robert Elliott <elliott@hp.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • bio: skip atomic inc/dec of ->bi_remaining for non-chains · c4cf5261
      Committed by Jens Axboe
      Struct bio has an atomic ref count for chained bios, and we use this to
      know when to end IO on the bio.  However, most bios are not chained, so
      we don't need to always incur this atomic operation as part of ending
      IO.
      
      Add a helper to elevate the bi_remaining count, and flag the bio as
      now actually needing the decrement at end_io time. Rename the field
      to __bi_remaining to catch any current users of this doing the
      incrementing manually.
      
      For high IOPS workloads, this reduces the overhead of bio_endio()
      substantially.
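
      A hedged sketch of the helper (simplified, not the exact bio code; the
      flag corresponds to the BIO_CHAIN flag discussed in the follow-up fix
      above):

      static inline void bio_inc_remaining_sketch(struct bio *bio)
      {
              /* mark the bio as chained so bio_endio() knows it must pay for
               * the atomic decrement of __bi_remaining */
              bio->bi_flags |= (1 << BIO_CHAIN);
              smp_mb__before_atomic();
              atomic_inc(&bio->__bi_remaining);
      }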
      Tested-by: Robert Elliott <elliott@hp.com>
      Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • Revert "dm crypt: fix deadlock when async crypto algorithm returns -EBUSY" · c0403ec0
      Committed by Rabin Vincent
      This reverts Linux 4.1-rc1 commit 0618764c.

      The problem which that commit attempts to fix actually lies in the
      Freescale CAAM crypto driver, not in dm-crypt.

      dm-crypt uses CRYPTO_TFM_REQ_MAY_BACKLOG.  This means the crypto
      driver should internally backlog requests which arrive when the queue
      is full and process them later.  Until the crypto hw's queue becomes
      full, the driver returns -EINPROGRESS.  When the crypto hw's queue is
      full, the driver returns -EBUSY, and if CRYPTO_TFM_REQ_MAY_BACKLOG is
      set, it is expected to backlog the request and process it when the
      hardware has queue space.  At the point when the driver takes the
      request from the backlog and starts processing it, it calls the
      completion function with a status of -EINPROGRESS.  The completion
      function is called (for a second time, in the case of backlogged
      requests) with a status/err of 0 when a request is done.
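
      A hedged sketch of that contract from the submitter's side
      (illustrative names; this is not the dm-crypt code itself):

      switch (crypto_ablkcipher_encrypt(req)) {
      case 0:
              break;          /* completed synchronously */
      case -EINPROGRESS:
              break;          /* accepted by the hw queue, submit the next one */
      case -EBUSY:
              /* backlogged: the driver's completion callback will fire with
               * -EINPROGRESS once it actually starts this request */
              wait_for_completion(&ctx->restart);
              reinit_completion(&ctx->restart);
              break;
      default:
              break;          /* real error */
      }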
      
      Crypto drivers for hardware without hardware queueing use the
      crypto_init_queue(), crypto_enqueue_request(), crypto_dequeue_request()
      and crypto_get_backlog() helpers to implement this behaviour correctly,
      while others implement this behaviour without these helpers (ccp, for
      example).
      
      dm-crypt (before the patch that needs reverting) uses this API
      correctly.  It queues up as many requests as the hw queues will allow
      (i.e. as long as it gets back -EINPROGRESS from the request function).
      Then, when it sees at least one backlogged request (gets -EBUSY), it
      waits till that backlogged request is handled (completion gets called
      with -EINPROGRESS), and then continues.  The references to
      af_alg_wait_for_completion() and af_alg_complete() in that commit's
      commit message are irrelevant because those functions only handle one
      request at a time, unlike dm-crypt.
      
      The problem is that the Freescale CAAM driver, which that commit
      describes as having been tested with, fails to implement the
      backlogging behaviour correctly.  In caam_jr_enqueue(), if the hardware
      queue is full, it simply returns -EBUSY without backlogging the
      request.  The observed deadlock is not described in the commit message,
      but it is obviously the wait_for_completion() in crypt_convert() where
      dm-crypt would wait for the completion to be called with -EINPROGRESS
      in the case of backlogged requests.  This completion will never be
      completed due to the bug in the CAAM driver.
      
      Commit 0618764c incorrectly made dm-crypt wait for every request,
      even when the driver/hardware queues are not full, which means that
      dm-crypt will never see -EBUSY.  This means that that commit will cause
      a performance regression on all crypto drivers which implement the API
      correctly.
      
      Revert it.  Correct backlog handling should be implemented in the CAAM
      driver instead.
      
      Cc'ing stable purely because commit 0618764c did.  If for some reason
      a stable@ kernel did pick up commit 0618764c it should get reverted.
      Signed-off-by: Rabin Vincent <rabin.vincent@axis.com>
      Reviewed-by: Horia Geanta <horia.geanta@freescale.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  10. 30 April 2015, 2 commits
    • dm: fix free_rq_clone() NULL pointer when requeueing unmapped request · aa6df8dd
      Committed by Mike Snitzer
      Commit 02233342 ("dm: optimize dm_mq_queue_rq to _not_ use kthread if
      using pure blk-mq") mistakenly removed free_rq_clone()'s clone->q check
      before testing clone->q->mq_ops.  It was an oversight to discontinue
      that check for 1 of the 2 use-cases for free_rq_clone():
      1) free_rq_clone() called when an unmapped original request is requeued
      2) free_rq_clone() called in the request-based IO completion path
      
      The clone->q check made sense for case #1 but not for #2.  However, we
      cannot just reinstate the check as it'd mask a serious bug in the IO
      completion case #2 -- no in-flight request should have an uninitialized
      request_queue (basic block layer refcounting _should_ ensure this).
      
      The NULL pointer seen for case #1 is detailed here:
      https://www.redhat.com/archives/dm-devel/2015-April/msg00160.html
      
      Fix this free_rq_clone() NULL pointer by simply checking if the
      mapped_device's type is DM_TYPE_MQ_REQUEST_BASED (clone's queue is
      blk-mq) rather than checking clone->q->mq_ops.  This avoids the need to
      dereference clone->q, but a WARN_ON_ONCE is added to let us know if an
      uninitialized clone request is being completed.
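
      A hedged sketch of the resulting check (simplified; helper names may
      differ from the actual dm.c code).  The decision "is this clone
      blk-mq?" is taken from the mapped_device's type instead of from
      clone->q, which may be NULL for an unmapped, requeued original request:

      static void free_rq_clone_sketch(struct request *clone)
      {
              struct dm_rq_target_io *tio = clone->end_io_data;
              struct mapped_device *md = tio->md;

              WARN_ON_ONCE(!clone->q);   /* in-flight clones must have a queue */
              blk_rq_unprep_clone(clone);

              if (dm_get_md_type(md) == DM_TYPE_MQ_REQUEST_BASED)
                      /* clone allocated by the target (blk-mq underneath) */
                      tio->ti->type->release_clone_rq(clone);
              else
                      free_clone_request(md, clone);

              free_rq_tio(tio);
      }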
      Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: only initialize the request_queue once · 3e6180f0
      Committed by Christoph Hellwig
      Commit bfebd1cd ("dm: add full blk-mq support to request-based DM")
      didn't properly account for the need to short-circuit re-initializing
      DM's blk-mq request_queue if it was already initialized.
      
      Otherwise, reloading a blk-mq request-based DM table (either manually
      or via multipathd) resulted in errors, see:
       https://www.redhat.com/archives/dm-devel/2015-April/msg00132.html
      
      Fix is to only initialize the request_queue on the initial table load
      (when the mapped_device type is assigned).
      
      This is better than having dm_init_request_based_blk_mq_queue() return
      early if the queue was already initialized because it elevates the
      constraint to a more meaningful location in DM core.  As such the
      pre-existing early return in dm_init_request_based_queue() can now be
      removed.
      
      Fixes: bfebd1cd ("dm: add full blk-mq support to request-based DM")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  11. 28 April 2015, 1 commit
    • block: destroy bdi before blockdev is unregistered. · 6cd18e71
      Committed by NeilBrown
      Because of the peculiar way that md devices are created (automatically
      when the device node is opened), a new device can be created and
      registered immediately after the
      	blk_unregister_region(disk_devt(disk), disk->minors);
      call in del_gendisk().
      
      Therefore it is important that all visible artifacts of the previous
      device are removed before this call.  In particular, the 'bdi'.
      
      Since commit c4db59d3 ("fs: don't reassign dirty inodes to
      default_backing_dev_info") moved the device_unregister(bdi->dev) call
      from bdi_unregister() to bdi_destroy(), it has been quite easy to lose
      a race and have a new (e.g.) "md127" be created after the
      blk_unregister_region() call and before bdi_destroy() is ultimately
      called by the final 'put_disk', which must come after del_gendisk().
      
      The new device finds that the bdi name is already registered in sysfs
      and complains
      
      > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70()
      > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127'
      
      We can fix this by moving the bdi_destroy() call out of
      blk_release_queue() (which can happen very late when a refcount
      reaches zero) and into blk_cleanup_queue() - which happens exactly when the md
      device driver calls it.
      
      Then it is only necessary for md to call blk_cleanup_queue() before
      del_gendisk().  As loop.c devices are also created on demand by
      opening the device node, we make the same change there.
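
      A hedged sketch of the ordering change (simplified, not the full
      blk_cleanup_queue()):

      void blk_cleanup_queue_sketch(struct request_queue *q)
      {
              /* ... mark the queue dying, drain outstanding requests ... */

              /* tear the bdi down here, which md calls before del_gendisk(),
               * rather than in blk_release_queue(), which may run much later */
              bdi_destroy(&q->backing_dev_info);

              blk_put_queue(q);
      }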
      
      Fixes: c4db59d3 ("fs: don't reassign dirty inodes to default_backing_dev_info")
      Reported-by: Azat Khuzhin <a3at.mail@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org (v4.0)
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  12. 22 April 2015, 7 commits
    • md/raid5: don't do chunk aligned read on degraded array. · 9ffc8f7c
      Committed by Eric Mei
      When the array is degraded, a read whose data lands on a failed drive
      results in reading the rest of the data in that stripe.  So a single
      sequential read would end up reading the same data twice.

      This patch avoids chunk-aligned reads for degraded arrays.  The
      downside is that it involves the stripe cache, which means associated
      CPU overhead and an extra memory copy.
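
      A hedged sketch of the rule (not the actual raid5.c condition): only
      take the chunk-aligned read fast path when no disk is missing,
      otherwise fall through to the stripe cache.

      if (rw == READ && !conf->degraded && chunk_aligned_read(mddev, bi))
              return;   /* fast path: bio remapped straight to the member disk */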
      
      Test results:
      The following tests were done on an enterprise storage node with
      Seagate 6T SAS drives and a Xeon E5-2648L CPU (10 cores, 1.9GHz),
      10-disk MD RAID6 8+2, chunk size 128 KiB.

      I used FIO with direct-io, various bs sizes and enough queue depth, and
      tested sequential and 100% random reads against 3 array configs:
       1) optimal, as baseline;
       2) degraded;
       3) degraded with this patch.
      The kernel version is 4.0-rc3.

      Each individual test was only run once, so there might be some
      variation, but the focus is on the big trend.
      
      Sequential Read:
        bs=(KiB)  optimal(MiB/s)  degraded(MiB/s)  degraded-with-patch (MiB/s)
         1024       1608            656              995
          512       1624            710              956
          256       1635            728              980
          128       1636            771              983
           64       1612           1119             1000
           32       1580           1420             1004
           16       1368            688              986
            8        768            647              953
            4        411            413              850
      
      Random Read:
        bs=(KiB)  optimal(IOPS)  degraded(IOPS)  degraded-with-patch (IOPS)
         1024        163            160              156
          512        274            273              272
          256        426            428              424
          128        576            592              591
           64        726            724              726
           32        849            848              837
           16        900            970              971
            8        927            940              929
            4        948            940              955
      
      Some notes:
        * In sequential + optimal, as the bs size gets smaller, the FIO
      thread becomes CPU bound.
        * In sequential + degraded, there is a big increase when bs is 64K
      and 32K; I don't have an explanation.
        * In sequential + degraded-with-patch, the MD thread mostly becomes
      CPU bound.
      
      If you want, we can discuss specific data points in those results.  But
      in general it seems that with this patch we have more predictable and
      in most cases significantly better sequential read performance when the
      array is degraded, and almost no noticeable impact on random reads.

      Performance is a complicated thing; the patch works well for this
      particular configuration, but may not be universal.  For example, I
      imagine testing on an all-SSD array may give very different results.
      But I personally think that in most cases IO bandwidth is a more scarce
      resource than CPU.
      Signed-off-by: Eric Mei <eric.mei@seagate.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: allow the stripe_cache to grow and shrink. · edbe83ab
      Committed by NeilBrown
      The default setting of 256 stripe_heads is probably much too small for
      many configurations.  So it is best to make it auto-configure.

      Shrinking the cache under memory pressure is easy.  The only
      interesting part here is that we put a fairly high cost ('seeks') on
      shrinking the cache, as the cost is greater than just having to read
      more data: it reduces parallelism.
      
      Growing the cache on demand needs to be done carefully.  If we allow
      fast growth, that can upset memory balance as lots of dirty memory can
      quickly turn into lots of memory queued in the stripe_cache.
      It is important for the raid5 block device to appear congested to
      allow write-throttling to work.
      
      So we only add stripes slowly. We set a flag when an allocation
      fails because all stripes are in use, allocate at a convenient
      time when that flag is set, and don't allow it to be set again
      until at least one stripe_head has been released for re-use.
      
      This means that a spurt of requests will only cause one stripe_head
      to be allocated, but a steady stream of requests will slowly
      increase the cache size - until memory pressure puts it back again.
      
      It could take hours to reach a steady state.
      
      The value written to, and displayed in, stripe_cache_size is
      used as a minimum.  The cache can grow above this and shrink back
      down to it.  The actual size is not directly visible, though it can
      be deduced to some extent by watching stripe_cache_active.
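
      A hedged sketch of the shrink side of the scheme (4.1-era shrinker
      callbacks; illustrative names, not the actual raid5.c code).  The grow
      side is driven separately by the "allocation failed" flag described
      above:

      static unsigned long stripe_cache_scan_sketch(struct shrinker *shrink,
                                                    struct shrink_control *sc)
      {
              struct r5conf *conf = container_of(shrink, struct r5conf, shrinker);
              unsigned long freed = 0;

              /* never shrink below the minimum set via stripe_cache_size */
              while (freed < sc->nr_to_scan &&
                     conf->max_nr_stripes > conf->min_nr_stripes &&
                     drop_one_stripe(conf))
                      freed++;

              return freed;
      }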
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: change ->inactive_blocked to a bit-flag. · 5423399a
      Committed by NeilBrown
      This allows us to easily add more (atomic) flags.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe · 486f0644
      Committed by NeilBrown
      Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe()
      succeeds, do it inside the functions.
      
      Also choose the correct hash to handle next inside the functions.
      
      This removes duplication and will help with future new uses of
      {grow,drop}_one_stripe.
      
      This also fixes a minor bug where the "md/raid:%md: allocate XXkB"
      message always said "0kB".
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: pass gfp_t arg to grow_one_stripe() · a9683a79
      Committed by NeilBrown
      This is needed for future improvement to stripe cache management.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: introduce configuration option rmw_level · d06f191f
      Committed by Markus Stockhausen
      Depending on the available coding we allow optimized rmw logic for
      write operations.  To support easier testing, this patch allows manual
      control of the rmw/rcw decision through the interface
      /sys/block/mdX/md/rmw_level.
      
      The configuration can handle three levels of control.
      
      rmw_level=0: Disable rmw for all RAID types.  Hardware-assisted P/Q
      calculation has no implementation path yet to factor in/out chunks of
      a syndrome.  Enforcing this level can be beneficial for slow CPUs with
      hardware syndrome support and fast SSDs.
      
      rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will
      save IOs. This equals the "old" unpatched behaviour and will be the
      default.
      
      rmw_level=2: Execute rmw even if the calculated IOs for rmw and rcw are
      equal.  We might have higher CPU consumption because of calculating the
      parity twice, but it can be beneficial otherwise, e.g. RAID4 with a
      fast dedicated parity disk/SSD.  The option is implemented just to be
      forward-looking and will ONLY work with this patch!
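
      A hedged sketch of the three policies (illustrative structure and
      names, not the actual handle_stripe_dirtying() code; 'rmw' and 'rcw'
      are the estimated IO counts of the two strategies):

      static bool prefer_rmw(int rmw_level, int rmw, int rcw)
      {
              switch (rmw_level) {
              case 0:                 /* never use rmw */
                      return false;
              case 1:                 /* default: rmw only if it saves IOs */
                      return rmw < rcw;
              case 2:                 /* rmw also when the IO counts tie */
                      return rmw <= rcw;
              default:
                      return rmw < rcw;
              }
      }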
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: activate raid6 rmw feature · 584acdd4
      Committed by Markus Stockhausen
      Glue it all together.  The raid6 rmw path should work the same as the
      already existing raid5 logic.  So emulate the prexor handling/flags and
      split functions as needed.
      
      1) Enable xor_syndrome() in the async layer.
      
      2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
      at the start of a rmw run as we did it before for the single parity.
      
      3) Take care of rmw run in ops_run_reconstruct6(). Again process only
      the changed pages to get syndrome back into sync.
      
      4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
      run. The lower layers will calculate start & end pages from that and
      call the xor_syndrome() correspondingly.
      
      5) Adapt the several places where we ignored Q handling up to now.
      
      Performance numbers for a single E5630 system with a mix of 10 7200k
      desktop/server disks.  300 seconds of random writes with 8 threads onto
      a 3.2TB (10*400GB) RAID6 with a 64K chunk and no spare
      (group_thread_cnt=4):
      
      bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
              skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
         4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
         8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
        16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
        32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
        64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
       128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
       256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
       512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 Kb/s
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>