1. 05 Aug 2014: 2 commits
  2. 12 Jun 2014: 2 commits
  3. 10 Jun 2014: 1 commit
    • raid5: speedup sync_request processing · 053f5b65
      By Eivind Sarto
      The raid5 sync_request() processing calls handle_stripe() within the context of
      the resync-thread.  The resync-thread issues the first set of read requests
      and this adds execution latency and slows down the scheduling of the next
      sync_request().
      The current rebuild/resync speed of raid5 is not much faster than what
      rotational HDDs can sustain.
      Testing the following patch on a 6-drive array, I can increase the rebuild
      speed from 100 MB/s to 175 MB/s.
      The sync_request() now just sets STRIPE_HANDLE and releases the stripe.  This
      creates some more parallelism between the resync-thread and raid5 kernel daemon.
      Signed-off-by: Eivind Sarto <esarto@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      053f5b65
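      A minimal sketch of the change described above (not the actual diff; the
      surrounding resync loop, locking and error handling are elided, and the
      names follow drivers/md/raid5.c of that era):

        /* Before, the resync thread handled the stripe inline:
         *     handle_stripe(sh);
         *     release_stripe(sh);
         * After, it only marks the stripe and defers the work to raid5d,
         * so the next sync_request() can be scheduled sooner. */
        set_bit(STRIPE_HANDLE, &sh->state);
        release_stripe(sh);
        return STRIPE_SECTORS;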
  4. 05 Jun 2014: 1 commit
    • md/raid5: deadlock between retry_aligned_read with barrier io · 2844dc32
      By Hui Jiao
      A chunk-aligned read increases the counter active_aligned_reads and
      decreases it after the sub-device handles it successfully. But when a read
      error occurs, the read is redispatched by raid5d, and
      active_aligned_reads will not be decreased until we can grab a stripe
      head in retry_aligned_read. Now suppose a barrier IO comes in, sets
      conf->quiesce to 2, and waits until both active_stripes and
      active_aligned_reads are zero. The retried chunk-aligned read gets
      stuck at get_active_stripe, waiting until conf->quiesce becomes 0.
      retry_aligned_read and the barrier IO are now waiting for each other.
      One possible solution is to ignore conf->quiesce and let the retried
      aligned read finish. I reproduced this deadlock and tested this patch on
      CentOS 6.0.
      Signed-off-by: NeilBrown <neilb@suse.de>
      2844dc32
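      A hedged sketch of the idea (the argument names for get_active_stripe()
      are recalled from the raid5 code of that era and are illustrative): the
      stripe lookup done for the retried aligned read must not block on
      conf->quiesce, otherwise it deadlocks with the barrier that is waiting
      for active_aligned_reads to reach zero.

        /* retry_aligned_read(): pass noquiesce so the retried read can make
         * progress even while a barrier holds conf->quiesce at 2. */
        sh = get_active_stripe(conf, sector, 0 /* previous */,
                               1 /* noblock */, 1 /* noquiesce */);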
  5. 04 Jun 2014: 9 commits
  6. 29 May 2014: 7 commits
    • raid5: add an option to avoid copy data from bio to stripe cache · d592a996
      By Shaohua Li
      The stripe cache has two goals:
      1. cache data, so that next time, if the data can be found in the stripe cache,
      disk access can be avoided.
      2. stable data. Data is copied from the bio to the stripe cache and parity is
      calculated from it. Data written to disk comes from the stripe cache, so if the
      upper layer changes the bio data, the data written to disk isn't impacted.
      
      In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
      can guarantee 2 too. As for 1, it's not common either: the block plug mechanism
      will dispatch a bunch of sequential small requests together, and since I'm using
      SSDs, I'm using a small chunk size. It's a rare case where the stripe cache is
      really useful.
      
      So I'd like to avoid the copy from bio to stripe cache, and it's very helpful
      for performance. In my 1M randwrite tests, avoiding the copy increases
      performance by more than 30%.
      
      Of course, this shouldn't be enabled by default. It has been reported before
      that enabling BDI_CAP_STABLE_WRITES can harm some workloads, so I added an
      option to control it.
      
      Neilb:
        changed BUG_ON to WARN_ON
        Removed some assignments from raid5_build_block which are now not needed.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      d592a996
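      A rough sketch of how such an option can take effect (illustrative only:
      the skip_copy/R5_SkipCopy names are recalled from this patch, while
      write_covers_whole_page() and copy_bio_data_to_stripe_page() are
      hypothetical placeholders for the real helpers): when the option is on
      and the write fully overwrites the stripe page, the stripe points at the
      bio's page instead of copying the data.

        if (conf->skip_copy &&
            test_bit(R5_OVERWRITE, &dev->flags) &&
            write_covers_whole_page(wbi)) {             /* hypothetical helper */
                set_bit(R5_SkipCopy, &dev->flags);
                dev->page = bio_page(wbi);              /* reuse the bio's page */
        } else {
                copy_bio_data_to_stripe_page(wbi, dev); /* the old copy path */
        }

      Per the patch, the option is a per-array control rather than a default,
      so it should only be turned on where the upper layer (or
      BDI_CAP_STABLE_WRITES) guarantees stable pages.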
    • md/bitmap: remove confusing code from filemap_get_page. · f2e06c58
      By NeilBrown
      file_page_index(store, 0) is *always* 0.
      This is because the bitmap sb, at 256 bytes, is *always* less than
      one page.
      So subtracting it has no effect and the code should be removed.
      Reported-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
      f2e06c58
    • raid5: avoid release list until last reference of the stripe · cf170f3f
      By Eivind Sarto
      The (lockless) release_list reduces lock contention, but there is excessive
      queueing and dequeuing of stripes on this list.  A stripe will currently be
      queued on the release_list with a stripe reference count > 1.  This can cause
      the raid5 kernel thread(s) to dequeue the stripe and decrement the refcount
      without doing any other useful processing of the stripe.  There are two cases
      when the stripe can be put on the release_list multiple times before it is
      actually handled by the kernel thread(s).
      1) make_request() activates the stripe processing in 4k increments.  When a
         write request is large enough to span multiple chunks of a stripe_head, the
         first 4k chunk adds the stripe to the plug list.  The next 4k chunk that is
         processed for the same stripe puts the stripe on the release_list with a
         refcount=2.  This can cause the kernel thread to process and decrement the
         stripe before the stripe is unplugged, which again will put it back on the
         release_list.
      2) Whenever IO is scheduled on a stripe (pre-read and/or write), the stripe
         refcount is set to the number of active IO (for each chunk).  The stripe is
         released as each IO completes, and can be queued and dequeued multiple times
         on the release_list, until its refcount finally reaches zero.
      
      This simple patch will ensure a stripe is only queued on the release_list when
      its refcount=1 and is ready to be handled by the kernel thread(s).  I added some
      instrumentation to raid5 and counted the number of times stripes were queued on
      the release_list for a variety of write IO sizes.  Without this patch the number
      of times stripes got queued on the release_list was 100-500% higher than with
      the patch.  The excess queuing will increase with the IO size.  The patch also
      improved throughput by 5-10%.
      Signed-off-by: Eivind Sarto <esarto@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      cf170f3f
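      A minimal sketch of the guard this adds in release_stripe() (simplified;
      the real fast path uses a lockless llist plus a slow path under
      device_lock, both elided here, and queue_on_release_list() is a
      placeholder):

        static void release_stripe(struct stripe_head *sh)
        {
                /* If the count is still > 1, just drop our reference: the
                 * stripe is not yet idle and somebody else will queue it. */
                if (atomic_add_unless(&sh->count, -1, 1))
                        return;

                /* count == 1 here: this is the last reference, so queueing
                 * on the release_list now does useful work for raid5d. */
                queue_on_release_list(sh);
        }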
    • md: md_clear_badblocks should return an error code on failure. · 8b32bf5e
      By NeilBrown
      Julia Lawall and coccinelle report that md_clear_badblocks always
      returns 0, despite appearing to have an error path.
      The error path really should return an error code.  ENOSPC is
      reasonably appropriate.
      Reported-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Signed-off-by: NeilBrown <neilb@suse.de>
      8b32bf5e
    • md/raid56: Don't perform reads to support writes until stripe is ready. · 67f45548
      By NeilBrown
      If it is found that we need to pre-read some blocks before a write
      can succeed, we normally set STRIPE_DELAYED and don't actually perform
      the read until STRIPE_PREREAD_ACTIVE subsequently gets set.
      
      However for a degraded RAID6 we currently perform the reads as soon
      as we see that a write is pending.  This significantly hurts
      throughput.
      
      So:
       - when handle_stripe_dirtying finds a block that it wants on a device
         that has failed, set STRIPE_DELAYED instead of doing nothing, and
       - when fetch_block detects that a read might be required to satisfy a
         write, only perform the read if STRIPE_PREREAD_ACTIVE is set,
         and if we would actually need to read something to complete the write.
      
      This also helps RAID5, though less often, as RAID5 supports a
      read-modify-write cycle.  For RAID5 the read is performed too early
      only if the write is not a full 4K aligned write (i.e. not an
      R5_OVERWRITE).
      
      Also clean up a couple of horrible bits of formatting.
      Reported-by: Patrik Horník <patrik@dsl.sk>
      Signed-off-by: NeilBrown <neilb@suse.de>
      67f45548
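      A hedged sketch of the gating added in fetch_block() (simplified;
      need_read_for_write stands in for the real predicate and is
      hypothetical): a read whose only purpose is to enable a pending write is
      deferred until the stripe has been through the delayed list.

        if (need_read_for_write) {
                if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
                        set_bit(R5_Wantread, &dev->flags);   /* read now */
                else
                        set_bit(STRIPE_DELAYED, &sh->state); /* wait first */
        }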
    • md: refuse to change shape of array if it is active but read-only · bd8839e0
      By NeilBrown
      read-only arrays should not be changed.  This includes changing
      the level, layout, size, or number of devices.
      
      So reject those changes for readonly arrays.
      Signed-off-by: NeilBrown <neilb@suse.de>
      bd8839e0
    • md: always set MD_RECOVERY_INTR when interrupting a reshape thread. · 2ac295a5
      By NeilBrown
      Commit 8313b8e5
         md: fix problem when adding device to read-only array with bitmap.
      
      added a call to md_reap_sync_thread() which can cause a reshape thread
      to be interrupted (in particular, it could cause md_thread() to never even
      call md_do_sync()).
      However it didn't set MD_RECOVERY_INTR, so ->finish_reshape() would not
      know that the reshape didn't complete.
      
      This only happens when mddev->ro is set, and normally reshape threads
      don't run in that situation.  But raid5 and raid10 can start a reshape
      thread during "run" if the array is in the middle of a reshape.
      They do this even if ->ro is set.
      
      So it is best to set MD_RECOVERY_INTR before aborting the
      sync thread, just in case.
      
      Though it is rare for this to trigger a problem, it can cause data corruption
      because the reshape isn't finished properly.
      So it is suitable for any -stable kernel to which the offending commit was
      applied (3.2 or later).
      
      Fixes: 8313b8e5
      Cc: stable@vger.kernel.org (3.2+)
      Signed-off-by: NeilBrown <neilb@suse.de>
      2ac295a5
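      The essence of the fix, as a sketch (the surrounding md.c context is
      elided): record the interruption before reaping the thread, so
      ->finish_reshape() does not treat the aborted reshape as complete.

        /* Before reaping the sync/reshape thread, mark it interrupted;
         * otherwise finish_reshape() assumes the reshape finished cleanly. */
        set_bit(MD_RECOVERY_INTR, &mddev->recovery);
        md_reap_sync_thread(mddev);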
  7. 28 May 2014: 1 commit
    • md: always set MD_RECOVERY_INTR when aborting a reshape or other "resync". · 3991b31e
      By NeilBrown
      If mddev->ro is set, md_do_sync will (correctly) abort.
      However in that case MD_RECOVERY_INTR isn't set.
      
      If a RESHAPE had been requested, then ->finish_reshape() will be
      called and it will think the reshape was successful even though
      nothing happened.
      
      Normally a resync will not be requested if ->ro is set, but if an
      array is stopped while a reshape is on-going, then when the array is
      started, the reshape will be restarted.  If the array is also set
      read-only at this point, the reshape will instantly appear to succeed,
      resulting in data corruption.
      
      Consequently, this patch is suitable for any -stable kernel.
      
      Cc: stable@vger.kernel.org (any)
      Signed-off-by: NeilBrown <neilb@suse.de>
      3991b31e
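      A sketch of the corresponding early exit in md_do_sync() (simplified):
      a read-only array never syncs, and the abort must now be flagged so a
      restarted reshape is not reported as successful.

        if (mddev->ro) {
                /* never try to sync/reshape a read-only array; make sure
                 * finish_reshape() sees an interruption, not a success */
                set_bit(MD_RECOVERY_INTR, &mddev->recovery);
                return;
        }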
  8. 27 May 2014: 2 commits
  9. 21 May 2014: 1 commit
    • dm thin: add 'no_space_timeout' dm-thin-pool module param · 80c57893
      By Mike Snitzer
      Commit 85ad643b ("dm thin: add timeout to stop out-of-data-space mode
      holding IO forever") introduced a fixed 60 second timeout.  Users may
      want to either disable or modify this timeout.
      
      Allow the out-of-data-space timeout to be configured using the
      'no_space_timeout' dm-thin-pool module param.  Setting it to 0 will
      disable the timeout, resulting in IO being queued until more data space
      is added to the thin-pool.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.14+
      80c57893
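      A hedged sketch of how such a module parameter is typically declared in
      dm-thin.c (the variable name shown here is illustrative):

        /* out-of-data-space queueing timeout, in seconds; 0 disables it */
        static unsigned no_space_timeout_secs = 60;

        module_param_named(no_space_timeout, no_space_timeout_secs, uint,
                           S_IRUGO | S_IWUSR);
        MODULE_PARM_DESC(no_space_timeout,
                         "Out of data space queue IO timeout in seconds");

      It can then be set at load time (e.g. modprobe dm_thin_pool
      no_space_timeout=0) or at runtime through
      /sys/module/dm_thin_pool/parameters/no_space_timeout, assuming the usual
      dm_thin_pool module name.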
  10. 15 May 2014: 4 commits
    • dm mpath: fix lock order inconsistency in multipath_ioctl · 4cdd2ad7
      By Mike Snitzer
      Commit 3e9f1be1 ("dm mpath: remove process_queued_ios()") did not
      consistently take the multipath device's spinlock (m->lock) before
      calling dm_table_run_md_queue_async() -- which takes the q->queue_lock.
      
      Found by code inspection, using a hint from the reported lockdep warning.
      Reported-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      4cdd2ad7
    • dm thin: add timeout to stop out-of-data-space mode holding IO forever · 85ad643b
      By Joe Thornber
      If the pool runs out of data space, dm-thin can be configured to
      either error IOs that would trigger provisioning, or hold those IOs
      until the pool is resized.  Unfortunately, holding IOs until the pool is
      resized can result in a cascade of tasks hitting the hung_task_timeout,
      which may render the system unavailable.
      
      Add a fixed timeout so IOs can only be held for a maximum of 60 seconds.
      If LVM is going to resize a thin-pool that is out of data space it needs
      to be prompt about it.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.14+
      85ad643b
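      A rough sketch of the mechanism (function and field names are recalled
      from dm-thin.c and should be treated as illustrative): entering
      out-of-data-space mode queues a delayed work item, and if the pool has
      not been resized when it fires, held IO stops being queued.

        #define NO_SPACE_TIMEOUT (HZ * 60)

        static void do_no_space_timeout(struct work_struct *ws)
        {
                struct pool *pool = container_of(to_delayed_work(ws),
                                                 struct pool, no_space_timeout);

                /* still out of data space after 60s: stop queueing IO */
                if (get_pool_mode(pool) == PM_OUT_OF_DATA_SPACE &&
                    !pool->pf.error_if_no_space)
                        set_pool_mode(pool, PM_READ_ONLY);
        }

        /* ...and when the pool enters PM_OUT_OF_DATA_SPACE mode... */
        queue_delayed_work(pool->wq, &pool->no_space_timeout, NO_SPACE_TIMEOUT);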
    • dm thin: allow metadata commit if pool is in PM_OUT_OF_DATA_SPACE mode · 8d07e8a5
      By Joe Thornber
      Commit 3e1a0699 ("dm thin: fix out of data space handling") introduced
      a regression in the metadata commit() method by returning an error if
      the pool is in PM_OUT_OF_DATA_SPACE mode.  This oversight caused a thin
      device to return errors even if the default queue_if_no_space ENOSPC
      handling mode is used.
      
      Fix commit() to only fail if pool is in PM_READ_ONLY or PM_FAIL mode.
      
      Reported-by: qindehua@163.com
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.14+
      8d07e8a5
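      A sketch of the corrected guard in the pool's commit() method
      (simplified; only the mode check matters here): PM_OUT_OF_DATA_SPACE may
      still commit metadata, while PM_READ_ONLY and PM_FAIL may not.

        static int commit(struct pool *pool)
        {
                int r;

                /* only read-only and failed pools must refuse to commit */
                if (get_pool_mode(pool) >= PM_READ_ONLY)
                        return -EINVAL;

                r = dm_pool_commit_metadata(pool->pmd);
                if (r)
                        metadata_operation_failed(pool, "dm_pool_commit_metadata", r);

                return r;
        }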
    • dm crypt: fix cpu hotplug crash by removing per-cpu structure · 610f2de3
      By Mikulas Patocka
      The DM crypt target used per-cpu structures to hold pointers to an
      ablkcipher_request structure.  The code assumed that the work item keeps
      executing on a single CPU, so it didn't use synchronization when
      accessing this structure.
      
      If a CPU is disabled by writing 0 to /sys/devices/system/cpu/cpu*/online,
      the work item could be moved to another CPU.  This causes dm-crypt
      crashes, like the following, because the code starts using an incorrect
      ablkcipher_request:
      
       smpboot: CPU 7 is now offline
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000130
       IP: [<ffffffffa1862b3d>] crypt_convert+0x12d/0x3c0 [dm_crypt]
       ...
       Call Trace:
        [<ffffffffa1864415>] ? kcryptd_crypt+0x305/0x470 [dm_crypt]
        [<ffffffff81062060>] ? finish_task_switch+0x40/0xc0
        [<ffffffff81052a28>] ? process_one_work+0x168/0x470
        [<ffffffff8105366b>] ? worker_thread+0x10b/0x390
        [<ffffffff81053560>] ? manage_workers.isra.26+0x290/0x290
        [<ffffffff81058d9f>] ? kthread+0xaf/0xc0
        [<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120
        [<ffffffff813464ac>] ? ret_from_fork+0x7c/0xb0
        [<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120
      
      Fix this bug by removing the per-cpu definition.  The structure
      ablkcipher_request is accessed via a pointer from convert_context.
      Consequently, if the work item is rescheduled to a different CPU, the
      thread still uses the same ablkcipher_request.
      
      This change may undermine performance improvements intended by commit
      c0297721 ("dm crypt: scale to multiple cpus") on select hardware.  In
      practice no performance difference was observed on recent hardware.  But
      regardless, correctness is more important than performance.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      610f2de3
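      A sketch of the resulting layout (simplified from the dm-crypt.c of that
      era): the ablkcipher_request now travels with the per-request
      convert_context instead of living in a per-cpu structure, so it stays
      valid if the work item migrates to another CPU.

        struct convert_context {
                struct completion restart;
                struct bio *bio_in;
                struct bio *bio_out;
                /* ... bvec iterators and sector elided ... */
                atomic_t cc_pending;
                struct ablkcipher_request *req; /* was per-cpu before this fix */
        };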
  11. 06 May 2014: 2 commits
    • md: avoid possible spinning md thread at shutdown. · 0f62fb22
      By NeilBrown
      If an md array with externally managed metadata (e.g. DDF or IMSM)
      is in use, then we should not set safemode==2 at shutdown because:
      
      1/ this is ineffective: user-space needs to be involved in any 'safemode' handling,
      2/ The safemode management code doesn't cope with safemode==2 on external metadata
         and md_check_recovery enters an infinite loop.
      
      Even at shutdown, an infinite-looping process can be problematic, so this
      could cause shutdown to hang.
      
      Cc: stable@vger.kernel.org (any kernel)
      Signed-off-by: NeilBrown <neilb@suse.de>
      0f62fb22
    • md/raid10: call wait_barrier() for each request submitted. · cc13b1d1
      By NeilBrown
      wait_barrier() includes a counter, so we must call it precisely once
      (unless balanced by allow_barrier()) for each request submitted.
      
      Since
      commit 20d0189b
          block: Introduce new bio_split()
      in 3.14-rc1, we don't call it for the extra requests generated when
      we need to split a bio.
      
      When this happens the counter goes negative, any resync/recovery will
      never start, and "mdadm --stop" will hang.
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Fixes: 20d0189b
      Cc: stable@vger.kernel.org (3.14+)
      Cc: Kent Overstreet <kmo@daterainc.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      cc13b1d1
  12. 02 May 2014: 1 commit
  13. 29 Apr 2014: 1 commit
  14. 18 Apr 2014: 1 commit
  15. 17 Apr 2014: 1 commit
  16. 16 Apr 2014: 2 commits
    • block: remove struct request buffer member · b4f42e28
      By Jens Axboe
      This was used in the olden days, back when onions were proper
      yellow. Basically it mapped to the current buffer to be
      transferred. With highmem being added more than a decade ago,
      most drivers map pages out of a bio, and rq->buffer isn't
      pointing at anything valid.
      
      Convert old style drivers to just use bio_data().
      
      For the discard payload use case, just reference the page
      in the bio.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b4f42e28
    • dm verity: fix biovecs hash calculation regression · 3a774521
      By Milan Broz
      Commit 003b5c57 ("block: Convert drivers
      to immutable biovecs") incorrectly converted biovec iteration in
      dm-verity to always calculate the hash from a full biovec, but the
      function only needs to calculate the hash from part of the biovec (up to
      the calculated "todo" value).
      
      Fix this issue by limiting hash input to only the requested data size.
      
      This problem was identified using the cryptsetup regression test for
      veritysetup (verity-compat-test).
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      Acked-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.14+
      3a774521
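      A sketch of the fix in the verity hashing loop (simplified; variable
      names follow the surrounding dm-verity code of that era): the length fed
      to the hash is clamped to the remaining "todo" bytes instead of always
      covering the whole biovec.

        struct bio_vec bv = bio_iter_iovec(bio, io->iter);
        unsigned len = bv.bv_len;
        u8 *page;

        if (likely(len >= todo))
                len = todo;             /* hash only what this block needs */

        page = kmap_atomic(bv.bv_page);
        r = crypto_shash_update(desc, page + bv.bv_offset, len);
        kunmap_atomic(page);
        bio_advance_iter(bio, &io->iter, len);
        todo -= len;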
  17. 09 Apr 2014: 2 commits
    • raid5: get_active_stripe avoids device_lock · e240c183
      By Shaohua Li
      For sequential workloads (or workloads with big request sizes), get_active_stripe
      can find a cached stripe.  In this case we always hold device_lock, which exposes
      a lot of lock contention for such workloads.  If the stripe count isn't 0, we
      don't actually need to hold the lock, since we just increase the count.  And this
      is the hot code path for such workloads.  Unfortunately we must delete the BUG_ON.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      e240c183
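      A hedged sketch of the lock-avoiding fast path in get_active_stripe()
      (heavily simplified; the retry/quiesce handling is omitted and
      init_stripe_from_lru() is a placeholder): a cached stripe with a
      non-zero count is taken with a bare atomic increment, and device_lock is
      only needed when the count is zero.

        sh = __find_stripe(conf, sector, conf->generation - previous);
        if (sh && !atomic_inc_not_zero(&sh->count)) {
                /* count was 0: the stripe may sit on an lru list, so the
                 * slower, locked path is still required */
                spin_lock(&conf->device_lock);
                if (!atomic_read(&sh->count))
                        init_stripe_from_lru(sh);
                atomic_inc(&sh->count);
                spin_unlock(&conf->device_lock);
        }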
    • raid5: make_request does less prepare wait · 27c0f68f
      By Shaohua Li
      On a NUMA machine, prepare_to_wait/finish_wait in make_request exposes a
      lot of contention for sequential workloads (or workloads with big request
      sizes).  For such workloads, each bio includes several stripes, so we
      can just do prepare_to_wait/finish_wait once for the whole bio instead
      of for every stripe.  This removes the lock contention completely for such
      workloads.  Random workloads might have similar lock contention too,
      but I haven't seen it yet, maybe because my storage is still not fast
      enough.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      27c0f68f
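      A hedged sketch of the reworked wait pattern in make_request() (heavily
      simplified; stripe_busy is a placeholder condition): prepare_to_wait()
      runs once per bio, and is re-armed inside the per-stripe loop only after
      an actual schedule().

        DEFINE_WAIT(w);
        bool do_prepare;

        prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
        for (; logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
                do_prepare = false;
        retry:
                if (do_prepare)
                        prepare_to_wait(&conf->wait_for_overlap, &w,
                                        TASK_UNINTERRUPTIBLE);
                /* ... try to get and queue the stripe for this 4k chunk ... */
                if (stripe_busy) {
                        schedule();
                        do_prepare = true;
                        goto retry;
                }
        }
        finish_wait(&conf->wait_for_overlap, &w);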