1. 08 Aug, 2014 3 commits
  2. 02 Aug, 2014 2 commits
    • dm cache: fix race affecting dirty block count · 44fa816b
      Anssi Hannula authored
      nr_dirty is updated without locking, causing it to drift so that it is
      non-zero (either a small positive integer, or a very large one when an
      underflow occurs) even when there are no actual dirty blocks.  This was
      due to a race between the workqueue and map function accessing nr_dirty
      in parallel without proper protection.
      
      People were seeing underruns due to a race on increment/decrement of
      nr_dirty, see: https://lkml.org/lkml/2014/6/3/648
      
      Fix this by using an atomic_t for nr_dirty.
      
      Reported-by: roma1390@gmail.com
      Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      44fa816b
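      The fix described above is the standard pattern of replacing an unlocked counter
      with an atomic_t. A minimal sketch, with made-up structure and helper names rather
      than the actual dm-cache code:

      #include <linux/atomic.h>

      struct example_cache_stats {            /* hypothetical, not the real struct cache */
              atomic_t nr_dirty;              /* was: an unlocked unsigned counter */
      };

      static void example_set_dirty(struct example_cache_stats *s)
      {
              atomic_inc(&s->nr_dirty);       /* safe from the map path and the workqueue */
      }

      static void example_clear_dirty(struct example_cache_stats *s)
      {
              atomic_dec(&s->nr_dirty);       /* no lost updates, so no drift or underflow */
      }

      static unsigned example_nr_dirty(struct example_cache_stats *s)
      {
              return (unsigned) atomic_read(&s->nr_dirty);
      }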
    • dm bufio: fully initialize shrinker · d8c712ea
      Greg Thelen authored
      1d3d4437 ("vmscan: per-node deferred work") added a flags field to
      struct shrinker assuming that all shrinkers were zero filled.  The dm
      bufio shrinker is not zero filled, which leaves arbitrary kmalloc() data
      in flags.  So far the only defined flags bit is SHRINKER_NUMA_AWARE.
      But there are proposed patches which add other bits to shrinker.flags
      (e.g. memcg awareness).
      
      Rather than simply initializing the shrinker, this patch uses kzalloc()
      when allocating the dm_bufio_client to ensure that the embedded shrinker
      and any other similar structures are zeroed.
      
      This fixes theoretical over-aggressive shrinking of dm bufio objects.
      If the uninitialized dm_bufio_client.shrinker.flags contains
      SHRINKER_NUMA_AWARE then shrink_slab() would call the dm shrinker for
      each numa node rather than just once.  This has been broken since 3.12.
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # v3.12+
      d8c712ea
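      A minimal sketch of the allocation pattern the commit describes, assuming a
      simplified stand-in for dm_bufio_client: zero-filling the whole structure
      guarantees the embedded shrinker's flags start at 0.

      #include <linux/slab.h>
      #include <linux/shrinker.h>

      struct example_client {                 /* stand-in for dm_bufio_client */
              struct shrinker shrinker;       /* embedded, so it inherits the allocation's bytes */
              /* ... other fields ... */
      };

      static struct example_client *example_client_create(void)
      {
              /* kzalloc() zero-fills, so shrinker.flags cannot hold stale kmalloc()
               * data such as a spurious SHRINKER_NUMA_AWARE bit. */
              struct example_client *c = kzalloc(sizeof(*c), GFP_KERNEL);

              if (!c)
                      return NULL;
              /* ... initialize fields, then register_shrinker(&c->shrinker) ... */
              return c;
      }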
  3. 31 Jul, 2014 2 commits
    • md: disable probing for md devices 512 and over. · af5628f0
      NeilBrown authored
      The way md devices are traditionally created in the kernel
      is simply to open the device with the desired major/minor number.
      
      This can be problematic as some support tools, notably udev and
      programs run by udev, can open a device just to see what is there, and
      find that it has created something.  It is easy for a race to cause
      udev to open an md device just after it was destroyed, causing it to
      suddenly re-appear.
      
      For some time we have had an alternate way to create md devices
        echo md_somename > /sys/module/md_mod/parameters/new_array
      
      This will always use a minor number of 512 or higher, which mdadm
      normally avoids.
      Using this makes the creation-by-opening unnecessary, but does
      not disable it, so it is still there to cause problems.
      
      This patch disables probing for devices with a major of 9 (MD_MAJOR)
      and a minor of 512 and up.  Thus devices created by writing to
      new_array cannot be re-created by opening the node in /dev.
      Signed-off-by: NeilBrown <neilb@suse.de>
      af5628f0
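      The listing doesn't show the mechanics of the change; as a rough, hypothetical
      illustration (not the actual md.c code), a block-layer probe callback can simply
      decline to auto-create anything in the reserved minor range:

      #include <linux/kdev_t.h>
      #include <linux/kobject.h>

      #define EXAMPLE_RESERVED_MINOR  512     /* minors used by new_array-created arrays */

      /* Hypothetical probe callback with the blk_register_region() probe shape. */
      static struct kobject *example_md_probe(dev_t dev, int *part, void *data)
      {
              /* Never auto-create arrays in the reserved range, so a stray open()
               * from udev cannot resurrect a device created via new_array. */
              if (MINOR(dev) >= EXAMPLE_RESERVED_MINOR)
                      return NULL;

              /* ... normal open-time auto-creation would happen here ... */
              return NULL;
      }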
    • md/raid1,raid10: always abort recover on write error. · 2446dba0
      NeilBrown authored
      Currently we don't abort recovery on a write error if the write error
      to the recovering device was triggered by normal IO (as opposed to
      recovery IO).
      
      This means that for one bitmap region, the recovery might write to the
      recovering device for a few sectors, then not bother for subsequent
      sectors (as it never writes to failed devices).  In this case
      the bitmap bit will be cleared, but it really shouldn't.
      
      The result is that if the recovering device fails and is then re-added
      (after fixing whatever hardware problem triggered the failure),
      the second recovery won't redo the region it was in the middle of,
      so some of the device will not be recovered properly.
      
      If we abort the recovery, the region being processed will be cancelled
      (bit not cleared) and the whole region will be retried.
      
      As the bug can result in data corruption the patch is suitable for
      -stable.  For kernels prior to 3.11 there is a conflict in raid10.c
      which will require care.
      
      Original-from: jiao hui <jiaohui@bwstor.com.cn>
      Reported-and-tested-by: jiao hui <jiaohui@bwstor.com.cn>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@vger.kernel.org
      2446dba0
  4. 16 Jul, 2014 3 commits
    • sched: Remove proliferation of wait_on_bit() action functions · 74316201
      NeilBrown authored
      The current "wait_on_bit" interface requires an 'action'
      function to be provided which does the actual waiting.
      There are over 20 such functions, many of them identical.
      Most cases can be satisfied by one of just two functions, one
      which uses io_schedule() and one which just uses schedule().
      
      So:
       Rename wait_on_bit and        wait_on_bit_lock to
              wait_on_bit_action and wait_on_bit_lock_action
       to make it explicit that they need an action function.
      
       Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
       which are *not* given an action function but implicitly use
       a standard one.
       The decision to error-out if a signal is pending is now made
       based on the 'mode' argument rather than being encoded in the action
       function.
      
       All instances of the old wait_on_bit and wait_on_bit_lock which
       can use the new version have been changed accordingly and their
       action functions have been discarded.
       wait_on_bit{_lock} does not return any specific error code in the
       event of a signal so the caller must check for non-zero and
       interpolate their own error code as appropriate.
      
      The wait_on_bit() call in __fscache_wait_on_invalidate() was
      ambiguous as it specified TASK_UNINTERRUPTIBLE but used
      fscache_wait_bit_interruptible as an action function.
      David Howells confirms this should be uniformly
      "uninterruptible"
      
      The main remaining user of wait_on_bit{,_lock}_action is NFS
      which needs to use a freezer-aware schedule() call.
      
      A comment in fs/gfs2/glock.c notes that having multiple 'action'
      functions is useful as they display differently in the 'wchan'
      field of 'ps'. (and /proc/$PID/wchan).
      As the new bit_wait{,_io} functions are tagged "__sched", they
      will not show up at all, but something higher in the stack.  So
      the distinction will still be visible, only with different
      function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
      gfs2/glock.c case).
      
      Since the first version of this patch (against 3.15) two new action
      functions appeared, one in NFS and one in CIFS.  CIFS also now
      uses an action function that makes the same freezer aware
      schedule call as NFS.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
      Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steve French <sfrench@samba.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      74316201
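      A small sketch of what a converted caller looks like under the new interface
      described above; the flags word and bit are invented, but wait_on_bit(),
      wait_on_bit_io() and the mode argument are the interfaces this commit introduces:

      #include <linux/wait.h>
      #include <linux/sched.h>
      #include <linux/errno.h>

      #define EXAMPLE_BIT_BUSY 0              /* made-up bit in a made-up flags word */

      /*
       * Old style (now removed for most callers): an explicit action function, e.g.
       *     wait_on_bit(&flags, EXAMPLE_BIT_BUSY, my_action, TASK_UNINTERRUPTIBLE);
       * New style: the mode alone decides interruptible vs uninterruptible, and
       * wait_on_bit_io() picks io_schedule() instead of schedule().
       */
      static int example_wait_for_bit(unsigned long *flags)
      {
              if (wait_on_bit(flags, EXAMPLE_BIT_BUSY, TASK_INTERRUPTIBLE))
                      return -EINTR;          /* non-zero only means "signal"; pick our own errno */
              return 0;
      }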
    • dm cache metadata: do not allow the data block size to change · 048e5a07
      Mike Snitzer authored
      The block size for the dm-cache's data device must remain fixed for
      the life of the cache.  Disallow any attempt to change the cache's data
      block size.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org
      048e5a07
    • dm thin metadata: do not allow the data block size to change · 9aec8629
      Mike Snitzer authored
      The block size for the thin-pool's data device must remain fixed for
      the life of the thin-pool.  Disallow any attempt to change the
      thin-pool's data block size.
      
      It should be noted that attempting to change the data block size via
      thin-pool table reload will be ignored as a side-effect of the thin-pool
      handover that the thin-pool target does during thin-pool table reload.
      
      Here is an example outcome of attempting to load a thin-pool table that
      reduced the thin-pool's data block size from 1024K to 512K.
      
      Before:
      kernel: device-mapper: thin: 253:4: growing the data device from 204800 to 409600 blocks
      
      After:
      kernel: device-mapper: thin metadata: changing the data block size (from 2048 to 1024) is not supported
      kernel: device-mapper: table: 253:4: thin-pool: Error creating metadata object
      kernel: device-mapper: ioctl: error adding target to table
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org
      9aec8629
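      Both "data block size" commits above add the same kind of guard when the metadata
      is opened; roughly (with invented function names, not the actual dm code):

      #include <linux/kernel.h>
      #include <linux/errno.h>

      #define DM_MSG_PREFIX "example"         /* required by the DMERR() macro */
      #include <linux/device-mapper.h>

      /* Hypothetical check performed while (re)opening the pool/cache metadata. */
      static int example_check_data_block_size(unsigned existing, unsigned requested)
      {
              if (requested != existing) {
                      DMERR("changing the data block size (from %u to %u) is not supported",
                            existing, requested);
                      return -EINVAL;         /* the table load fails, as in the log above */
              }
              return 0;
      }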
  5. 11 Jul, 2014 4 commits
    • dm mpath: fix IO hang due to logic bug in multipath_busy · 7a7a3b45
      Jun'ichi Nomura authored
      Commit e8099177 ("dm mpath: push back requests instead of queueing")
      modified multipath_busy() to return true if !pg_ready().  pg_ready()
      checks the current state of the multipath device and may return false
      even if a new IO is needed to change the state.
      
      Bart Van Assche reported that he had multipath IO lockup when he was
      performing cable pull tests.  Analysis showed that the multipath
      device had a single path group with both paths active, but that the
      path group itself was not active.  During the multipath device state
      transitions 'queue_io' got set but nothing could clear it.  Clearing
      'queue_io' only happens in __choose_pgpath(), but it won't be called
      if multipath_busy() returns true due to pg_ready() returning false
      when 'queue_io' is set.
      
      As such the !pg_ready() check in multipath_busy() is wrong because new
      IO will not be sent to multipath target and the multipath state change
      won't happen.  That results in multipath IO lockup.
      
      The intent of multipath_busy() is to avoid unnecessary cycles of
      dequeue + request_fn + requeue if it is known that the multipath
      device will requeue.
      
      Such "busy" situations would be:
        - path group is being activated
        - there is no path and the multipath is set up to requeue if no path
      
      Fix multipath_busy() to return "busy" early only for these specific
      situations.
      Reported-by: Bart Van Assche <bvanassche@acm.org>
      Tested-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # v3.15
      7a7a3b45
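      A condensed sketch of the corrected decision described above; the structure and
      field names are placeholders, not the real struct multipath:

      #include <linux/types.h>

      struct example_multipath {              /* placeholder, not the real struct multipath */
              unsigned pg_init_in_progress;   /* a path group is being activated */
              unsigned nr_valid_paths;
              bool queue_if_no_path;
      };

      static bool example_multipath_busy(struct example_multipath *m)
      {
              /* Busy while a path-group activation is in flight ... */
              if (m->pg_init_in_progress)
                      return true;

              /* ... or when there is no usable path and we are configured to
               * queue IO until one appears. */
              if (!m->nr_valid_paths && m->queue_if_no_path)
                      return true;

              /* Otherwise let requests through so __choose_pgpath() can run,
               * clear 'queue_io', and move the state machine forward. */
              return false;
      }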
    • dm io: fix a race condition in the wake up code for sync_io · 10f1d5d1
      Joe Thornber authored
      There's a race condition between the atomic_dec_and_test(&io->count)
      in dec_count() and the waking of the sync_io() thread.  If the thread
      is spuriously woken immediately after the decrement it may exit,
      making the on stack io struct invalid, yet the dec_count could still
      be using it.
      
      Fix this race by using a completion in sync_io() and dec_count().
      Reported-by: Minfei Huang <huangminfei@ucloud.cn>
      Signed-off-by: Joe Thornber <thornber@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      10f1d5d1
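      The completion-based pattern the fix relies on, in generic form (field names are
      illustrative): the waiter blocks on an on-stack struct completion and the final
      decrement signals it, so a spurious wakeup can no longer let the waiter return
      while dec_count() is still touching the stack.

      #include <linux/completion.h>
      #include <linux/atomic.h>

      struct example_io {
              atomic_t count;                 /* outstanding sub-IOs */
              struct completion *wait;        /* set for synchronous callers */
      };

      static void example_dec_count(struct example_io *io)
      {
              if (atomic_dec_and_test(&io->count))
                      complete(io->wait);     /* wakes the waiter exactly once */
      }

      static void example_sync_io(struct example_io *io, unsigned num_regions)
      {
              DECLARE_COMPLETION_ONSTACK(wait);

              atomic_set(&io->count, num_regions);
              io->wait = &wait;
              /* ... submit the regions; each end_io path calls example_dec_count() ... */
              wait_for_completion(&wait);     /* cannot return on a spurious wakeup */
      }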
    • bf14299f
    • dm: allocate a special workqueue for deferred device removal · acfe0ad7
      Mikulas Patocka authored
      The commit 2c140a24 ("dm: allow remove to be deferred") introduced a
      deferred removal feature for the device mapper.  When this feature is
      used (by passing a flag DM_DEFERRED_REMOVE to DM_DEV_REMOVE_CMD ioctl)
      and the user tries to remove a device that is currently in use, the
      device will be removed automatically in the future when the last user
      closes it.
      
      Device mapper used the system workqueue to perform deferred removals.
      However, some targets (dm-raid1, dm-mpath, dm-stripe) flush work items
      scheduled for the system workqueue from their destructor.  If the
      destructor itself is called from the system workqueue during deferred
      removal, it introduces a possible deadlock - the workqueue tries to flush
      itself.
      
      Fix this possible deadlock by introducing a new workqueue for deferred
      removals.  We allocate just one workqueue for all dm targets.  The
      ability of dm targets to process IOs isn't dependent on deferred removal
      of unused targets, so a deadlock due to shared workqueue isn't possible.
      
      Also, cleanup local_init() to eliminate potential for returning success
      on failure.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.13+
      acfe0ad7
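      A minimal sketch of the dedicated-workqueue arrangement (names are illustrative):
      because the deferred-removal work runs on its own queue, a target destructor that
      flushes the system workqueue can no longer deadlock against it.

      #include <linux/init.h>
      #include <linux/errno.h>
      #include <linux/workqueue.h>

      static void example_do_deferred_remove(struct work_struct *w)
      {
              /* ... walk the list of devices marked for deferred removal ... */
      }

      static DECLARE_WORK(example_deferred_remove_work, example_do_deferred_remove);
      static struct workqueue_struct *example_deferred_remove_wq;

      static int __init example_wq_init(void)
      {
              example_deferred_remove_wq = alloc_workqueue("example_deferred_remove", 0, 0);
              if (!example_deferred_remove_wq)
                      return -ENOMEM;
              return 0;
      }

      static void example_schedule_deferred_remove(void)
      {
              /* Was: schedule_work(&...), i.e. the shared system workqueue. */
              queue_work(example_deferred_remove_wq, &example_deferred_remove_work);
      }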
  6. 03 Jul, 2014 2 commits
  7. 12 Jun, 2014 2 commits
  8. 10 Jun, 2014 1 commit
    • raid5: speedup sync_request processing · 053f5b65
      Eivind Sarto authored
      The raid5 sync_request() processing calls handle_stripe() within the context of
      the resync-thread.  The resync-thread issues the first set of read requests
      and this adds execution latency and slows down the scheduling of the next
      sync_request().
      The current rebuild/resync speed of raid5 is not much faster than what
      rotational HDDs can sustain.
      Testing the following patch on a 6-drive array, I can increase the rebuild
      speed from 100 MB/s to 175 MB/s.
      The sync_request() now just sets STRIPE_HANDLE and releases the stripe.  This
      creates some more parallelism between the resync-thread and raid5 kernel daemon.
      Signed-off-by: Eivind Sarto <esarto@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      053f5b65
  9. 05 Jun, 2014 1 commit
    • md/raid5: deadlock between retry_aligned_read with barrier io · 2844dc32
      hui jiao authored
      A chunk aligned read increases the counter active_aligned_reads and
      decreases it after the sub-device handles it successfully.  But when a
      read error occurs, the read is redispatched by raid5d, and
      active_aligned_reads will not be decreased until we can grab a stripe
      head in retry_aligned_read.  Now suppose a barrier io comes, sets
      conf->quiesce to 2, and waits until both active_stripes and
      active_aligned_reads are zero.  The retried chunk aligned read gets
      stuck at get_active_stripe waiting until conf->quiesce becomes 0.
      retry_aligned_read and the barrier io are now waiting for each other.
      One possible solution is to ignore conf->quiesce and let the retried
      aligned read finish.  I reproduced this deadlock and tested this patch
      on CentOS 6.0.
      Signed-off-by: NeilBrown <neilb@suse.de>
      2844dc32
  10. 04 Jun, 2014 9 commits
  11. 29 May, 2014 7 commits
    • raid5: add an option to avoid copy data from bio to stripe cache · d592a996
      Shaohua Li authored
      The stripe cache has two goals:
      1. cache data, so next time if data can be found in stripe cache, disk access
      can be avoided.
      2. stable data. Data is copied from bio to stripe cache and parity is
      calculated from it.  Data written to disk comes from the stripe cache, so
      if the upper layer changes the bio data, the data written to disk isn't
      impacted.
      
      In my environment, I can guarantee 2 will not happen.  And BDI_CAP_STABLE_WRITES
      can guarantee 2 too.  For 1, it's not common either: the block plug mechanism
      will dispatch a bunch of sequential small requests together, and since I'm
      using SSDs with a small chunk size, it's a rare case where the stripe cache
      is really useful.
      
      So I'd like to avoid the copy from bio to stripe cache, and it's very helpful
      for performance.  In my 1M randwrite tests, avoiding the copy increases
      performance by more than 30%.
      
      Of course, this shouldn't be enabled by default.  It has been reported before
      that enabling BDI_CAP_STABLE_WRITES can harm some workloads, so I added an
      option to control it.
      
      Neilb:
        changed BUG_ON to WARN_ON
        Removed some assignments from raid5_build_block which are now not needed.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      d592a996
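      Conceptually, the option trades the stable copy for a page swap; a hedged sketch
      with placeholder types and field names (not the real raid5 stripe code):

      #include <linux/types.h>
      #include <linux/highmem.h>
      #include <linux/string.h>

      struct example_dev {                    /* placeholder for a stripe's per-device slot */
              struct page *page;              /* page used for parity calc and disk IO */
              struct page *orig_page;         /* the stripe cache's own page */
      };

      static void example_attach_bio_page(struct example_dev *dev,
                                          struct page *bio_page, bool skip_copy)
      {
              if (skip_copy) {
                      /* Reference the bio's page directly and restore orig_page when
                       * the write completes; requires the upper layer (or
                       * BDI_CAP_STABLE_WRITES) to keep the data stable meanwhile. */
                      dev->orig_page = dev->page;
                      dev->page = bio_page;
              } else {
                      /* Classic behaviour: copy into the stripe cache for stability. */
                      void *src = kmap_atomic(bio_page);
                      void *dst = kmap_atomic(dev->page);

                      memcpy(dst, src, PAGE_SIZE);
                      kunmap_atomic(dst);
                      kunmap_atomic(src);
              }
      }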
    • md/bitmap: remove confusing code from filemap_get_page. · f2e06c58
      NeilBrown authored
      file_page_index(store, 0) is *always* 0.
      This is because the bitmap sb, at 256 bytes, is *always* less than
      one page.
      So subtracting it has no effect and the code should be removed.
      Reported-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
      f2e06c58
    • raid5: avoid release list until last reference of the stripe · cf170f3f
      Eivind Sarto authored
      The (lockless) release_list reduces lock contention, but there is excessive
      queueing and dequeuing of stripes on this list.  A stripe will currently be
      queued on the release_list with a stripe reference count > 1.  This can cause
      the raid5 kernel thread(s) to dequeue the stripe and decrement the refcount
      without doing any other useful processing of the stripe.  There are two cases
      when the stripe can be put on the release_list multiple times before it is
      actually handled by the kernel thread(s).
      1) make_request() activates the stripe processing in 4k increments.  When a
         write request is large enough to span multiple chunks of a stripe_head, the
         first 4k chunk adds the stripe to the plug list.  The next 4k chunk that is
         processed for the same stripe puts the stripe on the release_list with a
         refcount=2.  This can cause the kernel thread to process and decrement the
         stripe before the stripe is unplugged, which again will put it back on the
         release_list.
      2) Whenever IO is scheduled on a stripe (pre-read and/or write), the stripe
         refcount is set to the number of active IO (for each chunk).  The stripe is
         released as each IO completes, and can be queued and dequeued multiple times
         on the release_list, until its refcount finally reaches zero.
      
      This simple patch will ensure a stripe is only queued on the release_list when
      its refcount=1 and is ready to be handled by the kernel thread(s).  I added some
      instrumentation to raid5 and counted the number of times stripes were queued on
      the release_list for a variety of write IO sizes.  Without this patch the number
      of times stripes got queued on the release_list was 100-500% higher than with
      the patch.  The excess queuing will increase with the IO size.  The patch also
      improved throughput by 5-10%.
      Signed-off-by: Eivind Sarto <esarto@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      cf170f3f
    • md: md_clear_badblocks should return an error code on failure. · 8b32bf5e
      NeilBrown authored
      Julia Lawall and coccinelle report that md_clear_badblocks always
      returns 0, despite appearing to have an error path.
      The error path really should return an error code.  ENOSPC is
      reasonably appropriate.
      Reported-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Signed-off-by: NeilBrown <neilb@suse.de>
      8b32bf5e
    • md/raid56: Don't perform reads to support writes until stripe is ready. · 67f45548
      NeilBrown authored
      If it is found that we need to pre-read some blocks before a write
      can succeed, we normally set STRIPE_DELAYED and don't actually perform
      the read until STRIPE_PREREAD_ACTIVE subsequently gets set.
      
      However for a degraded RAID6 we currently perform the reads as soon
      as we see that a write is pending.  This significantly hurts
      throughput.
      
      So:
       - when handle_stripe_dirtying finds a block that it wants on a device
         that is failed, set STRIPE_DELAYED, instead of doing nothing, and
       - when fetch_block detects that a read might be required to satisfy a
         write, only perform the read if STRIPE_PREREAD_ACTIVE is set,
         and if we would actually need to read something to complete the write.
      
      This also helps RAID5, though less often as RAID5 supports a
      read-modify-write cycle.  For RAID5 the read is performed too early
      only if the write is not a full 4K aligned write (i.e. not an
      R5_OVERWRITE).
      
      Also clean up a couple of horrible bits of formatting.
      Reported-by: Patrik Horník <patrik@dsl.sk>
      Signed-off-by: NeilBrown <neilb@suse.de>
      67f45548
    • md: refuse to change shape of array if it is active but read-only · bd8839e0
      NeilBrown authored
      read-only arrays should not be changed.  This includes changing
      the level, layout, size, or number of devices.
      
      So reject those changes for readonly arrays.
      Signed-off-by: NeilBrown <neilb@suse.de>
      bd8839e0
    • md: always set MD_RECOVERY_INTR when interrupting a reshape thread. · 2ac295a5
      NeilBrown authored
      Commit 8313b8e5
         md: fix problem when adding device to read-only array with bitmap.
      
      added a call to md_reap_sync_thread() which causes a reshape thread
      to be interrupted (in particular, it could cause md_thread() to never even
      call md_do_sync()).
      However it didn't set MD_RECOVERY_INTR so ->finish_reshape() would not
      know that the reshape didn't complete.
      
      This only happens when mddev->ro is set and normally reshape threads
      don't run in that situation.  But raid5 and raid10 can start a reshape
      thread during "run" if the array is in the middle of a reshape.
      They do this even if ->ro is set.
      
      So it is best to set MD_RECOVERY_INTR before aborting the
      sync thread, just in case.
      
      Though it is rare for this to trigger a problem, it can cause data corruption
      because the reshape isn't finished properly.
      So it is suitable for any -stable kernel to which the offending commit was
      applied (3.2 or later).
      
      Fixes: 8313b8e5
      Cc: stable@vger.kernel.org (3.2+)
      Signed-off-by: NeilBrown <neilb@suse.de>
      2ac295a5
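      The ordering the commit argues for, as a schematic fragment (the structure here is
      a placeholder; MD_RECOVERY_INTR and md_reap_sync_thread() are the real md names):

      #include <linux/bitops.h>

      #define EXAMPLE_RECOVERY_INTR 3         /* placeholder bit, stands in for MD_RECOVERY_INTR */

      struct example_mddev {
              unsigned long recovery;         /* recovery/reshape state bits */
      };

      static void example_interrupt_reshape(struct example_mddev *mddev)
      {
              /* Mark the interruption first, so ->finish_reshape() can tell an
               * aborted reshape from a completed one ... */
              set_bit(EXAMPLE_RECOVERY_INTR, &mddev->recovery);

              /* ... and only then reap the sync thread (md_reap_sync_thread() in md.c). */
      }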
  12. 28 May, 2014 1 commit
    • md: always set MD_RECOVERY_INTR when aborting a reshape or other "resync". · 3991b31e
      NeilBrown authored
      If mddev->ro is set, md_do_sync will (correctly) abort.
      However in that case MD_RECOVERY_INTR isn't set.
      
      If a RESHAPE had been requested, then ->finish_reshape() will be
      called and it will think the reshape was successful even though
      nothing happened.
      
      Normally a resync will not be requested if ->ro is set, but if an
      array is stopped while a reshape is on-going, then when the array is
      started, the reshape will be restarted.  If the array is also set
      read-only at this point, the reshape will instantly appear to succeed,
      resulting in data corruption.
      
      Consequently, this patch is suitable for any -stable kernel.
      
      Cc: stable@vger.kernel.org (any)
      Signed-off-by: NeilBrown <neilb@suse.de>
      3991b31e
  13. 27 May, 2014 2 commits
  14. 21 May, 2014 1 commit
    • dm thin: add 'no_space_timeout' dm-thin-pool module param · 80c57893
      Mike Snitzer authored
      Commit 85ad643b ("dm thin: add timeout to stop out-of-data-space mode
      holding IO forever") introduced a fixed 60 second timeout.  Users may
      want to either disable or modify this timeout.
      
      Allow the out-of-data-space timeout to be configured using the
      'no_space_timeout' dm-thin-pool module param.  Setting it to 0 will
      disable the timeout, resulting in IO being queued until more data space
      is added to the thin-pool.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.14+
      80c57893
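      A hedged sketch of how such a module parameter is typically declared (the variable
      name and default here are illustrative, not the actual dm-thin-pool source):

      #include <linux/module.h>
      #include <linux/moduleparam.h>
      #include <linux/stat.h>

      /* 0 disables the timeout; otherwise, seconds to hold queued IO while the
       * pool is out of data space before erroring it. */
      static unsigned int example_no_space_timeout_secs = 60;
      module_param_named(no_space_timeout, example_no_space_timeout_secs,
                         uint, S_IRUGO | S_IWUSR);
      MODULE_PARM_DESC(no_space_timeout,
                       "Out of data space queue IO timeout in seconds (0 = no timeout)");

      Assuming the module is loaded as dm_thin_pool, the value would then be adjustable
      at runtime through /sys/module/dm_thin_pool/parameters/no_space_timeout, or set at
      load time via a modprobe option.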