1. 26 June 2015, 4 commits
  2. 18 June 2015, 5 commits
  3. 17 June 2015, 1 commit
    • dm space map metadata: fix occasional leak of a metadata block on resize · 6096d91a
      Committed by Joe Thornber
      The metadata space map has a simplified 'bootstrap' mode that is
      operational when extending the space maps.  Whilst in this mode it's
      possible for some refcount decrement operations to become queued (eg, as
      a result of shadowing one of the bitmap indexes).  These decrements were
      not being applied when switching out of bootstrap mode.
      
      The effect of this bug was the leaking of a 4k metadata block.  This is
      detected by the latest version of thin_check as a non-fatal error.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      6096d91a
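
      A minimal user-space model of the bug and the fix, assuming hypothetical names
      (pending_dec, leave_bootstrap); the real code in
      drivers/md/persistent-data/dm-space-map-metadata.c is more involved:

      /*
       * Illustrative model only, not the real sm_metadata code.
       */
      #include <stdio.h>

      #define MAX_PENDING 16

      struct space_map {
          int bootstrap;                      /* simplified mode used while extending */
          unsigned refcount[64];              /* per-block reference counts */
          unsigned pending_dec[MAX_PENDING];  /* decrements queued during bootstrap */
          unsigned nr_pending;
      };

      static void dec_block(struct space_map *sm, unsigned block)
      {
          if (sm->bootstrap) {
              /* can't recurse into the space map yet, so queue the decrement */
              sm->pending_dec[sm->nr_pending++] = block;
              return;
          }
          sm->refcount[block]--;
      }

      static void leave_bootstrap(struct space_map *sm)
      {
          unsigned i;

          sm->bootstrap = 0;
          /*
           * The fix: replay decrements queued while in bootstrap mode.  Without
           * this loop the affected blocks keep an inflated refcount, i.e. the
           * leaked 4k metadata block described above.
           */
          for (i = 0; i < sm->nr_pending; i++)
              sm->refcount[sm->pending_dec[i]]--;
          sm->nr_pending = 0;
      }

      int main(void)
      {
          struct space_map sm = { .bootstrap = 1, .refcount = { [3] = 2 } };

          dec_block(&sm, 3);     /* queued, not applied */
          leave_bootstrap(&sm);  /* queued decrement applied here */
          printf("block 3 refcount: %u\n", sm.refcount[3]);  /* prints 1 */
          return 0;
      }
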
  4. 12 June 2015, 15 commits
    • md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync · ea358cd0
      Committed by NeilBrown
      MD_RECOVERY_DONE is normally cleared by md_check_recovery after a
      resync etc finished.  However it is possible for raid5_start_reshape
      to race and start a reshape before MD_RECOVERY_DONE is cleared.  This
      can lead to multiple reshapes running at the same time, which isn't
      good.
      
      So make sure it is cleared before starting a reshape, and also clear
      it when reaping a thread, just to be safe.
      Signed-off-by: NeilBrown <neilb@suse.de>
      ea358cd0
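
      A kernel-style sketch of the idea, not the literal patch: clear the stale
      flag before kicking off the reshape so the new sync thread cannot be
      mistaken for one that has already finished.

      /* sketch only; raid5_start_reshape() and the reap path differ in detail */
      static void start_reshape_sketch(struct mddev *mddev)
      {
          /* a previous resync may have set MD_RECOVERY_DONE and nobody cleared it yet */
          clear_bit(MD_RECOVERY_DONE, &mddev->recovery);

          set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
          set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
          /* ... register and wake mddev->sync_thread ... */
      }
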
    • md: Close race when setting 'action' to 'idle'. · 8e8e2518
      Committed by NeilBrown
      Checking ->sync_thread without holding the mddev_lock()
      isn't really safe, even after flushing the workqueue which
      ensures md_start_sync() has been run.
      
      While this code is waiting for the lock, md_check_recovery could reap
      the thread itself, and then start another thread (e.g. recovery might
      finish, then reshape starts).  When this thread gets the lock
      md_start_sync() hasn't run so it doesn't get reaped, but
      MD_RECOVERY_RUNNING gets cleared.  This allows two threads to start
      which leads to confusion.
      
      So don't bother if MD_RECOVERY_RUNNING isn't set, but if it is, do
      the flush and the test and the reap all under the mddev_lock to
      avoid any race with md_check_recovery.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Fixes: 6791875e ("md: make reconfig_mutex optional for writes to md sysfs files.")
      Cc: stable@vger.kernel.org (v4.0+)
      8e8e2518
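
      A sketch of the ordering described above; simplified, the real
      action_store() handles more cases:

      static int stop_sync_thread_sketch(struct mddev *mddev)
      {
          int err;

          if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
              return 0;  /* nothing is, or is about to be, running */

          /*
           * Take mddev_lock *before* flushing and inspecting ->sync_thread so
           * md_check_recovery() cannot reap this thread and start another one
           * in the window between the flush and the reap.
           */
          err = mddev_lock(mddev);
          if (err)
              return err;

          flush_workqueue(md_misc_wq);
          if (mddev->sync_thread) {
              set_bit(MD_RECOVERY_INTR, &mddev->recovery);
              md_reap_sync_thread(mddev);
          }
          mddev_unlock(mddev);
          return 0;
      }
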
    • md: don't return 0 from array_state_store · c008f1d3
      Committed by NeilBrown
      Returning zero from a 'store' function is bad.
      The return value should be either 'len', the length of the string,
      or an error.
      
      So use 'len' if 'err' is zero.
      
      Fixes: 6791875e ("md: make reconfig_mutex optional for writes to md sysfs files.")
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@vger.kernel.org (v4.0+)
      c008f1d3
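
      The general sysfs rule this fixes, as a sketch (do_the_update() is a
      hypothetical helper):

      static ssize_t array_state_store_sketch(struct mddev *mddev,
                                              const char *buf, size_t len)
      {
          int err = do_the_update(mddev, buf);  /* hypothetical */

          /*
           * Never return 0 from a store method: userspace's write() would see
           * zero bytes consumed and typically loop retrying.  Return the number
           * of bytes consumed on success, or a negative errno.
           */
          return err ?: len;
      }
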
    • dm thin metadata: fix a race when entering fail mode · b1f11aff
      Committed by Joe Thornber
      In dm_thin_find_block() the ->fail_io flag was checked outside the
      metadata device's root_lock, causing dm_thin_find_block() to race with
      the setting of this flag.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      b1f11aff
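
      The pattern, sketched with the dm-thin-metadata field names;
      do_btree_lookup() is a stand-in for the actual lookup:

      static int dm_thin_find_block_sketch(struct dm_pool_metadata *pmd,
                                           dm_block_t block,
                                           struct dm_thin_lookup_result *result)
      {
          int r;

          down_read(&pmd->root_lock);
          if (pmd->fail_io) {            /* now tested under root_lock */
              up_read(&pmd->root_lock);
              return -EINVAL;
          }
          r = do_btree_lookup(pmd, block, result);   /* hypothetical helper */
          up_read(&pmd->root_lock);

          return r;
      }
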
    • dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages · fd467696
      Committed by Mike Snitzer
      Use the EOPNOTSUPP, rather than EINVAL, error code when a user attempts to
      send the pool a message it cannot handle.  Otherwise userspace is led to
      believe the message failed due to an invalid argument.
      Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      fd467696
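
      A minimal sketch of the error-code choice; pool_cannot_handle_messages()
      and process_pool_message() are hypothetical:

      static int pool_message_sketch(struct pool *pool, unsigned argc, char **argv)
      {
          if (pool_cannot_handle_messages(pool))
              return -EOPNOTSUPP;  /* "not supported right now", not "bad argument" */

          return process_pool_message(pool, argc, argv);
      }
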
    • dm thin: range discard support · 34fbcf62
      Committed by Joe Thornber
      Previously REQ_DISCARD bios have been split into block sized chunks
      before submission to the thin target.  There are a couple of issues with
      this:
      
       - If the block size is small, a large discard request can
         get broken up into a great many bios which is both slow and causes
         a lot of memory pressure.
      
       - The thin pool block size and the discard granularity for the
         underlying data device need to be compatible if we want to passdown
         the discard.
      
      This patch relaxes the block size granularity for thin devices.  It
      makes use of the recent range locking added to the bio_prison to
      quiesce a whole range of thin blocks before unmapping them.  Once a
      thin range has been unmapped the discard can then be passed down to
      the data device for those sub ranges where the data blocks are no
      longer used (i.e. they weren't shared in the first place).
      
      This patch also doesn't make any apologies about open-coding portions
      of block core as a means of supporting async discard completions in the
      near-term -- if/when late bio splitting lands it'll all get cleaned up.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      34fbcf62
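
      A small user-space illustration of the first step in passing a discard
      down: rounding the bio's sector range inward to whole pool blocks, since
      partial blocks at either end cannot be unmapped.  Names and numbers are
      illustrative only:

      #include <stdint.h>
      #include <stdio.h>

      static void discard_block_range(uint64_t bio_begin, uint64_t bio_end,
                                      uint64_t block_sectors,
                                      uint64_t *block_begin, uint64_t *block_end)
      {
          /* round the start up and the end down to whole blocks */
          *block_begin = (bio_begin + block_sectors - 1) / block_sectors;
          *block_end   = bio_end / block_sectors;
          if (*block_end < *block_begin)
              *block_end = *block_begin;   /* discard smaller than one block */
      }

      int main(void)
      {
          uint64_t b, e;

          /* 1 MiB discard at sector 300 on a pool with 128-sector (64 KiB) blocks */
          discard_block_range(300, 300 + 2048, 128, &b, &e);
          printf("unmap thin blocks [%llu, %llu)\n",
                 (unsigned long long)b, (unsigned long long)e);
          return 0;
      }
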
    • dm thin metadata: add dm_thin_remove_range() · 6550f075
      Committed by Joe Thornber
      Removes a range of blocks from the btree.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      6550f075
    • dm thin metadata: add dm_thin_find_mapped_range() · a5d895a9
      Committed by Joe Thornber
      Retrieve the next run of contiguously mapped blocks.  Useful for working
      out where to break up IO.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      a5d895a9
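
      A hedged usage sketch of the calling pattern; the exact prototype lives in
      drivers/md/dm-thin-metadata.h, the argument layout below is an assumption,
      and process_run() is hypothetical:

      static void walk_mapped_runs_sketch(struct dm_thin_device *td,
                                          dm_block_t begin, dm_block_t end)
      {
          dm_block_t mapped_begin, mapped_end, pool_begin;
          bool maybe_shared;

          while (!dm_thin_find_mapped_range(td, begin, end,
                                            &mapped_begin, &mapped_end,
                                            &pool_begin, &maybe_shared)) {
              /* handle the contiguous mapped run [mapped_begin, mapped_end) */
              process_run(td, mapped_begin, mapped_end, pool_begin, maybe_shared);
              begin = mapped_end;   /* resume the search after this run */
          }
      }
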
    • dm btree: add dm_btree_remove_leaves() · 4ec331c3
      Committed by Joe Thornber
      Removes a range of leaf values from the tree.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      4ec331c3
    • dm stats: Use kvfree() in dm_kvfree() · 0f24b79b
      Committed by Pekka Enberg
      Use kvfree() instead of open-coding it.
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      0f24b79b
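
      The before/after shape of such a cleanup, as a kernel-style sketch (the
      real dm_kvfree() in dm-stats.c also carries some extra accounting):

      /* open-coded variant being replaced */
      static void kvfree_open_coded(void *ptr)
      {
          if (is_vmalloc_addr(ptr))
              vfree(ptr);
          else
              kfree(ptr);
      }

      /* kvfree() already performs exactly this check */
      static void kvfree_cleaned_up(void *ptr)
      {
          kvfree(ptr);
      }
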
    • dm cache: age and write back cache entries even without active IO · fba10109
      Committed by Joe Thornber
      The policy tick() method is normally called from interrupt context.
      Both the mq and smq policies do some bottom half work for the tick
      method in their map functions.  However if no IO is going through the
      cache, then that bottom half work doesn't occur.  With these policies
      this means recently hit entries do not age and do not get written
      back as early as we'd like.
      
      Fix this by introducing a new 'can_block' parameter to the tick()
      method.  When this is set the bottom half work occurs immediately.
      'can_block' is set when the tick method is called every second by the
      core target (not in interrupt context).
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      fba10109
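
      A sketch of the can_block idea with hypothetical structure and helper
      names; the real mq/smq tick methods differ in detail:

      static void policy_tick_sketch(struct policy_sketch *p, bool can_block)
      {
          unsigned long flags;

          spin_lock_irqsave(&p->tick_lock, flags);
          p->pending_ticks++;                 /* always safe, even in IRQ context */
          spin_unlock_irqrestore(&p->tick_lock, flags);

          if (can_block) {
              /* called from the core target's once-a-second worker, may sleep */
              mutex_lock(&p->lock);
              age_and_queue_writebacks(p);    /* hypothetical bottom-half work */
              mutex_unlock(&p->lock);
          }
          /* otherwise the same work happens lazily from the map path */
      }
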
    • dm cache: prefix all DMERR and DMINFO messages with cache device name · b61d9509
      Committed by Mike Snitzer
      Having the DM device name associated with the ERR or INFO message is
      very helpful.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      b61d9509
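
      The pattern, sketched; treat the body of cache_device_name() as an
      assumption about how the name is looked up:

      static const char *cache_device_name(struct cache *cache)
      {
          return dm_device_name(dm_table_get_md(cache->ti->table));
      }

      static void report_commit_failure(struct cache *cache, int r)
      {
          DMERR("%s: commit failed, error %d", cache_device_name(cache), r);
      }
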
    • dm cache: add fail io mode and needs_check flag · 028ae9f7
      Committed by Joe Thornber
      If a cache metadata operation fails (e.g. transaction commit) the
      cache's metadata device will abort the current transaction, set a new
      needs_check flag, and the cache will transition to "read-only" mode.  If
      aborting the transaction or setting the needs_check flag fails the cache
      will transition to "fail-io" mode.
      
      Once needs_check is set the cache device will not be allowed to
      activate.  Activation requires write access to metadata.  Future work is
      needed to add proper support for running the cache in read-only mode.
      
      Once in fail-io mode the cache will report a status of "Fail".
      
      Also, add a commit() wrapper that will disallow commits if in read_only or
      fail mode.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      028ae9f7
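
      A sketch of the commit() wrapper and mode ordering described above; the
      enum values and helper names are illustrative, not the exact dm-cache code:

      enum cache_metadata_mode_sketch {
          CM_WRITE,       /* normal operation */
          CM_READ_ONLY,   /* a metadata op failed; needs_check set, no more commits */
          CM_FAIL,        /* couldn't even record needs_check; fail all io */
      };

      static int commit_sketch(struct cache *cache, bool clean_shutdown)
      {
          int r;

          if (get_cache_mode(cache) >= CM_READ_ONLY)
              return -EINVAL;     /* commits disallowed in read-only/fail mode */

          r = dm_cache_commit(cache->cmd, clean_shutdown);
          if (r)
              metadata_operation_failed(cache, "dm_cache_commit", r);  /* degrade mode */

          return r;
      }
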
    • dm cache: wake the worker thread every time we free a migration object · 88bf5184
      Committed by Joe Thornber
      When the cache is idle, writeback work was previously only issued once
      per second.  With this change outstanding writebacks are streamed
      constantly, which improves writeback performance.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      88bf5184
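
      Sketch of the change; field and helper names follow dm-cache-target.c from
      memory and should be treated as assumptions:

      static void free_migration_sketch(struct dm_cache_migration *mg)
      {
          struct cache *cache = mg->cache;

          mempool_free(mg, cache->migration_pool);
          wake_worker(cache);   /* don't wait for the next 1s tick to issue writeback */
      }
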
    • dm cache: add stochastic-multi-queue (smq) policy · 66a63635
      Committed by Joe Thornber
      The stochastic-multi-queue (smq) policy addresses some of the problems
      with the current multiqueue (mq) policy.
      
      Memory usage
      ------------
      
      The mq policy uses a lot of memory; 88 bytes per cache block on a
      64-bit machine.
      
      SMQ uses 28-bit indexes to implement its data structures rather than
      pointers.  It avoids storing an explicit hit count for each block.  It
      has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
      the entries (each hotspot block covers a larger area than a single
      cache block).

      All of this means smq uses ~25 bytes per cache block.  Still a lot of
      memory, but a substantial improvement nonetheless.
      
      Level balancing
      ---------------
      
      MQ places entries in different levels of the multiqueue structures
      based on their hit count (~ln(hit count)).  This means the bottom
      levels generally have the most entries, and the top ones have very
      few.  Having unbalanced levels like this reduces the efficacy of the
      multiqueue.
      
      SMQ does not maintain a hit count; instead it swaps hit entries with
      the least recently used entry from the level above.  The overall
      ordering is a side effect of this stochastic process.  With this
      scheme we can decide how many entries occupy each multiqueue level,
      resulting in better promotion/demotion decisions.
      
      Adaptability
      ------------
      
      The MQ policy maintains a hit count for each cache block.  For a
      different block to get promoted to the cache its hit count has to
      exceed the lowest currently in the cache.  This means it can take a
      long time for the cache to adapt between varying IO patterns.
      Periodically degrading the hit counts could help with this, but I
      haven't found a nice general solution.
      
      SMQ doesn't maintain hit counts, so a lot of this problem just goes
      away.  In addition it tracks performance of the hotspot queue, which
      is used to decide which blocks to promote.  If the hotspot queue is
      performing badly then it starts moving entries more quickly between
      levels.  This lets it adapt to new IO patterns very quickly.
      
      Performance
      -----------
      
      In my tests SMQ shows substantially better performance than MQ.  Once
      this matures a bit more I'm sure it'll become the default policy.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      66a63635
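
      A toy user-space model of the level-balancing idea only (the kernel
      implementation is very different): a hit promotes an entry by swapping it
      with the least recently used entry of the level above, so level sizes stay
      fixed without per-entry hit counts.

      #include <stdio.h>

      #define NR_LEVELS 4
      #define PER_LEVEL 4

      static int levels[NR_LEVELS][PER_LEVEL] = {
          { 1, 2, 3, 4 },       /* level 0: coldest */
          { 5, 6, 7, 8 },
          { 9, 10, 11, 12 },
          { 13, 14, 15, 16 },   /* level 3: hottest */
      };

      /* index 0 of each level is treated as that level's LRU slot */
      static void hit(int level, int idx)
      {
          int tmp;

          if (level == NR_LEVELS - 1)
              return;                        /* already at the top */

          tmp = levels[level][idx];
          levels[level][idx] = levels[level + 1][0];   /* demote the LRU above */
          levels[level + 1][0] = tmp;                  /* promote the hit entry */
      }

      int main(void)
      {
          hit(0, 2);   /* entry 3 is hit: it swaps with entry 5 */

          for (int l = NR_LEVELS - 1; l >= 0; l--) {
              printf("level %d:", l);
              for (int i = 0; i < PER_LEVEL; i++)
                  printf(" %2d", levels[l][i]);
              printf("\n");
          }
          return 0;
      }
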
  5. 02 June 2015, 2 commits
    • writeback: separate out include/linux/backing-dev-defs.h · 66114cad
      Committed by Tejun Heo
      With the planned cgroup writeback support, backing-dev related
      declarations will be more widely used across block and cgroup;
      unfortunately, including backing-dev.h from include/linux/blkdev.h
      makes a cyclic include dependency quite likely.
      
      This patch separates out backing-dev-defs.h, which only has the
      essential definitions, and updates blkdev.h to include it.  .c files
      which need access to more backing-dev details now include
      backing-dev.h directly.  This takes backing-dev.h off the common
      include dependency chain, making it a lot easier to use across block
      and cgroup.
      
      v2: fs/fat build failure fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      66114cad
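
      A generic illustration of the "-defs.h" split, not the actual kernel
      headers: keep bare type definitions in a cheap header that widely-included
      headers can pull in, and reserve the full header, with its heavier
      includes and inline helpers, for the .c files that need it.

      /* foo-defs.h: only the definitions, no heavy includes */
      #ifndef FOO_DEFS_H
      #define FOO_DEFS_H
      struct foo {
          unsigned long state;
      };
      #endif

      /* foo.h: the full API; safe for .c files, risky for core headers */
      #ifndef FOO_H
      #define FOO_H
      #include "foo-defs.h"
      /* ...plus whatever other headers the helpers below require... */
      static inline int foo_is_active(const struct foo *f)
      {
          return f->state != 0;
      }
      #endif
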
    • writeback: move backing_dev_info->state into bdi_writeback · 4452226e
      Committed by Tejun Heo
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->state into wb.
      
      * enum bdi_state is renamed to wb_state and the prefix of all enums is
        changed from BDI_ to WB_.
      
      * Explicit zeroing of bdi->state is removed without adding zeroing of
        wb->state as the whole data structure is zeroed on init anyway.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->state are mechanically replaced with bdi->wb.state
        introducing no behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: drbd-dev@lists.linbit.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4452226e
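
      The mechanical shape of the conversion, sketched; behaviour is unchanged
      while each bdi still embeds a single wb:

      static bool bdi_registered_sketch(struct backing_dev_info *bdi)
      {
          /* old form: test_bit(BDI_registered, &bdi->state) */
          return test_bit(WB_registered, &bdi->wb.state);
      }
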
  6. 30 May 2015, 13 commits