1. 12 Oct, 2015 5 commits
    • G
      md-cluster: Perform a lazy update · 2aa82191
      Goldwyn Rodrigues committed
      In a clustered environment, a change such as marking a device faulty
      can be recorded by any of the nodes. The change is communicated to all
      the nodes, so re-recording it is unnecessary and quite often
      pretty disruptive.
      
      With this patch, just before the update, we check for such changes,
      and if they are already in the superblock, we abort the update
      after clearing all the flags.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
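The lazy-update check described in this commit can be sketched in user-space C. This is only an illustration under stated assumptions: `demo_sb`, `demo_array`, `demo_update_needed` and `DEMO_ROLE_FAULTY` are hypothetical names, and the comparison is reduced to per-device roles rather than the full set of fields the kernel examines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_DEVS 4
#define DEMO_ROLE_FAULTY 0xfffe /* illustrative marker value (assumption) */

/* Hypothetical on-disk superblock: one role per device. */
struct demo_sb {
    unsigned short dev_roles[DEMO_MAX_DEVS];
};

/* Hypothetical in-memory state of the node that wants to write. */
struct demo_array {
    unsigned short roles[DEMO_MAX_DEVS]; /* what this node wants recorded */
    unsigned long flags;                 /* pending-change flags */
};

/*
 * Returns true if the superblock still has to be written.
 * If another node already recorded the same state, the pending
 * flags are cleared and the update is skipped.
 */
static bool demo_update_needed(struct demo_array *a,
                               const struct demo_sb *on_disk)
{
    assert(a && on_disk);
    for (size_t i = 0; i < DEMO_MAX_DEVS; i++)
        if (a->roles[i] != on_disk->dev_roles[i])
            return true;
    a->flags = 0;
    return false;
}
```

A caller would re-read the superblock just before writing and skip the write when `demo_update_needed()` returns false.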
    • G
      md-cluster: Improve md_reload_sb to be less error prone · 70bcecdb
      Goldwyn Rodrigues committed
      md_reload_sb is too simplistic and it explicitly needs to determine
      the changes made by the writing node. However, there are multiple areas
      where a simple reload could fail.
      
      Instead, read the superblock of one of the "good" rdevs and update
      the necessary information:
      
      - read the superblock into a newly allocated page, by temporarily
        swapping out rdev->sb_page and calling ->load_super.
      - if that fails return
      - if it succeeds, call check_sb_changes
        1. iterates over list of active devices and checks the matching
           dev_roles[] value.
           If that is 'faulty', the device must be marked as faulty:
           - call md_error to mark the device as faulty. Make sure
             not to set CHANGE_DEVS and wake up mddev->thread, or else
             it would initiate a resync process, which is the responsibility
             of the "primary" node.
           - clear the Blocked bit
           - call remove_and_add_spares() to hot remove the device.
           If the device is 'spare':
           - call remove_and_add_spares() to get the number of spares
             added in this operation.
           - reduce mddev->degraded to mark the array as not degraded.
        2. reset recovery_cp
      - read the rest of the rdevs to update recovery_offset. If recovery_offset
        is equal to MaxSector, call spare_active() to set it In_sync
      
      This required that recovery_offset be initialized to MaxSector, as
      opposed to zero so as to communicate the end of sync for a rdev.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    • G
      md: remove_and_add_spares() to activate specific rdev · 2910ff17
      Goldwyn Rodrigues committed
      remove_and_add_spares() checks all devices when activating spares.
      Change it to activate a specific device if a non-null rdev
      argument is passed.
      
      remove_and_add_spares() can be used to activate spares in
      slot_store() as well.
      
      For hot_remove_disk(), check if rdev->raid_disk == -1 before
      calling remove_and_add_spares()
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
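The "all devices, or one specific device" behaviour can be illustrated with a small user-space sketch. The names `demo_rdev` and `demo_add_spares` are hypothetical and the slot assignment is deliberately naive; the real remove_and_add_spares() also handles removal and personality callbacks, which are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a member device; not the kernel's md_rdev. */
struct demo_rdev {
    int raid_disk;    /* slot in the array, or -1 if not active */
    bool spare_ready; /* device is usable as a spare */
};

/*
 * Activate spares. If 'only' is non-NULL, consider just that device;
 * otherwise consider every device. Returns the number activated.
 */
static int demo_add_spares(struct demo_rdev *devs, size_t n,
                           const struct demo_rdev *only, int *next_slot)
{
    int added = 0;

    assert(devs && next_slot);
    for (size_t i = 0; i < n; i++) {
        struct demo_rdev *rdev = &devs[i];

        if (only && rdev != only)
            continue; /* caller asked for one specific device */
        if (rdev->raid_disk >= 0 || !rdev->spare_ready)
            continue; /* already active, or not a usable spare */
        rdev->raid_disk = (*next_slot)++;
        added++;
    }
    return added;
}
```

Passing NULL for `only` keeps the original scan-everything behaviour, so existing callers in such a design would not need to change.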
    • G
      md-cluster: Use a small window for resync · c40f341f
      Goldwyn Rodrigues committed
      Suspending the entire device for resync could take too long. Resync
      in small chunks.
      
      The cluster's resync window (32M) is maintained in r1conf as
      cluster_sync_low and cluster_sync_high and processed in
      raid1's sync_request(). If the current resync is outside the cluster
      resync window:
      
      1. Set the cluster_sync_low to curr_resync_completed.
      2. Check if the sync will fit in the new window, if not issue a
         wait_barrier() and set cluster_sync_low to sector_nr.
      3. Set cluster_sync_high to cluster_sync_low + resync_window.
      4. Send a message to all nodes so they may add it in their suspension
         list.
      
      bitmap_cond_end_sync is modified to allow forcing a sync in order
      to bring curr_resync_completed up to date with the sector passed.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
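The four window-update steps listed in this commit can be sketched in user-space C. This is a minimal illustration, not the code in raid1's sync_request(): `demo_conf`, `demo_advance_window` and `DEMO_WINDOW_SECTORS` are hypothetical names, the 32M window is assumed to mean 32 MiB counted in 512-byte sectors, and the wait_barrier() call and the cluster message are reduced to a comment and a return value.

```c
#include <assert.h>

typedef unsigned long long demo_sector_t;

/* Assumption: 32 MiB window expressed in 512-byte sectors. */
#define DEMO_WINDOW_SECTORS (32ULL * 1024 * 1024 / 512)

struct demo_conf {
    demo_sector_t sync_low;  /* start of the suspended window */
    demo_sector_t sync_high; /* end of the suspended window */
};

/*
 * Returns 1 if the window moved (the caller would then notify the
 * other nodes), 0 if the request already lies inside the window.
 */
static int demo_advance_window(struct demo_conf *c, demo_sector_t sector_nr,
                               demo_sector_t nr_sectors,
                               demo_sector_t resync_completed)
{
    assert(c);
    if (sector_nr >= c->sync_low && sector_nr + nr_sectors <= c->sync_high)
        return 0;

    /* Step 1: start the new window at the completed position. */
    c->sync_low = resync_completed;
    /* Step 2: if the request does not fit, start at sector_nr instead
     * (the kernel would wait for in-flight I/O before doing so). */
    if (sector_nr + nr_sectors > c->sync_low + DEMO_WINDOW_SECTORS)
        c->sync_low = sector_nr;
    /* Step 3: the window has a fixed size. */
    c->sync_high = c->sync_low + DEMO_WINDOW_SECTORS;
    /* Step 4: the caller sends the new range to the other nodes. */
    return 1;
}
```

Keeping the window small means other nodes only suspend writes to a short range at a time, which is the motivation given in the commit message.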
    • G
      md: Increment version for clustered bitmaps · 3c462c88
      Goldwyn Rodrigues committed
      Add BITMAP_MAJOR_CLUSTERED as 5, in order to prevent older kernels
      from assembling a clustered device.
      
      In order to maximize compatibility, the major version is set to
      BITMAP_MAJOR_CLUSTERED *only* if the bitmap is clustered.
      
      Added MD_FEATURE_CLUSTERED in order to return error for older
      kernels which would assemble MD even if the bitmap is corrupted.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
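The version gating can be illustrated as follows. Only the value 5 for the clustered major comes from the commit message; the lower bounds 3 and 4, and the names `demo_bitmap_version` and `demo_can_assemble`, are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed pre-existing range of supported bitmap major versions. */
#define DEMO_BITMAP_MAJOR_LO 3
#define DEMO_BITMAP_MAJOR_HI 4
/* Value stated in the commit message. */
#define DEMO_BITMAP_MAJOR_CLUSTERED 5

static_assert(DEMO_BITMAP_MAJOR_CLUSTERED > DEMO_BITMAP_MAJOR_HI,
              "clustered major must be above what older kernels accept");

/* Use the new major version only when the bitmap is clustered. */
static int demo_bitmap_version(bool clustered)
{
    return clustered ? DEMO_BITMAP_MAJOR_CLUSTERED : DEMO_BITMAP_MAJOR_HI;
}

/* A kernel accepts versions from LO up to the highest one it knows. */
static bool demo_can_assemble(int version, int highest_supported)
{
    return version >= DEMO_BITMAP_MAJOR_LO && version <= highest_supported;
}
```

With this scheme a non-clustered bitmap stays readable by an older kernel, while a clustered one is rejected by it.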
  2. 02 Oct, 2015 2 commits
    • S
      md: clear CHANGE_PENDING in readonly array · d4929add
      Shaohua Li committed
      If an array has more faulty disks than the allowed degraded number, the
      array enters error handling. It will be marked as read-only with
      MD_CHANGE_PENDING/RECOVERY_NEEDED set. But currently recovery doesn't
      clear the CHANGE_PENDING bit for a read-only array.  If MD_CHANGE_PENDING is
      set for a raid5 array, all returned IO will be held on a list until the
      bit is cleared. But recovery never clears this bit, so the IO is always in
      the pending state and never finishes. This has bad effects: the upper layer
      can't get an IO error and the array can't be stopped.
      
      Fixes: c3cce6cd ("md/raid5: ensure device failure recorded before write request returns.")
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • N
      md: wait for pending superblock updates before switching to read-only · 88724bfa
      NeilBrown committed
      If a superblock update is pending, wait for it to complete before
      letting md_set_readonly() switch to readonly.
      Otherwise we might lose important information about a device having
      failed.
      
      For external arrays, waiting for superblock updates can wait on
      user-space, so in that case, just return an error.
      Reported-and-tested-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
  3. 01 Sep, 2015 9 commits
    • N
      md/raid1: ensure device failure recorded before write request returns. · 55ce74d4
      NeilBrown committed
      When a write to one of the legs of a RAID1 fails, the failure is
      recorded in the metadata of the other leg(s) so that after a restart
      the data on the failed drive won't be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Similarly when we record a bad-block in response to a write failure,
      we must not let the write complete until the bad-block update is safe.
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that experienced an error on a new queue which
         is only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NeilBrown <neilb@suse.com>
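The interlock listed above (flag the pending update, park failed writes, complete them afterwards) can be sketched in user-space C. All names here (`demo_r1conf`, `demo_write_failed`, `demo_metadata_written`, `demo_complete_queued`) are hypothetical, requests are plain integers, and locking is omitted; it shows the ordering only, not the raid1 implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_QUEUE_MAX 8

/* Hypothetical per-array state; not the kernel's r1conf. */
struct demo_r1conf {
    bool change_pending;        /* metadata update requested, not yet done */
    int queued[DEMO_QUEUE_MAX]; /* writes that saw an error */
    size_t nr_queued;
    size_t nr_completed;        /* writes reported back to the caller */
};

/* A write hit an error: request a metadata update and park the write. */
static bool demo_write_failed(struct demo_r1conf *conf, int req_id)
{
    assert(conf);
    if (conf->nr_queued == DEMO_QUEUE_MAX)
        return false;
    conf->change_pending = true;
    conf->queued[conf->nr_queued++] = req_id;
    return true;
}

/* Called once the failure has been recorded in the metadata. */
static void demo_metadata_written(struct demo_r1conf *conf)
{
    assert(conf);
    conf->change_pending = false;
}

/* Complete parked writes, but only after the metadata update is safe. */
static size_t demo_complete_queued(struct demo_r1conf *conf)
{
    size_t n;

    assert(conf);
    if (conf->change_pending)
        return 0;
    n = conf->nr_queued;
    conf->nr_completed += n;
    conf->nr_queued = 0;
    return n;
}
```

The point of the ordering is that no parked write is reported as finished while `change_pending` is still set.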
    • N
      md: extend spinlock protection in register_md_cluster_operations · 6022e75b
      NeilBrown committed
      This code looks racy.
      
      The only possible race is if two modules try to register at the same
      time and that won't happen.  But make the code look safe anyway.
      Signed-off-by: NeilBrown <neilb@suse.com>
    • G
      md-cluster: transfer the resync ownership to another node · dc737d7c
      Guoqing Jiang committed
      When node A stops an array while the array is doing a resync, we need
      to let another node B take over the resync task.
      
      To achieve this, node A needs to send an explicit BITMAP_NEEDS_SYNC
      message to the cluster. Node B, on receiving that message, will
      invoke __recover_slot to do the resync.
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • S
      md: setup safemode_timer before it's being used · 25b2edfa
      Sasha Levin committed
      We used to set up the safemode_timer timer in md_run. If md_run
      would fail before the timer was set up we'd end up trying to modify
      a timer that doesn't have a callback function when we access safe_delay_store,
      which would trigger a BUG.
      
      neilb: delete init_timer() call as setup_timer() does that.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • N
      md: sync sync_completed has correct value as recovery finishes. · 5ed1df2e
      NeilBrown committed
      There can be a small window between the moment that recovery
      actually writes the last block and the time when various sysfs
      and /proc/mdstat attributes report that it has finished.
      During this time, 'sync_completed' can have the wrong value.
      This can confuse monitoring software.
      
      So:
       - don't set curr_resync_completed beyond the end of the devices,
       - set it correctly when resync/recovery has completed.
      Signed-off-by: NeilBrown <neilb@suse.com>
    • N
      md: be careful when testing resync_max against curr_resync_completed. · c5e19d90
      NeilBrown committed
      While it generally shouldn't happen, it is not impossible for
      curr_resync_completed to exceed resync_max.
      This can particularly happen when reshaping RAID5 - the current
      status isn't copied to curr_resync_completed promptly, so when it
      is, it can exceed resync_max.
      This happens when the reshape is 'frozen', resync_max is set low,
      and reshape is re-enabled.
      
      Taking a difference between two unsigned numbers is always dangerous
      anyway, so add a test to behave correctly if
         curr_resync_completed > resync_max
      Signed-off-by: NeilBrown <neilb@suse.com>
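The hazard with subtracting unsigned values can be shown with a short sketch. `demo_sector_t` and `demo_remaining` are hypothetical names; the guard mirrors the test described in the commit message but is not the kernel code.

```c
#include <assert.h>

typedef unsigned long long demo_sector_t;

static_assert((demo_sector_t)-1 > 0, "demo_sector_t must be unsigned");

/*
 * Sectors left before resync_max is reached. Without the guard,
 * resync_max - completed would wrap around to a huge value whenever
 * completed > resync_max.
 */
static demo_sector_t demo_remaining(demo_sector_t resync_max,
                                    demo_sector_t completed)
{
    if (completed > resync_max)
        return 0;
    return resync_max - completed;
}
```

The unguarded subtraction of 100 from 40 in this type yields a value far larger than either operand, which is why an explicit comparison is safer than relying on the difference.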
    • N
      md: set MD_RECOVERY_RECOVER when starting a degraded array. · a4a3d26d
      NeilBrown committed
      This ensures that 'sync_action' will show 'recover' immediately the
      array is started.  If there is no spare the status will change to
      'idle' once that is detected.
      
      Clear MD_RECOVERY_RECOVER for a read-only array to ensure this change
      happens.
      
      This allows scripts which monitor status not to get confused -
      particularly my test scripts.
      Signed-off-by: NeilBrown <neilb@suse.com>
    • N
      md: close some races between setting and checking sync_action. · 985ca973
      NeilBrown committed
      When checking sync_action in a script, we want to be sure it is
      as accurate as possible.
      As resync/reshape etc doesn't always start immediately (a separate
      thread is scheduled to do it), it is best if 'action_show'
      checks if MD_RECOVERY_NEEDED is set (which it does) and in that
      case reports what is likely to start soon (which it only sometimes
      does).
      
      So:
       - report 'reshape' if reshape_position suggests one might start.
       - set MD_RECOVERY_RECOVER in raid1_reshape(), because that is very
         likely to happen next.
      Signed-off-by: NeilBrown <neilb@suse.com>
    • N
      md: Keep /proc/mdstat reporting recovery until fully DONE. · f7851be7
      NeilBrown committed
      Currently when a recovery completes, mdstat shows that it has finished
      before the new device is marked as a full member.  Because of this it
      can appear to a script that the recovery finished but the array isn't
      in sync.
      
      So while MD_RECOVERY_DONE is still set, keep mdstat reporting "recovery".
      Once md_reap_sync_thread() completes, the spare will be active and then
      MD_RECOVERY_DONE will be cleared.
      
      To ensure this is race-free, set MD_RECOVERY_DONE before clearing
      curr_resync.
      Signed-off-by: NeilBrown <neilb@suse.com>
  4. 14 Aug, 2015 2 commits
    • K
      block: kill merge_bvec_fn() completely · 8ae12666
      Kent Overstreet committed
      As generic_make_request() is now able to handle arbitrarily sized bios,
      it's no longer necessary for each individual block driver to define its
      own ->merge_bvec_fn() callback. Remove every invocation completely.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: ceph-devel@vger.kernel.org
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
       dm-era-target, and resolve merge conflicts]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • K
      block: make generic_make_request handle arbitrarily sized bios · 54efd50b
      Kent Overstreet committed
      The way the block layer is currently written, it goes to great lengths
      to avoid having to split bios; upper layer code (such as bio_add_page())
      checks what the underlying device can handle and tries to always create
      bios that don't need to be split.
      
      But this approach becomes unwieldy and eventually breaks down with
      stacked devices and devices with dynamic limits, and it adds a lot of
      complexity. If the block layer could split bios as needed, we could
      eliminate a lot of complexity elsewhere - particularly in stacked
      drivers. Code that creates bios can then create whatever size bios are
      convenient, and more importantly stacked drivers don't have to deal with
      both their own bio size limitations and the limitations of the
      (potentially multiple) devices underneath them.  In the future this will
      let us delete merge_bvec_fn and a bunch of other code.
      
      We do this by adding calls to blk_queue_split() to the various
      make_request functions that need it - a few can already handle arbitrary
      size bios. Note that we add the call _after_ any call to
      blk_queue_bounce(); this means that blk_queue_split() and
      blk_recalc_rq_segments() don't need to be concerned with bouncing
      affecting segment merging.
      
      Some make_request_fn() callbacks were simple enough to audit and verify
      they don't need blk_queue_split() calls. The skipped ones are:
      
       * nfhd_make_request (arch/m68k/emu/nfblock.c)
       * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
       * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
       * brd_make_request (ramdisk - drivers/block/brd.c)
       * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
       * loop_make_request
       * null_queue_bio
       * bcache's make_request fns
      
      Some others are almost certainly safe to remove now, but will be left
      for future patches.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Jim Paris <jim@jtan.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
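The idea of splitting an oversized request at the point where the limit is known, instead of constraining whoever builds it, can be sketched in user-space C. This is a simplified illustration and not how blk_queue_split() is implemented: `demo_chunk` and `demo_split` are hypothetical names, a request is reduced to a start sector and a length, and the only limit modelled is a maximum length per piece.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical piece of a larger request. */
struct demo_chunk {
    unsigned long long start; /* first sector of the piece */
    unsigned int len;         /* length of the piece in sectors */
};

/*
 * Split [start, start + len) into pieces of at most max_len sectors.
 * Writes at most out_cap pieces to 'out' and returns how many were
 * written.
 */
static size_t demo_split(unsigned long long start, unsigned int len,
                         unsigned int max_len,
                         struct demo_chunk *out, size_t out_cap)
{
    size_t n = 0;

    assert(out && max_len > 0);
    while (len > 0 && n < out_cap) {
        unsigned int piece = len < max_len ? len : max_len;

        out[n].start = start;
        out[n].len = piece;
        n++;
        start += piece;
        len -= piece;
    }
    return n;
}
```

With splitting done in one place, code that creates requests can choose whatever size is convenient, which is the simplification the commit message argues for.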
  5. 03 Aug, 2015 2 commits
    • B
      md: simplify get_bitmap_file now that "file" is zeroed. · 25eafe1a
      Benjamin Randazzo committed
      There is no point assigning '\0' to file->pathname[0] as
      file is now zeroed out, so remove that branch and
      simplify the code.
      
      [Original patch combined this with the change to use
       kzalloc.  I split the two so that the change to kzalloc
       is easier to backport. - neilb]
      Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • B
      md: use kzalloc() when bitmap is disabled · b6878d9e
      Benjamin Randazzo committed
      In drivers/md/md.c get_bitmap_file() uses kmalloc() for creating a
      mdu_bitmap_file_t called "file".
      
              file = kmalloc(sizeof(*file), GFP_NOIO);
              if (!file)
                      return -ENOMEM;
      
      This structure is copied to user space at the end of the function.
      
              if (err == 0 &&
                  copy_to_user(arg, file, sizeof(*file)))
                      err = -EFAULT;
      
      But if the bitmap is disabled, only the first byte of "file" is initialized
      with zero, so it's possible to read some bytes (up to 4095) of kernel
      space memory from user space. This is an information leak.
      
              /* bitmap disabled, zero the first byte and copy out */
              if (!mddev->bitmap_info.file)
                      file->pathname[0] = '\0';
      Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr>
      Signed-off-by: NeilBrown <neilb@suse.com>
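A user-space analogue of the fix is sketched below, using calloc() where the kernel patch uses kzalloc(). The names `demo_bitmap_file` and `demo_get_bitmap_file` are hypothetical, and the 4096-byte pathname size is an assumption inferred from the "up to 4095" bytes mentioned above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PATH_MAX 4096 /* assumed size of the pathname buffer */

/* Hypothetical stand-in for the structure copied out to the caller. */
struct demo_bitmap_file {
    char pathname[DEMO_PATH_MAX];
};

static_assert(sizeof(struct demo_bitmap_file) == DEMO_PATH_MAX,
              "the whole structure is handed to the caller");

/*
 * Allocate the reply zero-filled, so that bytes never written below
 * cannot carry stale heap contents. 'path' may be NULL when no bitmap
 * file is configured. The caller frees the result.
 */
static struct demo_bitmap_file *demo_get_bitmap_file(const char *path)
{
    struct demo_bitmap_file *file = calloc(1, sizeof(*file));

    if (!file)
        return NULL;
    if (path) /* buffer is zeroed, so the copy stays NUL-terminated */
        strncpy(file->pathname, path, sizeof(file->pathname) - 1);
    return file;
}
```

Zeroing the whole allocation removes the need to special-case the "bitmap disabled" path, which is what the follow-up commit 25eafe1a simplifies.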
  6. 29 Jul, 2015 1 commit
    • C
      block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig committed
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not being persistent
      when bios are queued up, and of not being passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
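Storing the error in the request itself can be sketched as follows. This is a user-space illustration with hypothetical names (`demo_bio`, `demo_bio_endio`, `demo_record`); it shows an errno value persisting in the structure and being handed from a child to its parent, and is not the block layer's bio_endio().

```c
#include <assert.h>
#include <errno.h> /* errno values such as EIO, stored negated */
#include <stddef.h>

/* Hypothetical, heavily reduced request structure. */
struct demo_bio {
    int bi_error;                         /* 0 or a negative errno */
    struct demo_bio *parent;              /* next bio in the chain, if any */
    void (*bi_end_io)(struct demo_bio *); /* completion callback */
};

/* Example callback: remembers the error it was completed with. */
static int demo_last_error;

static void demo_record(struct demo_bio *bio)
{
    demo_last_error = bio->bi_error;
}

/*
 * Complete a bio and walk up the chain. The callback takes only the
 * bio, because the error travels inside it; the first error seen is
 * copied to the parent before the parent is completed.
 */
static void demo_bio_endio(struct demo_bio *bio)
{
    assert(bio);
    while (bio) {
        struct demo_bio *parent = bio->parent;
        int err = bio->bi_error;

        if (bio->bi_end_io)
            bio->bi_end_io(bio);
        if (parent && err && !parent->bi_error)
            parent->bi_error = err;
        bio = parent;
    }
}
```

Because the value lives in the structure, a queued or chained request keeps its error until someone looks at it.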
  7. 24 Jul, 2015 1 commit
  8. 22 Jul, 2015 1 commit
  9. 26 Jun, 2015 1 commit
  10. 25 Jun, 2015 3 commits
    • N
      md: clear Blocked flag on failed devices when array is read-only. · ab16bfc7
      Neil Brown committed
      The Blocked flag indicates that a device has failed but that this
      fact hasn't been recorded in the metadata yet.  Writes to such
      devices cannot be allowed until the metadata has been updated.
      
      On a read-only array, the Blocked flag will never be cleared.
      This prevents the device being removed from the array.
      
      If the metadata is being handled by the kernel
      (i.e. !mddev->external), then we can be sure that if the array is
      switched to writable, then a metadata update will happen and will
      record the failure.  So we don't need the flag set.
      
      If metadata is externally managed, it is up to the external manager
      to clear the 'blocked' flag.
      Reported-by: XiaoNi <xni@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • N
      md: unlock mddev_lock on an error path. · 9a8c0fa8
      NeilBrown committed
      This error path returns while still holding the lock - bad.
      
      Fixes: 6791875e ("md: make reconfig_mutex optional for writes to md sysfs files.")
      Cc: stable@vger.kernel.org (v4.0+)
      Signed-off-by: NeilBrown <neilb@suse.com>
    • N
      md: clear mddev->private when it has been freed. · bd691922
      NeilBrown committed
      If ->private is set when ->run is called, it is assumed to be
      a 'config'  prepared as part of 'reshape'.
      
      So it is important when we free that config, that we also clear ->private.
      This is not often a problem as the mddev will normally be discarded
      shortly after the config is freed.
      However if an 'assemble' races with a final close, the assemble can use
      the old mddev which has a stale ->private.  This leads to any of
      various sorts of crashes.
      
      So clear ->private after calling ->free().
      Reported-by: Nate Clark <nate@neworld.us>
      Cc: stable@vger.kernel.org (v4.0+)
      Fixes: afa0f557 ("md: rename ->stop to ->free")
      Signed-off-by: NeilBrown <neilb@suse.com>
  11. 24 Jun, 2015 1 commit
  12. 17 Jun, 2015 2 commits
  13. 12 Jun, 2015 3 commits
    • N
      md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync · ea358cd0
      NeilBrown committed
      MD_RECOVERY_DONE is normally cleared by md_check_recovery after a
      resync etc finished.  However it is possible for raid5_start_reshape
      to race and start a reshape before MD_RECOVERY_DONE is cleared.  This
      can lead to multiple reshapes running at the same time, which isn't
      good.
      
      So make sure it is cleared before starting a reshape, and also clear
      it when reaping a thread, just to be safe.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • N
      md: Close race when setting 'action' to 'idle'. · 8e8e2518
      NeilBrown committed
      Checking ->sync_thread without holding the mddev_lock()
      isn't really safe, even after flushing the workqueue which
      ensures md_start_sync() has been run.
      
      While this code is waiting for the lock, md_check_recovery could reap
      the thread itself, and then start another thread (e.g. recovery might
      finish, then reshape starts).  When this thread gets the lock
      md_start_sync() hasn't run so it doesn't get reaped, but
      MD_RECOVERY_RUNNING gets cleared.  This allows two threads to start
      which leads to confusion.
      
      So don't bother if MD_RECOVERY_RUNNING isn't set, but if it is do
      the flush and the test and the reap all under the mddev_lock to
      avoid any race with md_check_recovery.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Fixes: 6791875e ("md: make reconfig_mutex optional for writes to md sysfs files.")
      Cc: stable@vger.kernel.org (v4.0+)
    • N
      md: don't return 0 from array_state_store · c008f1d3
      NeilBrown committed
      Returning zero from a 'store' function is bad.
      The return value should be either the length of the string
      or an error.
      
      So use 'len' if 'err' is zero.
      
      Fixes: 6791875e ("md: make reconfig_mutex optional for writes to md sysfs files.")
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@vger.kernel (v4.0+)
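The return-value convention can be illustrated with a user-space sketch. `demo_state_store` is a hypothetical function that accepts only two example strings; the point is the final line, which returns the consumed length on success and a negative errno otherwise, never zero.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h> /* ssize_t (POSIX) */

/*
 * Hypothetical 'store' handler: parses buf and updates *state.
 * Returns len on success or a negative errno on failure. Returning 0
 * would tell the writer that nothing was consumed, which typically
 * makes it retry the same write indefinitely.
 */
static ssize_t demo_state_store(const char *buf, size_t len, int *state)
{
    size_t n = len;
    int err = 0;

    assert(buf && state);
    if (n > 0 && buf[n - 1] == '\n')
        n--; /* tolerate the newline added by echo */

    if (n == 5 && memcmp(buf, "clean", 5) == 0)
        *state = 0;
    else if (n == 6 && memcmp(buf, "active", 6) == 0)
        *state = 1;
    else
        err = -EINVAL;

    return err ? err : (ssize_t)len;
}
```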
  14. 28 May, 2015 1 commit
    • N
      md: fix race when unfreezing sync_action · 56ccc112
      NeilBrown committed
      A recent change removed the need for locking around writing
      to "sync_action" (and various other places), but introduced a
      subtle race.
      When e.g. setting 'reshape' on a 'frozen' array, the 'frozen'
      flag is cleared before 'reshape' is set, so the md thread can
      get in and start trying recovery - which isn't wanted.
      
      So instead of clearing MD_RECOVERY_FROZEN for any command
      except 'frozen', only clear it when each specific command
      is parsed.  This allows the handling of 'reshape' to clear
      the bit while a lock is held.
      
      Also remove some places where we set MD_RECOVERY_NEEDED,
      as it is always set on non-error exit of the function.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Fixes: 6791875e ("md: make reconfig_mutex optional for writes to md sysfs files.")
  15. 28 Apr, 2015 1 commit
    • N
      block: destroy bdi before blockdev is unregistered. · 6cd18e71
      NeilBrown committed
      Because of the peculiar way that md devices are created (automatically
      when the device node is opened), a new device can be created and
      registered immediately after the
      	blk_unregister_region(disk_devt(disk), disk->minors);
      call in del_gendisk().
      
      Therefore it is important that all visible artifacts of the previous
      device are removed before this call.  In particular, the 'bdi'.
      
      Since:
      commit c4db59d3
      Author: Christoph Hellwig <hch@lst.de>
          fs: don't reassign dirty inodes to default_backing_dev_info
      
      moved the
         device_unregister(bdi->dev);
      call from bdi_unregister() to bdi_destroy() it has been quite easy to
      lose a race and have a new (e.g.) "md127" be created after the
      blk_unregister_region() call and before bdi_destroy() is ultimately
      called by the final 'put_disk', which must come after del_gendisk().
      
      The new device finds that the bdi name is already registered in sysfs
      and complains
      
      > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70()
      > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127'
      
      We can fix this by moving the bdi_destroy() call out of
      blk_release_queue() (which can happen very late when a refcount
      reaches zero) and into blk_cleanup_queue() - which happens exactly when the md
      device driver calls it.
      
      Then it is only necessary for md to call blk_cleanup_queue() before
      del_gendisk().  As loop.c devices are also created on demand by
      opening the device node, we make the same change there.
      
      Fixes: c4db59d3
      Reported-by: Azat Khuzhin <a3at.mail@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org (v4.0)
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  16. 22 Apr, 2015 5 commits
    • N
      md: allow resync to go faster when there is competing IO. · ac8fa419
      NeilBrown committed
      When md notices non-sync IO happening while it is trying
      to resync (or reshape or recover) it slows down to the
      set minimum.
      
      The default minimum might have made sense many years ago
      but the drives have become faster.  Changing the default
      to match the times isn't really a long term solution.
      
      This patch changes the code so that instead of waiting until the speed
      has dropped to the target, it just waits until pending requests
      have completed.
      This means that the delay inserted is a function of the speed
      of the devices.
      
      Testing shows that:
       - for some loads, the resync speed is unchanged.  For those loads
         increasing the minimum doesn't change the speed either.
         So this is a good result.  To increase resync speed under such
         loads we would probably need to increase the resync window
         size.
      
       - for other loads, resync speed does increase to a reasonable
         fraction (e.g. 20%) of maximum possible, and throughput of
         the load only drops a little bit (e.g. 10%)
      
       - for other loads, throughput of the non-sync load drops quite a bit
         more.  These seem to be latency-sensitive loads.
      
      So it isn't a perfect solution, but it is mostly an improvement.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • N
      md: remove 'go_faster' option from ->sync_request() · 09314799
      NeilBrown committed
      This option is not well justified and testing suggests that
      it hardly ever makes any difference.
      
      The comment suggests there might be a need to wait for non-resync
      activity indicated by ->nr_waiting, however raise_barrier()
      already waits for all of that.
      
      So just remove it to simplify reasoning about speed limiting.
      
      This allows us to remove a 'FIXME' comment from raid5.c as that
      never used the flag.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • N
      md: don't require sync_min to be a multiple of chunk_size. · 50c37b13
      NeilBrown committed
      There is really no need for sync_min to be a multiple of
      chunk_size, and values read from here often aren't.
      That means you cannot read a value and expect to be able
      to write it back later.
      
      So remove the chunk_size check, and round down to a multiple
      of 4K, to be sure everything works with 4K-sector devices.
      Signed-off-by: NeilBrown <neilb@suse.de>
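Rounding down to a 4K boundary is a one-line mask when the value is counted in 512-byte sectors. The sketch below assumes that unit, and `demo_round_sync_min` and `DEMO_SECTORS_PER_4K` are hypothetical names.

```c
#include <assert.h>

/* Assumption: values are in 512-byte sectors, so 4K is 8 sectors. */
#define DEMO_SECTORS_PER_4K 8ULL

static_assert((DEMO_SECTORS_PER_4K & (DEMO_SECTORS_PER_4K - 1)) == 0,
              "the mask trick needs a power of two");

/* Round a sector count down to a multiple of 4K. */
static unsigned long long demo_round_sync_min(unsigned long long sectors)
{
    return sectors & ~(DEMO_SECTORS_PER_4K - 1);
}
```

Unlike a chunk_size check, this accepts any value and a value that is already aligned is returned unchanged, so an aligned value read back from the attribute can be written again.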
    • G
      md-cluster: re-add capabilities · 97f6cd39
      Goldwyn Rodrigues committed
      When "re-add" is written to /sys/block/mdXX/md/dev-YYY/state,
      the clustered md:
      
      1. Sends RE_ADD message with the desc_nr. Nodes receiving the message
         clear the Faulty bit in their respective rdev->flags.
      2. The node initiating re-add, gathers the bitmaps of all nodes
         and copies them into the local bitmap. It does not clear the bitmap
         from which it is copying.
      3. Initiating node schedules a md recovery to sync the devices.
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • G
      md: re-add a failed disk · a6da4ef8
      Goldwyn Rodrigues committed
      This adds the capability of re-adding a failed disk by
      writing "re-add" to /sys/block/mdXX/md/dev-YYY/state.
      
      This facilitates adding disks which have encountered a temporary
      error such as a network disconnection/hiccup in an iSCSI device,
      or a SAN cable disconnection which has been restored. In such
      a situation, you do not need to remove and re-add the device.
      Writing re-add to the failed device's state would add it again
      to the array and perform the recovery of only the blocks which
      were written after the device failed.
      
      This works for generic md, and is not related to clustering. However,
      this patch is to ease re-add operations listed above in clustering
      environments.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>