1. 10 11月, 2016 1 次提交
  2. 12 10月, 2015 1 次提交
    • G
      md-cluster: Use a small window for resync · c40f341f
      Goldwyn Rodrigues 提交于
      Suspending the entire device for resync could take too long. Resync
      in small chunks.
      
      cluster's resync window (32M) is maintained in r1conf as
      cluster_sync_low and cluster_sync_high and processed in
      raid1's sync_request(). If the current resync is outside the cluster
      resync window:
      
      1. Set the cluster_sync_low to curr_resync_completed.
      2. Check if the sync will fit in the new window, if not issue a
         wait_barrier() and set cluster_sync_low to sector_nr.
      3. Set cluster_sync_high to cluster_sync_low + resync_window.
      4. Send a message to all nodes so they may add it in their suspension
         list.
      
      bitmap_cond_end_sync is modified to allow to force a sync inorder
      to get the curr_resync_completed uptodate with the sector passed.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c40f341f
  3. 01 9月, 2015 1 次提交
    • N
      md/raid1: ensure device failure recorded before write request returns. · 55ce74d4
      NeilBrown 提交于
      When a write to one of the legs of a RAID1 fails, the failure is
      recorded in the metadata of the other leg(s) so that after a restart
      the data on the failed drive wont be trusted even if that drive seems
      to be working again  (maybe a cable was unplugged).
      
      Similarly when we record a bad-block in response to a write failure,
      we must not let the write complete until the bad-block update is safe.
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a racy to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that experienced an error on a new queue which
         is only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      55ce74d4
  4. 04 2月, 2015 1 次提交
    • N
      md: make ->congested robust against personality changes. · 5c675f83
      NeilBrown 提交于
      There is currently no locking around calls to the 'congested'
      bdi function.  If called at an awkward time while an array is
      being converted from one level (or personality) to another, there
      is a tiny chance of running code in an unreferenced module etc.
      
      So add a 'congested' function to the md_personality operations
      structure, and call it with appropriate locking from a central
      'mddev_congested'.
      
      When the array personality is changing the array will be 'suspended'
      so no IO is processed.
      If mddev_congested detects this, it simply reports that the
      array is congested, which is a safe guess.
      As mddev_suspend calls synchronize_rcu(), mddev_congested can
      avoid races by included the whole call inside an rcu_read_lock()
      region.
      This require that the congested functions for all subordinate devices
      can be run under rcu_lock.  Fortunately this is the case.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5c675f83
  5. 14 10月, 2014 1 次提交
  6. 19 11月, 2013 2 次提交
    • M
      raid1: Rewrite the implementation of iobarrier. · 79ef3a8a
      majianpeng 提交于
      There is an iobarrier in raid1 because of contention between normal IO and
      resync IO.  It suspends all normal IO when resync/recovery happens.
      
      However if normal IO is out side the resync window, there is no contention.
      So this patch changes the barrier mechanism to only block IO that
      could contend with the resync that is currently happening.
      
      We partition the whole space into five parts.
      |---------|-----------|------------|----------------|-------|
              start   next_resync   start_next_window    end_window
      
      start + RESYNC_WINDOW = next_resync
      next_resync + NEXT_NORMALIO_DISTANCE = start_next_window
      start_next_window + NEXT_NORMALIO_DISTANCE = end_window
      
      Firstly we introduce some concepts:
      
      1 - RESYNC_WINDOW: For resync, there are 32 resync requests at most at the
            same time. A sync request is RESYNC_BLOCK_SIZE(64*1024).
            So the RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE, that is 2MB.
      2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync
            and start_next_window.  It also indicates the distance between
            start_next_window and end_window.
            It is currently 3 * RESYNC_WINDOW_SIZE but could be tuned if
            this turned out not to be optimal.
      3 - next_resync: the next sector at which we will do sync IO.
      4 - start: a position which is at most RESYNC_WINDOW before
            next_resync.
      5 - start_next_window:  a position which is NEXT_NORMALIO_DISTANCE
            beyond next_resync.  Normal-io after this position doesn't need to
            wait for resync-io to complete.
      6 - end_window:  a position which is 2 * NEXT_NORMALIO_DISTANCE beyond
            next_resync.  This also doesn't need to wait, but is counted
            differently.
      7 - current_window_requests:  the count of normalIO between
            start_next_window and end_window.
      8 - next_window_requests: the count of normalIO after end_window.
      
      NormalIO will be partitioned into four types:
      
      NormIO1:  the end sector of bio is smaller or equal the start
      NormIO2:  the start sector of bio larger or equal to end_window
      NormIO3:  the start sector of bio larger or equal to
                start_next_window.
      NormIO4:  the location between start_next_window and end_window
      
      |--------|-----------|--------------------|----------------|-------------|
          | start   |   next_resync   |  start_next_window   |  end_window |
       NormIO1   NormIO4            NormIO4                NormIO3      NormIO2
      
      For NormIO1, we don't need any io barrier.
      For NormIO4, we used a similar approach to the original iobarrier
          mechanism.  The normalIO and resyncIO must be kept separate.
      For NormIO2/3, we add two fields to struct r1conf: "current_window_requests"
          and "next_window_requests". They indicate the count of active
          requests in the two window.
          For these, we don't wait for resync io to complete.
      
      For resync action, if there are NormIO4s, we must wait for it.
      If not, we can proceed.
      But if resync action reaches start_next_window and
      current_window_requests > 0 (that is there are NormIO3s), we must
      wait until the current_window_requests becomes zero.
      When current_window_requests becomes zero,  start_next_window also
      moves forward. Then current_window_requests will replaced by
      next_window_requests.
      
      There is a problem which when and how to change from NormIO2 to
      NormIO3.  Only then can sync action progress.
      
      We add a field in struct r1conf "start_next_window".
      
      A: if start_next_window == MaxSector, it means there are no NormIO2/3.
         So start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
      B: if current_window_requests == 0 && next_window_requests != 0, it
         means start_next_window move to end_window
      
      There is another problem which how to differentiate between
      old NormIO2(now it is NormIO3) and NormIO2.
      For example, there are many bios which are NormIO2 and a bio which is
      NormIO3. NormIO3 firstly completed, so the bios of NormIO2 became NormIO3.
      
      We add a field in struct r1bio "start_next_window".
      This is used to record the position conf->start_next_window when the call
      to wait_barrier() is made in make_request().
      
      In allow_barrier(), we check the conf->start_next_window.
      If r1bio->stat_next_window == conf->start_next_window, it means
      there is no transition between NormIO2 and NormIO3.
      If r1bio->start_next_window != conf->start_next_window, it mean
      there was a transition between NormIO2 and NormIO3.  There can only
      have been one transition.  So it only means the bio is old NormIO2.
      
      For one bio, there may be many r1bio's. So we make sure
      all the r1bio->start_next_window are the same value.
      If we met blocked_dev in make_request(), it must call allow_barrier
      and wait_barrier. So the former and the later value of
      conf->start_next_window will be change.
      If there are many r1bio's with differnet start_next_window,
      for the relevant bio, it depend on the last value of r1bio.
      It will cause error. To avoid this, we must wait for previous r1bios
      to complete.
      Signed-off-by: NJianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      79ef3a8a
    • M
      raid1: Add a field array_frozen to indicate whether raid in freeze state. · b364e3d0
      majianpeng 提交于
      Because the following patch will rewrite the content between normal IO
      and resync IO. So we used a parameter to indicate whether raid is in freeze
      array.
      Signed-off-by: NJianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b364e3d0
  7. 31 7月, 2012 4 次提交
    • S
      md/raid1: prevent merging too large request · 12cee5a8
      Shaohua Li 提交于
      For SSD, if request size exceeds specific value (optimal io size), request size
      isn't important for bandwidth. In such condition, if making request size bigger
      will cause some disks idle, the total throughput will actually drop. A good
      example is doing a readahead in a two-disk raid1 setup.
      
      So when should we split big requests? We absolutly don't want to split big
      request to very small requests. Even in SSD, big request transfer is more
      efficient. This patch only considers request with size above optimal io size.
      
      If all disks are busy, is it worth doing a split? Say optimal io size is 16k,
      two requests 32k and two disks. We can let each disk run one 32k request, or
      split the requests to 4 16k requests and each disk runs two. It's hard to say
      which case is better, depending on hardware.
      
      So only consider case where there are idle disks. For readahead, split is
      always better in this case. And in my test, below patch can improve > 30%
      thoughput. Hmm, not 100%, because disk isn't 100% busy.
      
      Such case can happen not just in readahead, for example, in directio. But I
      suppose directio usually will have bigger IO depth and make all disks busy, so
      I ignored it.
      
      Note: if the raid uses any hard disk, we don't prevent merging. That will make
      performace worse.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      12cee5a8
    • S
      md/raid1: make sequential read detection per disk based · be4d3280
      Shaohua Li 提交于
      Currently the sequential read detection is global wide. It's natural to make it
      per disk based, which can improve the detection for concurrent multiple
      sequential reads. And next patch will make SSD read balance not use distance
      based algorithm, where this change help detect truly sequential read for SSD.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      be4d3280
    • J
      MD: Move macros from raid1*.h to raid1*.c · 473e87ce
      Jonathan Brassow 提交于
      MD RAID1/RAID10: Move some macros from .h file to .c file
      
      There are three macros (IO_BLOCKED,IO_MADE_GOOD,BIO_SPECIAL) which are defined
      in both raid1.h and raid10.h.  They are only used in there respective .c files.
      However, if we wish to make RAID10 accessible to the device-mapper RAID
      target (dm-raid.c), then we need to move these macros into the .c files where
      they are used so that they do not conflict with each other.
      
      The macros from the two files are identical and could be moved into md.h, but
      I chose to leave the duplication and have them remain in the personality
      files.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      473e87ce
    • J
      MD RAID1: rename mirror_info structure · 0eaf822c
      Jonathan Brassow 提交于
      MD RAID1: Rename the structure 'mirror_info' to 'raid1_info'
      
      The same structure name ('mirror_info') is used by raid10.  Each of these
      structures are defined in there respective header files.  If dm-raid is
      to support both RAID1 and RAID10, the header files will be included and
      the structure names must not collide.  While only one of these structure
      names needs to change, this patch adds consistency to the naming of the
      structure.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      0eaf822c
  8. 23 12月, 2011 1 次提交
    • N
      md/raid1: Allocate spare to store replacement devices and their bios. · 8f19ccb2
      NeilBrown 提交于
      In RAID1, a replacement is much like a normal device, so we just
      double the size of the relevant arrays and look at all possible
      devices for reads and writes.
      
      This means that the array looks like it is now double the size in some
      way - we need to be careful about that.
      In particular, we checking if the array is still degraded while
      creating a recovery request we need to only consider the first 'half'
      - i.e. the real (non-replacement) devices.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      8f19ccb2
  9. 11 10月, 2011 7 次提交
  10. 07 10月, 2011 1 次提交
  11. 28 7月, 2011 4 次提交
    • N
      md/raid1: Handle write errors by updating badblock log. · cd5ff9a1
      NeilBrown 提交于
      When we get a write error (in the data area, not in metadata),
      update the badblock log rather than failing the whole device.
      
      As the write may well be many blocks, we trying writing each
      block individually and only log the ones which fail.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
      cd5ff9a1
    • N
      md/raid1: store behind-write pages in bi_vecs. · 2ca68f5e
      NeilBrown 提交于
      When performing write-behind we allocate pages to store the data
      during write.
      Previously we just keep a list of pages.  Now we keep a list of
      bi_vec which includes offset and size.
      This means that the r1bio has complete information to create a new
      bio which will be needed for retrying after write errors.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
      2ca68f5e
    • N
      md/raid1: clear bad-block record when write succeeds. · 4367af55
      NeilBrown 提交于
      If we succeed in writing to a block that was recorded as
      being bad, we clear the bad-block record.
      
      This requires some delayed handling as the bad-block-list update has
      to happen in process-context.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
      4367af55
    • N
      md/raid1: avoid reading from known bad blocks. · d2eb35ac
      NeilBrown 提交于
      Now that we have a bad block list, we should not read from those
      blocks.
      There are several main parts to this:
        1/ read_balance needs to check for bad blocks, and return not only
           the chosen device, but also how many good blocks are available
           there.
        2/ fix_read_error needs to avoid trying to read from bad blocks.
        3/ read submission must be ready to issue multiple reads to
           different devices as different bad blocks on different devices
           could mean that a single large read cannot be served by any one
           device, but can still be served by the array.
           This requires keeping count of the number of outstanding requests
           per bio.  This count is stored in 'bi_phys_segments'
        4/ retrying a read needs to also be ready to submit a smaller read
           and queue another request for the rest.
      
      This does not yet handle bad blocks when reading to perform resync,
      recovery, or check.
      
      'md_trim_bio' will also be used for RAID10, so put it in md.c and
      export it.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d2eb35ac
  12. 27 7月, 2011 1 次提交
    • N
      md: change managed of recovery_disabled. · 5389042f
      NeilBrown 提交于
      If we hit a read error while recovering a mirror, we want to abort the
      recovery without necessarily failing the disk - as having a disk this
      a read error is better than not having an array at all.
      
      Currently this is managed with a per-array flag "recovery_disabled"
      and is only implemented for RAID1.  For RAID10 we will need finer
      grained control as we might want to disable recovery for individual
      devices separately.
      
      So push more of the decision making into the personality.
      'recovery_disabled' is now a 'cookie' which is copied when the
      personality want to disable recovery and is changed when a device is
      added to the array as this is used as a trigger to 'try recovery
      again'.
      
      This will allow RAID10 to get the control that it needs.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5389042f
  13. 08 6月, 2011 1 次提交
  14. 11 5月, 2011 1 次提交
    • N
      md/raid1: improve handling of pages allocated for write-behind. · af6d7b76
      NeilBrown 提交于
      The current handling and freeing of these pages is a bit fragile.
      We only keep the list of allocated pages in each bio, so we need to
      still have a valid bio when freeing the pages, which is a bit clumsy.
      
      So simply store the allocated page list in the r1_bio so it can easily
      be found and freed when we are finished with the r1_bio.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      af6d7b76
  15. 29 10月, 2010 1 次提交
  16. 10 9月, 2010 1 次提交
    • T
      md: implment REQ_FLUSH/FUA support · e9c7469b
      Tejun Heo 提交于
      This patch converts md to support REQ_FLUSH/FUA instead of now
      deprecated REQ_HARDBARRIER.  In the core part (md.c), the following
      changes are notable.
      
      * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
        processing of other requests and thus there is no reason to mark the
        queue congested while FLUSH/FUA is in progress.
      
      * REQ_FLUSH/FUA failures are final and its users don't need retry
        logic.  Retry logic is removed.
      
      * Preflush needs to be issued to all member devices but FUA writes can
        be handled the same way as other writes - their processing can be
        deferred to request_queue of member devices.  md_barrier_request()
        is renamed to md_flush_request() and simplified accordingly.
      
      For linear, raid0 and multipath, the core changes are enough.  raid1,
      5 and 10 need the following conversions.
      
      * raid1: Handling of FLUSH/FUA bio's can simply be deferred to
        request_queues of member devices.  Barrier related logic removed.
      
      * raid5: Queue draining logic dropped.  FUA bit is propagated through
        biodrain and stripe resconstruction such that all the updated parts
        of the stripe are written out with FUA writes if any of the dirtying
        writes was FUA.  preread_active_stripes handling in make_request()
        is updated as suggested by Neil Brown.
      
      * raid10: FUA bit needs to be propagated to write clones.
      
      linear, raid0, 1, 5 and 10 tested.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      e9c7469b
  17. 14 12月, 2009 1 次提交
  18. 16 6月, 2009 1 次提交
    • N
      md: remove mddev_to_conf "helper" macro · 070ec55d
      NeilBrown 提交于
      Having a macro just to cast a void* isn't really helpful.
      I would must rather see that we are simply de-referencing ->private,
      than have to know what the macro does.
      
      So open code the macro everywhere and remove the pointless cast.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      070ec55d
  19. 31 3月, 2009 2 次提交
  20. 03 10月, 2006 1 次提交
  21. 23 3月, 2006 1 次提交
  22. 07 1月, 2006 3 次提交
  23. 09 11月, 2005 1 次提交
    • N
      [PATCH] md: support BIO_RW_BARRIER for md/raid1 · a9701a30
      NeilBrown 提交于
      We can only accept BARRIER requests if all slaves handle
      barriers, and that can, of course, change with time....
      
      So we keep track of whether the whole array seems safe for barriers,
      and also whether each individual rdev handles barriers.
      
      We initially assumes barriers are OK.
      
      When writing the superblock we try a barrier, and if that fails, we flag
      things for no-barriers.  This will usually clear the flags fairly quickly.
      
      If writing the superblock finds that BIO_RW_BARRIER is -ENOTSUPP, we need to
      resubmit, so introduce function "md_super_wait" which waits for requests to
      finish, and retries ENOTSUPP requests without the barrier flag.
      
      When writing the real raid1, write requests which were BIO_RW_BARRIER but
      which aresn't supported need to be retried.  So raid1d is enhanced to do this,
      and when any bio write completes (i.e.  no retry needed) we remove it from the
      r1bio, so that devices needing retry are easy to find.
      
      We should hardly ever get -ENOTSUPP errors when writing data to the raid.
      It should only happen if:
        1/ the device used to support BARRIER, but now doesn't.  Few devices
           change like this, though raid1 can!
      or
        2/ the array has no persistent superblock, so there was no opportunity to
           pre-test for barriers when writing the superblock.
      Signed-off-by: NNeil Brown <neilb@cse.unsw.edu.au>
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      a9701a30
  24. 10 9月, 2005 1 次提交