1. 19 Nov 2013 (10 commits)
    • raid1: Rewrite the implementation of iobarrier. · 79ef3a8a
      majianpeng committed
      There is an iobarrier in raid1 because of contention between normal IO and
      resync IO.  It suspends all normal IO when resync/recovery happens.
      
      However, if normal IO is outside the resync window, there is no contention.
      So this patch changes the barrier mechanism to only block IO that
      could contend with the resync that is currently happening.
      
      We partition the whole space into five parts:
      |---------|-----------|------------------|--------------|---------|
                ^           ^                  ^              ^
              start    next_resync     start_next_window end_window
      
      start + RESYNC_WINDOW = next_resync
      next_resync + NEXT_NORMALIO_DISTANCE = start_next_window
      start_next_window + NEXT_NORMALIO_DISTANCE = end_window
      
      Firstly we introduce some concepts:
      
      1 - RESYNC_WINDOW: for resync, there are at most 32 resync requests in
            flight at the same time. Each sync request is RESYNC_BLOCK_SIZE
            (64*1024 bytes), so RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE,
            that is 2MB.
      2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync
            and start_next_window.  It also indicates the distance between
            start_next_window and end_window.
            It is currently 3 * RESYNC_WINDOW_SIZE but could be tuned if
            this turned out not to be optimal.
      3 - next_resync: the next sector at which we will do sync IO.
      4 - start: a position which is at most RESYNC_WINDOW before
            next_resync.
      5 - start_next_window:  a position which is NEXT_NORMALIO_DISTANCE
            beyond next_resync.  Normal-io after this position doesn't need to
            wait for resync-io to complete.
      6 - end_window:  a position which is 2 * NEXT_NORMALIO_DISTANCE beyond
            next_resync.  This also doesn't need to wait, but is counted
            differently.
      7 - current_window_requests:  the count of normalIO between
            start_next_window and end_window.
      8 - next_window_requests: the count of normalIO after end_window.
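
      Concretely, these concepts become new bookkeeping fields in struct
      r1conf, roughly as below (a sketch only; exact types, placement and
      comments are assumed, not copied from the patch):

          struct r1conf {
                  /* ... existing fields ... */
                  sector_t start_next_window;       /* MaxSector while no
                                                     * NormIO2/3 exist */
                  int      current_window_requests; /* NormIO3 count */
                  int      next_window_requests;    /* NormIO2 count */
          };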
      
      NormalIO will be partitioned into four types:
      
      NormIO1:  the end sector of the bio is smaller than or equal to start
      NormIO2:  the start sector of the bio is larger than or equal to
                end_window
      NormIO3:  the start sector of the bio is larger than or equal to
                start_next_window (but smaller than end_window)
      NormIO4:  the bio falls between start and start_next_window
      
      |---------|-----------|------------------|--------------|---------|
                ^           ^                  ^              ^
              start    next_resync     start_next_window end_window
        NormIO1    NormIO4        NormIO4          NormIO3      NormIO2
      
      For NormIO1, we don't need any io barrier.
      For NormIO4, we use a similar approach to the original iobarrier
          mechanism: the normal IO and the resync IO must be kept separate.
      For NormIO2/3, we add two fields to struct r1conf, "current_window_requests"
          and "next_window_requests", which count the active requests in the
          two windows.  For these, we don't wait for resync io to complete.
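
      A minimal sketch of this classification (the helper name and the
      sector-unit window macros are assumptions, not the actual raid1.c
      code):

          /* Classify a normal bio against the current resync windows. */
          static int classify_normal_io(struct r1conf *conf, struct bio *bio)
          {
                  sector_t lo = bio->bi_sector;            /* first sector */
                  sector_t hi = lo + bio_sectors(bio);     /* one past last */
                  sector_t start = conf->next_resync - RESYNC_WINDOW_SECTORS;
                  sector_t end_window = conf->start_next_window
                                        + NEXT_NORMALIO_DISTANCE;

                  if (hi <= start)
                          return 1;   /* NormIO1: no barrier needed */
                  if (lo >= end_window)
                          return 2;   /* NormIO2 */
                  if (lo >= conf->start_next_window)
                          return 3;   /* NormIO3 */
                  return 4;           /* NormIO4: classic barrier logic */
          }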
      
      For the resync action, if there are NormIO4s, we must wait for them;
      if not, we can proceed.
      But if the resync action reaches start_next_window and
      current_window_requests > 0 (that is, there are NormIO3s), we must
      wait until current_window_requests becomes zero.
      When current_window_requests becomes zero, start_next_window also
      moves forward, and current_window_requests is replaced by
      next_window_requests, as sketched below.
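
      In outline (wait_event_lock_irq() is the real kernel primitive; the
      NormIO4 counter name is an assumption):

          /* raise_barrier(), before issuing resync IO at conf->next_resync */
          spin_lock_irq(&conf->resync_lock);
          /* classic barrier: wait until no NormIO4 is in flight */
          wait_event_lock_irq(conf->wait_barrier,
                              conf->nr_pending_normio4 == 0, /* assumed name */
                              conf->resync_lock);
          /* new: once resync reaches start_next_window, also wait for the
           * NormIO3 count to drain before the window can advance */
          if (conf->next_resync >= conf->start_next_window)
                  wait_event_lock_irq(conf->wait_barrier,
                                      conf->current_window_requests == 0,
                                      conf->resync_lock);
          spin_unlock_irq(&conf->resync_lock);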
      
      There remains the question of when and how a request changes from
      NormIO2 to NormIO3, since only then can the sync action progress.
      
      We add a field in struct r1conf "start_next_window".
      
      A: if start_next_window == MaxSector, it means there are no NormIO2/3.
         So start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
      B: if current_window_requests == 0 && next_window_requests != 0, it
         means start_next_window moves to end_window
      
      Another problem is how to differentiate between an old NormIO2 (which
      has now become NormIO3) and a current NormIO2.
      For example, there may be many NormIO2 bios and one NormIO3 bio; when
      the NormIO3 bio completes first, start_next_window advances and the
      NormIO2 bios become NormIO3.
      
      We add a field in struct r1bio "start_next_window".
      This is used to record the position conf->start_next_window when the call
      to wait_barrier() is made in make_request().
      
      In allow_barrier(), we check the conf->start_next_window.
      If r1bio->start_next_window == conf->start_next_window, it means
      there has been no transition between NormIO2 and NormIO3.
      If r1bio->start_next_window != conf->start_next_window, it means
      there was a transition between NormIO2 and NormIO3.  There can only
      have been one transition, so the bio must be an old NormIO2.
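
      A sketch of that check (simplified; the real allow_barrier() also
      receives the bio's sector to pick the right window count):

          spin_lock_irqsave(&conf->resync_lock, flags);
          if (r1_bio->start_next_window == conf->start_next_window) {
                  /* no transition while the request was in flight:
                   * decrement whichever window count it was charged to */
          } else {
                  /* exactly one transition happened, so this request was
                   * an old NormIO2, now accounted as NormIO3 */
                  conf->current_window_requests--;
          }
          spin_unlock_irqrestore(&conf->resync_lock, flags);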
      
      One bio may be split into many r1bio's, so we must make sure all of
      them carry the same r1bio->start_next_window value.
      If make_request() hits a blocked_dev, it must call allow_barrier()
      and then wait_barrier() again, so the value of
      conf->start_next_window may change in between.
      If the r1bio's of one bio ended up with different start_next_window
      values, the accounting for that bio would depend on the last r1bio's
      value and would be wrong.  To avoid this, we wait for the previous
      r1bios to complete first.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid1: Add some macros to make the code clearer. · 8e005f7c
      majianpeng committed
      In a subsequent patch we'll use some const parameters.
      Using macros will make the code clearer.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array. · 07169fd4
      majianpeng committed
      
      We used to use raise_barrier to suspend normal IO while we reconfigured
      the array.  However, raise_barrier will soon suspend only some normal
      IO, not all of it, so we need something else.
      Change it to use freeze_array.
      freeze_array suspends not only normal io but also resync io.
      For the places that called raise_barrier for reconfiguration, this
      isn't a problem.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid1: Add a field array_frozen to indicate whether the array is frozen. · b364e3d0
      majianpeng committed
      The following patch will rewrite the interaction between normal IO
      and resync IO, so we add a field to indicate whether the array is
      frozen.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Convert use of typedef ctl_table to struct ctl_table · 82592c38
      Joe Perches committed
      This typedef is unnecessary and should just be removed.
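
      For illustration, the conversion is mechanical (md's sysctl table is
      used here as a representative example, written from the description
      rather than copied from the patch):

          static ctl_table raid_table[] = { /* ... */ };        /* before */
          static struct ctl_table raid_table[] = { /* ... */ }; /* after */
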
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes. · 30b8feb7
      NeilBrown committed
      When raid5 recovery hits a fresh badblock, the badblock is flagged as
      an unacknowledged ("unack") badblock until md_update_sb() is called.
      But md_stop takes the reconfig lock, which means raid5d can't call
      md_update_sb() from md_check_recovery(); the badblock stays
      unacknowledged, so the raid5d thread enters an infinite loop and
      md_stop_writes() can never stop the sync_thread.  This causes a
      deadlock.
      
      To solve this, when the STOP_ARRAY ioctl is issued and sync_thread is
      running, we need to set the MD_RECOVERY_FROZEN and MD_RECOVERY_INTR
      flags in mddev->recovery and wait for sync_thread to stop before we
      (re)take the reconfig lock.
      
      This requires that raid5 reshape_request notices MD_RECOVERY_INTR
      (which it probably should have noticed anyway) and stops waiting for a
      metadata update in that case.
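
      A hedged sketch of the new stop path (helper placement assumed;
      mddev_lock_nointr() is the uninterruptible lock from the mddev_lock
      commit below):

          /* STOP_ARRAY path, before (re)taking the reconfig lock */
          if (mddev->sync_thread) {
                  set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
                  set_bit(MD_RECOVERY_INTR, &mddev->recovery);
                  md_wakeup_thread(mddev->thread);
          }
          wait_event(resync_wait, mddev->sync_thread == NULL);
          mddev_lock_nointr(mddev);
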
      Reported-by: Jianpeng Ma <majianpeng@gmail.com>
      Reported-by: Bian Yu <bianyu@kedacom.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread. · c91abf5a
      NeilBrown committed
      We currently use kthread_should_stop() in various places in the
      sync/reshape code to abort early.
      However some places set MD_RECOVERY_INTR but don't immediately call
      md_reap_sync_thread() (and we will shortly get another one).
      When this happens we are relying on md_check_recovery() to reap the
      thread, and that only happens when it finishes normally.
      So MD_RECOVERY_INTR must lead to a normal finish without the
      kthread_should_stop() test.
      
      So replace all relevant tests, and be more careful when the thread is
      interrupted not to acknowledge that latest step in a reshape as it may
      not be fully committed yet.
      
      Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
      so we don't have to wait for the speed to drop before we can abort.
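
      Each abort test changes shape roughly like this (illustrative, not an
      exact hunk from the patch):

          /* old: */
          if (kthread_should_stop())
                  break;
          /* new: leave via the normal completion path so that
           * md_check_recovery() will reap the thread */
          if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
                  break;
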
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: fix some places where mddev_lock return value is not checked. · 29f097c4
      NeilBrown committed
      Sometimes we need to lock an mddev and cannot cope with failure due to
      an interrupt.
      In these cases we should use mutex_lock, not mutex_lock_interruptible.
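
      A minimal sketch of the uninterruptible variant this implies (the
      helper name is an assumption, modelled on mddev_lock()):

          static void mddev_lock_nointr(struct mddev *mddev)
          {
                  mutex_lock(&mddev->reconfig_mutex);
          }
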
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: Retry R5_ReadNoMerge flag when hitting a read error. · edfa1f65
      Bian Yu committed
      Because of block-layer merging, one failing bio causes the other bios
      belonging to the same request to fail as well, so
      raid5_end_read_request() records all of these bios as badblocks.
      If we retry the request with the R5_ReadNoMerge flag to avoid merging,
      the badblocks record only the sector that is actually bad.
      
      test:
      hdparm --yes-i-know-what-i-am-doing --make-bad-sector 300000 /dev/sdb
      mdadm -C /dev/md0 -l5 -n3 /dev/sd[bcd] --assume-clean
      mdadm /dev/md0 -f /dev/sdd
      mdadm /dev/md0 -r /dev/sdd
      mdadm --zero-superblock /dev/sdd
      mdadm /dev/md0 -a /dev/sdd
      
      1. Without this patch:
      cat /sys/block/md0/md/rd*/bad_blocks
      299776 256
      299776 256
      
      2. With this patch:
      cat /sys/block/md0/md/rd*/bad_blocks
      300000 8
      300000 8
      Signed-off-by: Bian Yu <bianyu@kedacom.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: relieve lock contention in get_active_stripe() · 4bda556a
      Shaohua Li committed
      Track the count of empty inactive lists so that md_raid5_congested()
      can use it to make its decision.
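
      In outline (the counter name follows the description; its exact
      placement is assumed):

          atomic_t empty_inactive_list_nr;  /* in struct r5conf */

          /* in md_raid5_congested() */
          if (atomic_read(&conf->empty_inactive_list_nr))
                  return 1; /* some inactive list is empty: congested */
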
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 14 Nov 2013 (4 commits)
    • raid5: relieve lock contention in get_active_stripe() · 566c09c5
      Shaohua Li committed
      get_active_stripe() is the last place we have lock contention.  It has
      two paths: one where the stripe isn't found and a new stripe is
      allocated, the other where the stripe is found.
      
      The first path basically calls __find_stripe and init_stripe.  It
      accesses conf->generation, conf->previous_raid_disks, conf->raid_disks,
      conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
      conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list.
      Except for stripe_hashtbl and inactive_list, these fields change very
      rarely.
      
      With this patch, we split inactive_list and add new hash locks.  Each
      free stripe belongs to a specific inactive list, determined by the
      stripe's lock_hash.  Note that even a stripe without a sector assigned
      has a lock_hash assigned.  A stripe's inactive list is protected by a
      hash lock, which is determined by its lock_hash too.  The lock_hash is
      derived from the current stripe_hashtbl hash, which guarantees any
      stripe_hashtbl list will be assigned to a specific lock_hash, so we can
      use the new hash lock to protect the stripe_hashtbl list too.  The goal
      of the new hash locks is that the first path of get_active_stripe()
      needs only the new locks.  Since we have several hash locks, lock
      contention is relieved significantly.
      
      The first path of get_active_stripe() accesses other fields too; since
      they change rarely, changing them now requires taking conf->device_lock
      and all the hash locks.  For a slow path, this isn't a problem.
      
      If we need both device_lock and a hash lock, we always take the hash
      lock first (sketched below).  The tricky part is release_stripe and
      friends, which need to take device_lock first.  Neil's suggestion is to
      put inactive stripes on a temporary list and re-add them to the
      inactive_list after device_lock is released.  In this way, we add
      stripes to the temporary list with device_lock held and remove stripes
      from the list with the hash lock held.  So we don't allow concurrent
      access to the temporary list, which means we need to allocate a
      temporary list for every participant of release_stripe.
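
      The resulting lock-ordering rule, sketched (the helper shape is an
      assumption):

          static void lock_device_hash_lock(struct r5conf *conf, int hash)
          {
                  spin_lock_irq(conf->hash_locks + hash); /* hash lock first */
                  spin_lock(&conf->device_lock);          /* then device_lock */
          }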
      
      One downside is that free stripes are maintained on their own inactive
      list and cannot move between lists.  By default we have 256 stripes in
      total and 8 lists, so each list has 32 stripes.  It's possible for one
      list to have a free stripe while another hasn't.  The chance should be
      rare because stripe allocation is evenly distributed.  And we can
      always allocate more stripes for the cache; several megabytes of
      memory isn't a big deal.
      
      This completely removes the lock contention on the first path of
      get_active_stripe().  It slows down the second path a little because we
      now need to take two locks, but since the hash lock isn't contended,
      the overhead should be quite small (several atomic instructions).  The
      second path of get_active_stripe() (basically sequential writes or
      large random writes) still has lock contention.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5.c: add proper locking to error path of raid5_start_reshape. · ba8805b9
      NeilBrown committed
      If raid5_start_reshape errors out, we need to reset all the fields
      that were updated (not just some), and we need to use the seq_counter
      to ensure make_request() doesn't use an inconsistent state.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: fix calculation of stacking limits on level change. · 02e5f5c0
      NeilBrown committed
      The various ->run routines of md personalities assume that the 'queue'
      has been initialised by the blk_set_stacking_limits() call in
      md_alloc().
      
      However when the level is changed (by level_store()) the ->run routine
      for the new level is called for an array which has already had the
      stacking limits modified.  This can result in incorrect final
      settings.
      
      So call blk_set_stacking_limits() before ->run in level_store().
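
      Sketched against the usual level_store() shape (simplified; error
      handling omitted):

          blk_set_stacking_limits(&mddev->queue->limits);
          pers->run(mddev);  /* ->run now sees freshly reset limits */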
      
      A specific consequence of this bug is that it causes
      discard_granularity to be set incorrectly when reshaping a RAID4 to a
      RAID0.
      
      This is suitable for any -stable kernel since 3.3 in which
      blk_set_stacking_limits() was introduced.
      
      Cc: stable@vger.kernel.org (3.3+)
      Reported-and-tested-by: "Baldysiak, Pawel" <pawel.baldysiak@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: Use slow_path to release stripe when mddev->thread is null · ad4068de
      majianpeng committed
      When release_stripe() is called from grow_one_stripe(), mddev->thread
      is NULL, so the wakeup that would normally make that thread release
      the stripe is omitted.
      In this condition, use the slow path to release the stripe, as
      sketched below.
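
      The guard, in outline (simplified from the description; the real test
      sits in release_stripe()):

          if (unlikely(!conf->mddev->thread))
                  goto slow_path; /* no raid5d to wake: free directly */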
      
      Bug was introduced in 3.12
      
      Cc: stable@vger.kernel.org (3.12+)
      Fixes: 773ca82f
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  3. 08 Nov 2013 (1 commit)
    • parisc: sticon - unbreak on 64bit kernel · 0219132f
      Helge Deller committed
      STI text console (sticon) was broken on 64bit machines with more than
      4GB RAM, and in some cases this led to a kernel crash.
      
      Since sticon uses the 32bit STI API, it needs to keep pointers to
      memory below 4GB.  But on a 64bit kernel some memory regions (e.g. the
      kernel stack) might be above 4GB, which may then crash the kernel in
      the STI functions.
      
      Additionally, sticon didn't select the built-in framebuffer fonts by
      default.  This is now fixed.
      
      On a side note: theoretically we could enhance the sticon driver to
      use the 64bit STI API.  But, besides the fact that some machines don't
      provide a 64bit STI ROM, this would just add complexity.
      Signed-off-by: Helge Deller <deller@gmx.de>
      Cc: stable@vger.kernel.org # 3.8+
  4. 07 Nov 2013 (1 commit)
  5. 06 Nov 2013 (1 commit)
  6. 04 Nov 2013 (1 commit)
  7. 02 Nov 2013 (13 commits)
  8. 01 Nov 2013 (4 commits)
  9. 31 Oct 2013 (5 commits)