1. 31 Jul, 2012 8 commits
    • md/raid1: prevent merging too large request · 12cee5a8
      Shaohua Li committed
      For an SSD, once the request size exceeds a specific value (the optimal io
      size), making the request bigger no longer improves bandwidth. In that
      situation, if growing the request leaves some disks idle, the total throughput
      will actually drop. A good example is doing readahead in a two-disk raid1 setup.
      
      So when should we split big requests? We absolutely don't want to split a big
      request into very small requests. Even on SSD, transferring a big request is
      more efficient. This patch only considers requests above the optimal io size.
      
      If all disks are busy, is it worth doing a split? Say the optimal io size is
      16k, with two 32k requests and two disks. We can let each disk run one 32k
      request, or split the requests into four 16k requests so each disk runs two.
      It's hard to say which case is better; it depends on the hardware.
      
      So we only consider the case where there are idle disks. For readahead, a split
      is always better in this case. In my test, the patch below improves throughput
      by more than 30%. Hmm, not 100%, because the disk isn't 100% busy.
      
      This case can happen outside readahead too, for example in directio. But I
      suppose directio usually has a bigger IO depth and keeps all disks busy, so
      I ignored it.
      
      Note: if the raid uses any hard disk, we don't prevent merging. Preventing it
      there would make performance worse.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
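
The policy described in this commit message can be sketched as a toy model. This is not the kernel code: `Disk`, `should_split`, and the byte sizes used in the example are illustrative assumptions, and the sketch only encodes the three conditions stated above (all-SSD array, request above the optimal io size, at least one idle disk).

```python
from dataclasses import dataclass


@dataclass
class Disk:
    rotational: bool  # True for a hard disk, False for an SSD
    pending: int      # number of in-flight requests on this disk


def should_split(request_bytes, optimal_io_bytes, disks):
    """Return True when the toy policy would stop growing the request."""
    if any(d.rotational for d in disks):
        return False  # any hard disk in the array: keep merging
    if request_bytes <= optimal_io_bytes:
        return False  # never split down to small requests
    # Only split when some disk is idle and could take the remainder.
    return any(d.pending == 0 for d in disks)
```

With a 16k optimal io size, a 32k request on a two-SSD array is split only while one of the SSDs has nothing queued; once both are busy the model leaves the request alone, matching the "hard to say which case is better" reasoning above.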
    • md/raid1: read balance chooses idlest disk for SSD · 9dedf603
      Shaohua Li committed
      An SSD has no spindle, so the distance between requests means nothing. The
      original distance-based algorithm can sometimes cause severe performance issues
      for SSD raid.
      
      Consider two thread groups: one accesses file A, the other accesses file B.
      The first group will access one disk and the second will access the other disk,
      because requests are near within one group and far between groups. In this
      case, read balance might keep one disk very busy but the other relatively idle.
      For SSD, we should try our best to distribute requests to as many disks as
      possible. There is no spindle move penalty anyway.
      
      With the patch below, I sometimes see more than 50% throughput improvement,
      depending on the workload.
      
      The only exception is small requests that can be merged into a big request,
      which typically drives higher throughput for SSD too. Such small requests are
      sequential reads. Unlike on a hard disk, a sequential read which can't be merged
      (for example direct IO, or a read without readahead) can be ignored for SSD.
      Again, there is no spindle move penalty. Readahead dispatches small requests,
      and such requests can be merged.
      
      The previous patch helps detect sequential reads well, at least if the number
      of concurrent reads isn't greater than the number of raid disks. In that case,
      the distance-based algorithm doesn't work well either.
      
      V2: For a raid mixing hard disks and SSDs, don't use the distance-based
      algorithm for random IO either. This makes the algorithm generic for any raid
      with an SSD.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
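
A minimal sketch of the selection rule this message describes, assuming each mirror exposes a count of in-flight requests. The function name `choose_read_disk` and its parameters are hypothetical and do not correspond to the kernel's read_balance() signature.

```python
def choose_read_disk(pending, sequential_disk=None):
    """Pick a mirror index for a read on an SSD-backed array.

    pending: in-flight request count for each mirror.
    sequential_disk: index of a mirror whose previous read ends exactly
        where this read starts (so the two can be merged), or None.
    """
    if sequential_disk is not None:
        # The one exception: keep mergeable sequential reads together.
        return sequential_disk
    # Otherwise ignore distance entirely and take the idlest mirror.
    return min(range(len(pending)), key=lambda i: pending[i])
```

In the two-thread-group scenario above, a distance-based choice keeps sending each group to "its" disk, whereas this rule sends a random read to whichever mirror currently has the fewest requests queued.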
    • md/raid1: make sequential read detection per disk based · be4d3280
      Shaohua Li committed
      Currently the sequential read detection is global. It's natural to make it
      per-disk, which improves detection for multiple concurrent sequential reads.
      The next patch will make SSD read balance stop using the distance-based
      algorithm, where this change helps detect truly sequential reads for SSD.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
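
To illustrate why per-disk tracking helps, here is a small sketch under stated assumptions: each disk remembers the sector just past its own last read, and a read counts as sequential for a disk when it starts exactly there. The class `SeqDetector` and its field names are invented for this example.

```python
class SeqDetector:
    """Toy per-disk sequential read detector."""

    def __init__(self, ndisks):
        # One "next expected sector" per disk instead of a single global one.
        self.next_seq_sect = [None] * ndisks

    def is_sequential(self, disk, start_sector):
        return self.next_seq_sect[disk] == start_sector

    def record_read(self, disk, start_sector, nr_sectors):
        self.next_seq_sect[disk] = start_sector + nr_sectors
```

With a single global position, two interleaved streams would overwrite each other's state on every read. Here stream A on disk 0 and stream B on disk 1 each keep their own position, so both continue to be recognised as sequential.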
    • MD RAID10: Export md_raid10_congested · cc4d1efd
      Jonathan Brassow committed
      md/raid10: Export is_congested test.
      
      In similar fashion to commits
      	11d8a6e3
      	1ed7242e
      we export the RAID10 congestion checking function so that dm-raid.c can
      make use of it and make use of the personality.  The 'queue' and 'gendisk'
      structures will not be available to the MD code when device-mapper sets
      up the device, so we conditionalize access to these fields also.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD: Move macros from raid1*.h to raid1*.c · 473e87ce
      Jonathan Brassow committed
      MD RAID1/RAID10: Move some macros from .h file to .c file
      
      There are three macros (IO_BLOCKED, IO_MADE_GOOD, BIO_SPECIAL) which are defined
      in both raid1.h and raid10.h.  They are only used in their respective .c files.
      However, if we wish to make RAID10 accessible to the device-mapper RAID
      target (dm-raid.c), then we need to move these macros into the .c files where
      they are used so that they do not conflict with each other.
      
      The macros from the two files are identical and could be moved into md.h, but
      I chose to leave the duplication and have them remain in the personality
      files.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD RAID1: rename mirror_info structure · 0eaf822c
      Jonathan Brassow committed
      MD RAID1: Rename the structure 'mirror_info' to 'raid1_info'
      
      The same structure name ('mirror_info') is used by raid10.  Each of these
      structures is defined in its respective header file.  If dm-raid is
      to support both RAID1 and RAID10, the header files will be included and
      the structure names must not collide.  While only one of these structure
      names needs to change, this patch adds consistency to the naming of the
      structure.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD RAID10: rename mirror_info structure · dc280d98
      Jonathan Brassow committed
      MD RAID10: Rename the structure 'mirror_info' to 'raid10_info'
      
      The same structure name ('mirror_info') is used by raid1.  Each of these
      structures is defined in its respective header file.  If dm-raid is
      to support both RAID1 and RAID10, the header files will be included and
      the structure names must not collide.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD RAID10: Fix compiler warning. · 3bbae04b
      Jonathan Brassow committed
      MD RAID10:  Fix compiler warning.
      
      Initialize variable to prevent compiler warning.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 19 Jul, 2012 7 commits
    • raid5: add a per-stripe lock · b17459c0
      Shaohua Li committed
      Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
      lock contention of conf->device_lock.
      
      The stripe's ->toread and ->towrite are protected by the per-stripe lock.
      Access to the stripe's bio lists is always serialized by this lock, so adding
      a bio to the lists (add_stripe_bio()) and removing a bio from the lists (as in
      ops_run_biofill()) do not race.
      
      If the bios in the ->read, ->written ... lists are not shared by multiple
      stripes, we don't need any lock to protect ->read, ->written, because
      STRIPE_ACTIVE will protect them. If the bios are shared, there are two
      protections:
      1. bi_phys_segments acts as a reference count
      2. list traversal uses r5_next_bio, so the traversal never accesses a bio
      not belonging to the stripe
      
      Let's have an example:
      |  stripe1 |  stripe2    |  stripe3  |
      ...bio1......|bio2|bio3|....bio4.....
      
      stripe2 has 4 bios. When it's finished, it decrements bi_phys_segments for
      all of them, but only calls end_bio for bio2 and bio3. bio1->bi_next still
      points to bio2, but this doesn't matter. When stripe1 is finished, it will not
      touch bio2 because of the r5_next_bio check. Later, stripe1 will end_bio bio1
      and stripe3 will end_bio bio4.
      
      Before add_stripe_bio() adds a bio to a stripe, we have already incremented the
      bio's bi_phys_segments, so there is no need to worry about other stripes
      releasing the bio.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
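
The stripe1/stripe2/stripe3 example above can be replayed with a toy reference-count model. This is a sketch under assumptions, not the raid5 code: `Bio.refs` stands in for bi_phys_segments, the set of stripe numbers stands in for the sector range that r5_next_bio checks, and `finish_stripe` is a hypothetical helper.

```python
class Bio:
    def __init__(self, name, stripes):
        self.name = name
        self.stripes = stripes    # stripes this bio overlaps
        self.refs = len(stripes)  # one reference per stripe it was added to


def finish_stripe(stripe, bios, ended):
    """Drop this stripe's reference on each of its bios; end a bio at zero."""
    for bio in bios:
        if stripe not in bio.stripes:
            continue  # loose analogue of the r5_next_bio check
        bio.refs -= 1
        if bio.refs == 0:
            ended.append(bio.name)


# bio1 spans stripe1 and stripe2, bio4 spans stripe2 and stripe3.
bios = [Bio("bio1", {1, 2}), Bio("bio2", {2}), Bio("bio3", {2}), Bio("bio4", {2, 3})]
ended = []
finish_stripe(2, bios, ended)  # ends only bio2 and bio3
```

After stripe2 finishes, bio1 and bio4 each still hold one reference, so they are ended only when stripe1 and stripe3 finish.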
    • raid5: remove unnecessary bitmap write optimization · 7eaf7e8e
      Shaohua Li committed
      Neil pointed out the bitmap write optimization in handle_stripe_clean_event()
      is unnecessary, because one stripe rarely gets written twice in the meantime.
      We can always do a bitmap_startwrite when a write request is added to a stripe
      and a bitmap_endwrite after the write request is done.  Delete the
      optimization. With that, we can remove some uses of device_lock.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: lockless access raid5 overrided bi_phys_segments · e7836bd6
      Shaohua Li committed
      Raid5 overrides bio->bi_phys_segments and accesses it with device_lock held,
      which is unnecessary. We can actually make it lockless.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: reduce chance release_stripe() taking device_lock · 4eb788df
      Shaohua Li committed
      release_stripe() is a place where conf->device_lock is heavily contended. We
      take the lock even when the stripe count isn't 1, which isn't required.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1: close some possible races on write errors during resync · 58e94ae1
      NeilBrown committed
      commit 4367af55
         md/raid1: clear bad-block record when write succeeds.
      
      Added a 'reschedule_retry' call possibility at the end of
      end_sync_write, but didn't add matching code at the end of
      sync_request_write.  So if the writes complete very quickly, or
      scheduling makes it seem that way, then we can miss rescheduling
      the request and the resync could hang.
      
      Also commit 73d5c38a
          md: avoid races when stopping resync.
      
      Fixed a race condition in this same code in end_sync_write but didn't
      make the change in sync_request_write.
      
      This patch updates sync_request_write to fix both of those.
      Patch is suitable for 3.1 and later kernels.
      Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Original-version-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: avoid crash when stopping md array races with closing other open fds. · a05b7ea0
      NeilBrown committed
      md will refuse to stop an array if any other fd (or mounted fs) is
      using it.
      When any fs is unmounted or when the last open fd is closed, all
      pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put)
      so there will be no pending IO to worry about when the array is
      stopped.
      
      However, in order to send the STOP_ARRAY ioctl to stop the array one
      must first get an open fd on the block device.
      If some fd is being used to write to the block device and it is closed
      after mdadm opens the block device, but before mdadm issues the
      STOP_ARRAY ioctl, then there will be no last-close on the md device so
      __blkdev_put will not call sync_blockdev.
      
      If this happens, then IO can still be in-flight while md tears down
      the array and bad things can happen (use-after-free and subsequent
      havoc).
      
      So in the case where do_md_stop is being called from an open file
      descriptor, call sync_blockdev after taking the mutex to ensure there
      will be no new openers.
      
      This is needed when setting a read-write device to read-only too.
      
      Cc: stable@vger.kernel.org
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: fix bug in handling of new_data_offset · 25f7fd47
      NeilBrown committed
      commit c6563a8c
          md: add possibility to change data-offset for devices.
      
      introduced a 'new_data_offset' attribute which should normally
      be the same as 'data_offset', but can be explicitly set to a different
      value to allow a reshape operation to move the data.
      
      Unfortunately when the 'data_offset' is explicitly set through
      sysfs, the new_data_offset is not also set, so the two would become
      out-of-sync incorrectly.
      
      One result of this is that trying to set the 'size' after the
      'data_offset' would fail because it is not permitted to set the size
      when the 'data_offset' and 'new_data_offset' are different - as that
      can be confusing.
      Consequently when mdadm tried to do this while assembling an IMSM
      array it would fail.
      
      This bug was introduced in 3.5-rc1.
      Reported-by: Brian Downing <bdowning@lavos.net>
      Bisected-by: Brian Downing <bdowning@lavos.net>
      Tested-by: Brian Downing <bdowning@lavos.net>
      Signed-off-by: NeilBrown <neilb@suse.de>
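
The failure mode described in this message can be modelled in a few lines. This is a sketch, assuming a simplified device object: the class `Rdev` and its methods are hypothetical stand-ins for the sysfs store handlers, not the md code itself.

```python
class Rdev:
    """Toy model of a member device's offset and size attributes."""

    def __init__(self, data_offset=0):
        self.data_offset = data_offset
        self.new_data_offset = data_offset
        self.sectors = 0

    def set_data_offset(self, value):
        self.data_offset = value
        # The step the buggy code omitted: keep both offsets in sync.
        self.new_data_offset = value

    def set_size(self, sectors):
        # Changing the size is refused while a reshape offset is pending.
        if self.data_offset != self.new_data_offset:
            raise ValueError("cannot set size while offsets differ")
        self.sectors = sectors
```

Without the second assignment in `set_data_offset`, writing 'data_offset' followed by 'size' would raise in this model, which mirrors the mdadm failure when assembling an IMSM array.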
  3. 15 Jul, 2012 8 commits
  4. 14 Jul, 2012 14 commits
  5. 13 Jul, 2012 3 commits