1. 22 May, 2012 25 commits
    • md/bitmap: create a 'struct bitmap_counts' substructure of 'struct bitmap' · 40cffcc0
      Committed by NeilBrown
      The new "struct bitmap_counts" contains all the fields that are
      related to counting the number of active writes in each bitmap chunk.
      
      Having this separate will make it easier to change the chunksize
      or overall size of a bitmap atomically.
      Signed-off-by: NeilBrown <neilb@suse.de>
      40cffcc0
    • md/bitmap: make bitmap bitops atomic. · 63c68268
      Committed by NeilBrown
      This allows us to remove spinlock protection which is
      more heavy-weight than simple atomics.
      Signed-off-by: NeilBrown <neilb@suse.de>
      63c68268
    • md/bitmap: make _page_attr bitops atomic. · bdfd1140
      Committed by NeilBrown
      Using e.g. set_bit instead of __set_bit and using test_and_clear_bit
      allows us to remove some locking and contract other locked ranges.
      
      It is rare that we set or clear a lot of these bits, so the gain should
      outweigh any cost.
      Signed-off-by: NeilBrown <neilb@suse.de>
      bdfd1140
    • md/bitmap: merge bitmap_file_unmap and bitmap_file_put. · fae7d326
      Committed by NeilBrown
      These functions really do one thing together: release the
      'bitmap_storage'.  So make them just one function.
      
      Since we removed the locking (previous patch), we don't need to zero
      any fields before freeing them, so it all becomes a bit simpler.
      Signed-off-by: NeilBrown <neilb@suse.de>
      fae7d326
    • md/bitmap: remove async freeing of bitmap file. · 62f82faa
      Committed by NeilBrown
      There is no real value in freeing things the moment there is an error.
      It is just as good to free the bitmap file and pages when the bitmap
      is explicitly removed (and replaced?) or at shutdown.
      
      With this gone, the bitmap will only disappear when the array is
      quiescent, so we can remove some locking.
      
      As the 'filemap' doesn't disappear now, include extra checks before
      trying to write any of it out.
      Also remove the check for "has it disappeared" in
      bitmap_daemon_write().
      Signed-off-by: NeilBrown <neilb@suse.de>
      62f82faa
    • md/bitmap: convert some spin_lock_irqsave to spin_lock_irq · 74667123
      Committed by NeilBrown
      All of these sites can only be called from process context with
      irqs enabled, so using irqsave/irqrestore just adds noise.
      Remove it.
      Signed-off-by: NeilBrown <neilb@suse.de>
      74667123
    • md/bitmap: use set_bit, test_bit, etc for operation on bitmap->flags. · b405fe91
      Committed by NeilBrown
      We currently use '&' and '|' which isn't the norm in the kernel
      and doesn't allow easy atomicity.
      So change to bit numbers and {set,clear,test}_bit.
      This allows us to remove a spinlock/unlock (which was dubious anyway)
      and some other simplifications.
      Signed-off-by: NeilBrown <neilb@suse.de>
      b405fe91
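The move from '&' and '|' on a flags word to bit numbers with {set,clear,test}_bit can be sketched in portable C11. This is only an illustration, assuming C11 `<stdatomic.h>` as a stand-in for the kernel's bit helpers; the `demo_*` names and the flag numbers are hypothetical and are not the identifiers used in md.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit numbers (not the real md names or values). */
enum demo_flag { DEMO_STALE = 1, DEMO_WRITE_ERROR = 2 };

/* Atomically set bit 'nr' in *flags; no external lock is needed. */
static inline void demo_set_bit(int nr, atomic_ulong *flags)
{
    atomic_fetch_or(flags, 1UL << nr);
}

/* Atomically clear bit 'nr' in *flags. */
static inline void demo_clear_bit(int nr, atomic_ulong *flags)
{
    atomic_fetch_and(flags, ~(1UL << nr));
}

/* Read bit 'nr' from a single atomic load of the flags word. */
static inline bool demo_test_bit(int nr, atomic_ulong *flags)
{
    return ((atomic_load(flags) >> nr) & 1UL) != 0;
}

/* Clear bit 'nr' and report whether it was set beforehand, in one step. */
static inline bool demo_test_and_clear_bit(int nr, atomic_ulong *flags)
{
    unsigned long mask = 1UL << nr;
    return (atomic_fetch_and(flags, ~mask) & mask) != 0;
}
```

Because each helper is a single atomic read-modify-write, two callers touching different bits of the same word cannot lose each other's update, which is the property that lets a surrounding spinlock be dropped.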
    • md/bitmap: remove single-bit manipulation on sb->state · 84e92345
      Committed by NeilBrown
      Just do single-bit manipulations on bitmap->flags and copy whole
      value between that and sb->state.
      
      This will allow the next patch, which changes how bit manipulations
      are performed on bitmap->flags.
      
      This does result in BITMAP_STALE not being set in sb by
      bitmap_read_sb, however as the setting is determined by other
      information in the 'sb' we do not lose information this way.
      Normally, bitmap_load will be called shortly which will clear
      BITMAP_STALE anyway.
      Signed-off-by: NeilBrown <neilb@suse.de>
      84e92345
    • md/bitmap: remove bitmap_mask_state · edbb79df
      Committed by NeilBrown
      This function isn't really needed.  It sets or clears a flag in both
      bitmap->flags and sb->state.
      However both times it is called, bitmap_update_sb is called soon
      afterwards which copies bitmap->flags to sb->state.
      So just make changes to bitmap->flags, and open-code those rather than
      hiding in a function.
      Signed-off-by: NeilBrown <neilb@suse.de>
      edbb79df
    • md/bitmap: move storage allocation from bitmap_load to bitmap_create. · bc9891a8
      Committed by NeilBrown
      We should allocate memory for the storage-bitmap at create-time, not
      load time.
      Signed-off-by: NeilBrown <neilb@suse.de>
      bc9891a8
    • md/bitmap: separate bitmap file allocation to its own function. · d1244cb0
      Committed by NeilBrown
      This will allow allocation before swapping in a new bitmap.
      Signed-off-by: NeilBrown <neilb@suse.de>
      d1244cb0
    • md/bitmap: store bytes in file rather than just in last page. · 9b1215c1
      Committed by NeilBrown
      This number is more generally useful, and bytes-in-last-page is
      easily extracted from it.
      Signed-off-by: NeilBrown <neilb@suse.de>
      9b1215c1
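How bytes-in-last-page can be extracted from the total byte count is shown in the small sketch below. It is an illustration only; the function names are hypothetical and the page size is passed in as a parameter rather than taken from the kernel's PAGE_SIZE.

```c
#include <stddef.h>

/* Number of pages needed to hold 'file_bytes' bytes (rounded up). */
static size_t demo_page_count(size_t file_bytes, size_t page_size)
{
    return (file_bytes + page_size - 1) / page_size;
}

/*
 * Bytes used in the final page. A file that ends exactly on a page
 * boundary uses the whole last page; an empty file uses none.
 */
static size_t demo_bytes_in_last_page(size_t file_bytes, size_t page_size)
{
    size_t rem = file_bytes % page_size;

    if (file_bytes == 0)
        return 0;
    return rem ? rem : page_size;
}
```

The reverse derivation is not possible without also knowing the page count, which is one way to see why the total is the more generally useful value to store.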
    • md/bitmap: move some fields of 'struct bitmap' into a 'storage' substruct. · 1ec885cd
      Committed by NeilBrown
      This new 'struct bitmap_storage' reflects the external storage of the
      bitmap.
      Having this clearly defined will make it easier to change the storage
      used while the array is active.
      Signed-off-by: NeilBrown <neilb@suse.de>
      1ec885cd
    • md/bitmap: change *_page_attr() to take a page number, not a page. · d189122d
      Committed by NeilBrown
      Most often we have the page number, not the page.  And that is what
      the *_page_attr() functions really want.  So change the arguments to
      take that number.
      Signed-off-by: NeilBrown <neilb@suse.de>
      d189122d
    • md/bitmap: centralise allocation of bitmap file pages. · 27581e5a
      Committed by NeilBrown
      Instead of allocating pages in read_sb_page, read_page and
      bitmap_read_sb, allocate them all in bitmap_init_from_disk.
      
      Also replace the hack of calling "attach_page_buffers(page, NULL)" to
      ensure that free_buffer() won't complain, by putting a test for
      PagePrivate in free_buffer().
      Signed-off-by: NeilBrown <neilb@suse.de>
      27581e5a
    • md/bitmap: allow a bitmap with no backing storage. · ef99bf48
      Committed by NeilBrown
      An md bitmap comprises two parts
       - internal counting of active writes per 'chunk'.
       - external storage of whether there are any active writes on
         each chunk
      
      The second requires the first, but the first doesn't require the
      second.
      
      Not having backing storage means that the bitmap cannot expedite
      resync after a crash, but it still allows us to expedite the recovery
      of a recently-removed device.
      
      So: allow a bitmap to exist even if there is no backing device.
      In that case we default to 128M chunks.
      
      A particular value of this is that we can remove and re-add a bitmap
      (possibly of a different granularity) on a degraded array, and not
      lose the information needed to fast-recover the missing device.
      
      We don't actually activate these bitmaps yet - that will come
      in a later patch.
      Signed-off-by: NeilBrown <neilb@suse.de>
      ef99bf48
    • md/bitmap: add new 'space' attribute for bitmaps. · 6409bb05
      Committed by NeilBrown
      If we are to allow bitmaps to be resized when the array is resized,
      we need to know how much space there is.
      
      So create an attribute to store this information and set appropriate
      defaults.
      
      It can be set more precisely via sysfs, or future metadata extensions
      may allow it to be recorded.
      Signed-off-by: NeilBrown <neilb@suse.de>
      6409bb05
    • md/bitmap: disentangle two different 'pending' flags. · bf07bb7d
      Committed by NeilBrown
      There are two different 'pending' concepts in the handling of the
      write intent bitmap.
      
      Firstly, a 'page' from the bitmap (which contains PAGE_SIZE*8 bits)
      may have changes (bits cleared) that should be written in due course.
      There is no hurry for these and the page will transition from
      PENDING to NEEDWRITE and will then be written, though if it ever
      becomes DIRTY it will be written much sooner and PENDING will be
      cleared.
      
      Secondly, a page of counters - which contains PAGE_SIZE/2 counters, one
      for each bit, can usefully have a 'pending' flag which indicates if
      any of the counters are low (2 or 1) and ready to be processed by
      bitmap_daemon_work().  If this flag is clear we can skip the whole
      page.
      
      These two concepts are currently combined in the bitmap-file flag.
      This causes a tighter connection between the counters and the bitmap
      file than I would like - as I want to add some flexibility to the
      bitmap file.
      
      So introduce a new flag with the page-of-counters, and rewrite
      bitmap_daemon_work() so that it handles the two different 'pending'
      concepts separately.
      
      This also allows us to clear BITMAP_PAGE_PENDING when we write out
      a dirty page, which may occasionally reduce the number of times we
      write a page.
      Signed-off-by: NeilBrown <neilb@suse.de>
      bf07bb7d
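The second 'pending' concept, a per-page flag on a page of counters that lets the daemon skip the whole page, might look roughly like the sketch below. It is a simplified illustration, not the md implementation: it assumes a 4096-byte page of plain 16-bit counters (so PAGE_SIZE/2 = 2048 counters), ignores any flag bits the real counters may carry, and uses hypothetical `demo_*` names throughout.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4096-byte page, 2-byte counters, hence 2048 per page. */
#define DEMO_COUNTERS_PER_PAGE 2048

struct demo_counter_page {
    uint16_t counter[DEMO_COUNTERS_PER_PAGE];
    bool pending;   /* true if any counter is low (1 or 2) */
};

/* Return true if any counter in the page is low, i.e. equal to 1 or 2. */
static bool demo_page_has_low_counters(const struct demo_counter_page *p)
{
    for (size_t i = 0; i < DEMO_COUNTERS_PER_PAGE; i++)
        if (p->counter[i] == 1 || p->counter[i] == 2)
            return true;
    return false;
}

/* Recompute the per-page pending flag for every page. */
static void demo_refresh_pending(struct demo_counter_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pages[i].pending = demo_page_has_low_counters(&pages[i]);
}

/* Count the pages a daemon pass would visit; pages without the flag are skipped. */
static size_t demo_pages_to_scan(const struct demo_counter_page *pages, size_t n)
{
    size_t visit = 0;

    for (size_t i = 0; i < n; i++)
        if (pages[i].pending)
            visit++;
    return visit;
}
```

Keeping this flag with the counters, rather than in the bitmap-file page attributes, is what decouples the counter scan from how (or whether) the bitmap file is stored.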
    • raid5: support sync request · bc0934f0
      Committed by Shaohua Li
      REQ_SYNC is ignored in the current raid5 code. The block layer does use it
      to make policy decisions, for example in the I/O scheduler. This patch adds
      support for it.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      bc0934f0
    • raid5: remove unused variables · cceeca43
      Committed by Shaohua Li
      The two variables are useless.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      cceeca43
    • md/raid10: Fix memleak in r10buf_pool_alloc · 5fdd2cf8
      Committed by majianpeng
      If the allocation of rep1_bio fails, we currently don't free the 'bio'
      of the same dev.
      
      Reported by kmemleak.
      Signed-off-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      5fdd2cf8
    • md/raid1: allow fix_read_error to read from recovering device. · da8840a7
      Committed by majianpeng
      When attempting to fix a read error, it is acceptable to read from a
      device that is recovering, provided the recovery has got past the
      place we are reading from.  This makes the test for "can we read from
      here" the same as the test in read_balance.
      Signed-off-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      da8840a7
    • md: move freeing of badblocks.page into md_rdev_clear · 4fa2f327
      Committed by NeilBrown
      This ensures that it is always freed - there were cases where
      we failed to free the page.
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      4fa2f327
    • md: dm-raid should call helper function to clear rdev. · 545c8795
      Committed by NeilBrown
      dm-raid currently open-codes the freeing of some members of
      an rdev.  It is more maintainable to have it call common code
      from md.c which does this for all call-sites.
      
      So rename free_disk_sb to md_rdev_clear, export it, and use it in
      dm-raid.c
      Signed-off-by: NeilBrown <neilb@suse.de>
      545c8795
    • md/raid10: add reshape support · 3ea7daa5
      Committed by NeilBrown
      A 'near' or 'offset' layout RAID10 array can be reshaped to a different
      'near' or 'offset' layout, a different chunk size, and a different
      number of devices.
      However the number of copies cannot change.
      
      Unlike RAID5/6, we do not support having user-space backup data that
      is being relocated during a 'critical section'.  Rather, the
      data_offset of each device must change so that when writing any block
      to a new location, it will not over-write any data that is still
      'live'.
      
      This means that RAID10 reshape is not supportable on v0.90 metadata.
      
      The difference between the old data_offset and the new_offset must be
      at least the larger of the chunksize multiplied by offset copies of
      each of the old and new layout. (for 'near' mode, offset_copies == 1).
      
      A larger difference of around 64M seems useful for in-place reshapes
      as more data can be moved between metadata updates.
      Very large differences (e.g. 512M) seem to slow the process down due
      to lots of long seeks (on oldish consumer-grade devices at least).
      
      Metadata needs to be updated whenever the place we are about to write
      to is considered - by the current metadata - to still contain data in
      the old layout.
      
      [unbalanced locking fix from Dan Carpenter <dan.carpenter@oracle.com>]
      Signed-off-by: NeilBrown <neilb@suse.de>
      3ea7daa5
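The lower bound on the data_offset change stated in this commit message can be written out as a small calculation. The sketch below is an illustration of that one sentence only, with hypothetical names and sizes expressed in sectors; it is not code from raid10.c.

```c
#include <stdint.h>

/* Hypothetical description of one layout, as far as this bound needs it. */
struct demo_layout {
    uint64_t chunk_sectors;     /* chunk size in sectors */
    unsigned int offset_copies; /* 1 for a 'near' layout */
};

/*
 * Minimum distance between the old data_offset and the new one:
 * the larger of (chunk size * offset copies) for the old and new layouts.
 */
static uint64_t demo_min_offset_diff(struct demo_layout old_geo,
                                     struct demo_layout new_geo)
{
    uint64_t old_need = old_geo.chunk_sectors * old_geo.offset_copies;
    uint64_t new_need = new_geo.chunk_sectors * new_geo.offset_copies;

    return old_need > new_need ? old_need : new_need;
}
```

For example, going from a 'near' layout with 128-sector chunks to an 'offset' layout with 1024-sector chunks and 2 offset copies requires a difference of at least 2048 sectors under this rule; the message notes that a much larger gap (around 64M) is useful in practice.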
  2. 21 May, 2012 10 commits
    • md/raid10: split out interpretation of layout to separate function. · deb200d0
      Committed by NeilBrown
      We will soon be interpreting the layout (and chunksize etc) from
      multiple places to support reshape.  So split it out into separate
      function.
      Signed-off-by: NeilBrown <neilb@suse.de>
      deb200d0
    • md/raid10: Introduce 'prev' geometry to support reshape. · f8c9e74f
      Committed by NeilBrown
      When RAID10 supports reshape it will need a 'previous' and a 'current'
      geometry, so introduce that here.
      Use the 'prev' geometry when before the reshape_position, and the
      current 'geo' when beyond it.  At other times, use both as
      appropriate.
      
      For now, both are identical (and reshape_position is never set).
      
      When we use the 'prev' geometry, we must use the old data_offset.
      When we use the current (and a reshape is happening) we must use
      the new_data_offset.
      Signed-off-by: NeilBrown <neilb@suse.de>
      f8c9e74f
    • md: use resync_max_sectors for reshape as well as resync. · c804cdec
      Committed by NeilBrown
      Some resync type operations need to act on the address space of the
      device, others on the address space of the array.
      
      This only affects RAID10, so it sets resync_max_sectors to the array
      size (it defaults to the device size), and that is currently used for
      resync only.  However reshape of a RAID10 must be done against the
      array size, not device size, so change code to use resync_max_sectors
      for both the resync and the reshape cases.
      This does not affect RAID5 or RAID1, just RAID10.
      Signed-off-by: NeilBrown <neilb@suse.de>
      c804cdec
    • md: teach sync_page_io about new_data_offset. · 1fdd6fc9
      Committed by NeilBrown
      Some code in raid1 and raid10 uses sync_page_io to
      read/write pages when responding to read errors.
      As we will shortly support changing data_offset for
      raid10, this function must understand new_data_offset.
      
      So add that understanding.
      Signed-off-by: NeilBrown <neilb@suse.de>
      1fdd6fc9
    • md/raid10: collect some geometry fields into a dedicated structure. · 5cf00fcd
      Committed by NeilBrown
      We will shortly be adding reshape support for RAID10 which will
      require it having 2 concurrent geometries (before and after).
      To make that easier, collect most geometry fields into 'struct geom'
      and access them from there.  Then we will more easily be able to add
      a second set of fields.
      
      Note that 'copies' is not in this struct and so cannot be changed.
      There is little need to change this number and doing so is a lot
      more difficult as it requires reallocating more things.
      So leave it out for now.
      Signed-off-by: NeilBrown <neilb@suse.de>
      5cf00fcd
    • md/raid5: allow for change in data_offset while managing a reshape. · b5254dd5
      Committed by NeilBrown
      The important issue here is incorporating the difference in data_offset
      into calculations concerning when we might need to over-write data
      that is still thought to be valid.
      
      To this end we find the minimum offset difference across all devices
      and add that where appropriate.
      Signed-off-by: NeilBrown <neilb@suse.de>
      b5254dd5
    • md/raid5: Use correct data_offset for all IO. · 05616be5
      Committed by NeilBrown
      As there can now be two different data_offsets - an 'old' and
      a 'new' - we need to carefully choose between them.
      Signed-off-by: NeilBrown <neilb@suse.de>
      05616be5
    • md: add possibility to change data-offset for devices. · c6563a8c
      Committed by NeilBrown
      When reshaping we can avoid costly intermediate backup by
      changing the 'start' address of the array on the device
      (if there is enough room).
      
      So as a first step, allow such a change to be requested
      through sysfs, and recorded in v1.x metadata.
      
      (As we didn't previously check that all 'pad' fields were zero,
       we need a new FEATURE flag for this.
       And (belatedly) check that all remaining 'pad' fields are
       zero to avoid a repeat of this)
      
      The new data offset must be requested separately for each device.
      This allows each to have a different change in the data offset.
      This is not likely to be used often but as data_offset can be
      set per-device, new_data_offset should be too.
      
      This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
      it is never used and never will be.  At the same time we add a new
      arg ('in_new') which is currently always zero but will be used more
      soon.
      
      When a reshape finishes we will need to update the data_offset
      and rdev->sectors.  So provide an exported function to do that.
      Signed-off-by: NeilBrown <neilb@suse.de>
      c6563a8c
    • md: allow a reshape operation to be reversed. · 2c810cdd
      Committed by NeilBrown
      Currently a reshape operation always progresses from the start
      of the array to the end unless the number of devices is being
      reduced, in which case it progresses in the opposite direction.
      
      To reverse a partial reshape which changes the number of devices
      you can stop the array and re-assemble with the raid-disks numbers
      reversed and it will undo.
      
      However for a reshape that does not change the number of devices
      it is not possible to reverse the reshape in the middle - you have to
      wait until it completes.
      
      So add a 'reshape_direction' attribute which is either 'forwards' or
      'backwards' and can be explicitly set when delta_disks is zero.
      
      This will become more important when we allow the data_offset to
      change in a reshape.  Then the explicit statement of what direction is
      being used will be more useful.
      
      This can be enabled in raid5 trivially as it already supports
      reverse reshape and just needs to use a different trigger to request it.
      Signed-off-by: NeilBrown <neilb@suse.de>
      2c810cdd
    • md: using GFP_NOIO to allocate bio for flush request · b5e1b8ce
      Committed by Shaohua Li
      A flush request is usually issued in transaction commit code path, so
      using GFP_KERNEL to allocate memory for flush request bio falls into
      the classic deadlock issue.
      
      This is suitable for any -stable kernel to which it applies as it
      avoids a possible deadlock.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      b5e1b8ce
  3. 19 May, 2012 1 commit
    • md/raid10: fix transcription error in calc_sectors conversion. · b0d634d5
      Committed by NeilBrown
      The old code was
      		sector_div(stride, fc);
      the new code was
      		sector_div(size, conf->near_copies);
      
      'size' is right (the stride variable wasn't really needed), but
      'fc' means 'far_copies', and that is an important difference.
      
      Signed-off-by: NeilBrown <neilb@suse.de>       
      b0d634d5
  4. 17 May, 2012 2 commits
    • MD: Add del_timer_sync to mddev_suspend (fix nasty panic) · 0d9f4f13
      Committed by Jonathan Brassow
      Use del_timer_sync to remove timer before mddev_suspend finishes.
      
      We don't want a timer going off after an mddev_suspend is called.  This is
      especially true with device-mapper, since it can call the destructor function
      immediately following a suspend.  This results in the removal (kfree) of the
      structures upon which the timer depends - resulting in a very ugly panic.
      Therefore, we add a del_timer_sync to mddev_suspend to prevent this.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      0d9f4f13
    • md/raid10: set dev_sectors properly when resizing devices in array. · 6508fdbf
      Committed by NeilBrown
      raid10 stores dev_sectors in 'conf' separately from the one in
      'mddev' because it can have a very significant effect on block
      addressing and so needs to be updated carefully.
      
      However raid10_resize isn't updating it at all!
      
      To update it correctly, we need to make sure it is a proper
      multiple of the chunksize taking various details of the layout
      into account.
      This calculation is currently done in setup_conf.   So split it
      out from there and call it from raid10_resize as well.
      Then set conf->dev_sectors properly.
      Signed-off-by: NeilBrown <neilb@suse.de>
      6508fdbf
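The core of "make sure it is a proper multiple of the chunksize" is a round-down, sketched below. This is a deliberately reduced illustration with a hypothetical function name: the real calculation, as the message says, also takes various details of the layout into account, and those are not modelled here.

```c
#include <stdint.h>

/*
 * Round a per-device size down to a whole number of chunks.
 * Both values are in sectors; chunk_sectors must be non-zero.
 */
static uint64_t demo_round_down_to_chunk(uint64_t dev_sectors,
                                         uint64_t chunk_sectors)
{
    return dev_sectors - (dev_sectors % chunk_sectors);
}
```

A resize path that stores the raw new size without this step would leave a partial chunk at the end of each device inside the addressable range, which is the kind of mismatch the commit is guarding against.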
  5. 04 May, 2012 1 commit
  6. 24 April, 2012 1 commit
    • md: fix possible corruption of array metadata on shutdown. · 30b8aa91
      Committed by NeilBrown
      commit c744a65c
        md: don't set md arrays to readonly on shutdown.
      
      removed the possibility of a 'BUG' when data is written to an array
      that has just been switched to read-only, but also introduced the
      possibility that the array metadata could be corrupted.
      
      If, when md_notify_reboot gets the mddev lock, the array is
      in a state where it is assembled but hasn't been started (as can
      happen if the personality module is not available, or in other unusual
      situations), then incorrect metadata will be written out making it
      impossible to re-assemble the array.
      
      So only call __md_stop_writes() if the array has actually been
      activated.
      
      This patch is needed for any stable kernel which has had the above
      commit applied.
      
      Cc: stable@vger.kernel.org
      Reported-by: Christoph Nelles <evilazrael@evilazrael.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
      30b8aa91