1. 11 Dec 2009, 17 commits
    • dm log: add flush callback fn · 87a8f240
      Committed by Mikulas Patocka
      Introduce a callback pointer from the log to the dm-raid1 layer.
      
      Before a region is marked "in-sync", we need to flush the hardware cache
      on all the disks. But the log module doesn't have access to the
      mirror_set structure, so it will use this callback instead.
      
      So far the callback is unused; it will be wired up in later patches.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      87a8f240
    • dm log: introduce flush_failed variable · 5adc78d0
      Committed by Mikulas Patocka
      Introduce a "flush failed" variable.  When a flush before clearing a bit
      in the log fails, we no longer know which regions are in-sync and which
      are not.
      
      So we need to set all regions as not-in-sync and set the variable
      "flush_failed" to prevent setting the in-sync bit in the future.
      
      A target reload is the only way to get out of this situation.
      
      The variable will be set in the following patches.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      5adc78d0
    • dm log: add flush_header function · 20a34a8e
      Committed by Mikulas Patocka
      Introduce flush_header and use it to flush the log device.
      
      Note that we don't have to flush if all the regions transition
      from the "dirty" to the "clean" state.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      20a34a8e
    • dm raid1: split touched state into two · b09acf1a
      Committed by Mikulas Patocka
      Split the variable "touched" into two, "touched_dirtied" and
      "touched_cleaned", set when a region is dirtied or cleaned, respectively.
      
      This will be used to optimize flushes.
      
      After a transition from the "dirty" to the "clean" state we don't have to
      flush the hardware cache on the log device. After a transition from
      "clean" to "dirty" the cache must be flushed.
      
      Before a transition from the "clean" to the "dirty" state we don't have
      to flush all the raid legs. Before a transition from "dirty" to "clean"
      we must flush all the legs to make sure that they are really in sync.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      b09acf1a
    • dm raid1: support flush · 4184153f
      Committed by Mikulas Patocka
      Add flush support to dm-raid1.
      
      When dm-raid1 receives an empty barrier, it submits the barrier to all
      the devices via dm-io.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      4184153f
    • dm io: remove extra bi_io_vec region hack · f1e53987
      Committed by Mikulas Patocka
      Remove the hack where we allocate an extra bi_io_vec to store additional
      private data.  This hack prevents us from supporting barriers in
      dm-raid1 without first making another little block layer change.
      Instead of doing that, this patch eliminates the bi_io_vec abuse by
      storing the region number directly in the low bits of bi_private.
      
      We need to store two things for each bio, the pointer to the main io
      structure and, if parallel writes were requested, an index indicating
      which of these writes this bio belongs to.  There can be at most
      BITS_PER_LONG regions - 32 or 64.
      
      The index (region number) was stored in the last (hidden) bio vector and
      the pointer to struct io was stored in bi_private.
      
      This patch now aligns "struct io" on a BITS_PER_LONG-byte boundary,
      which leaves enough zeroed low bits in its address to store the region
      number directly in bi_private.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      f1e53987
    • dm io: use slab for struct io · 952b3557
      Committed by Mikulas Patocka
      Allocate "struct io" from a slab.
      
      This patch changes dm-io, so that "struct io" is allocated from a slab cache.
      It used to be allocated with kmalloc.  Allocating from a slab will be
      needed by the next patch, which requires a special alignment of
      "struct io" that kmalloc cannot guarantee.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      952b3557
    • dm crypt: make wipe message also wipe essiv key · 542da317
      Committed by Milan Broz
      The "wipe key" message is used to wipe the volume key from memory
      temporarily, for example when suspending to RAM.
      
      But the initialisation vector in ESSIV mode is calculated from the
      hashed volume key, so the wipe message should wipe this IV key too and
      reinitialise it when the volume key is reinstated.
      
      This patch adds an IV wipe method called from a wipe message callback.
      ESSIV is then reinitialised using the init function added by the
      previous patch.
      
      Cc: stable@kernel.org
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      542da317
    • dm crypt: separate essiv allocation from initialisation · b95bf2d3
      Committed by Milan Broz
      This patch separates the construction of IV from its initialisation.
      (For ESSIV it is a hash calculation based on volume key.)
      
      The constructor code now preallocates the hash tfm and salt array
      and saves them in a private IV structure.
      
      The next patch requires this to reinitialise the wiped IV
      without reallocating memory when resuming a suspended device.
      
      Cc: stable@kernel.org
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      b95bf2d3
    • dm crypt: restructure essiv error path · 5861f1be
      Committed by Milan Broz
      Use kzfree for salt deallocation because it is derived from the volume
      key.  Use a common error path in ESSIV constructor.
      
      Required by a later patch which fixes the way key material is wiped
      from memory.
      
      Cc: stable@kernel.org
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      5861f1be
    • dm crypt: move private iv fields to structs · 60473592
      Committed by Milan Broz
      Define private structures for the IV so that it is easy to add further
      attributes in a following patch which fixes the way key material is
      wiped from memory.  Also move the ESSIV destructor and remove the
      unnecessary 'status' operation.
      
      There are no functional changes in this patch.
      
      Cc: stable@kernel.org
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      60473592
    • dm crypt: make wipe message also wipe tfm key · 0b430958
      Committed by Milan Broz
      The "wipe key" message is used to wipe a volume key from memory
      temporarily, for example when suspending to RAM.
      
      There are two instances of the key in memory (one of them inside the
      crypto tfm) but only one was wiped.  This patch wipes them both.
      
      Cc: stable@kernel.org
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      0b430958
    • dm snapshot: cope with chunk size larger than origin · 8e87b9b8
      Committed by Mikulas Patocka
      Under some special conditions the snapshot hash_size is calculated as
      zero.  This patch instead enforces a minimum value of 64, the same as
      for the pending exception table.
      
      rounddown_pow_of_two(0) is an undefined operation (it expands to a
      shift by -1), and init_exception_table with an argument of 0 would
      fail with -ENOMEM.
      
      The way to trigger the problem is to create a snapshot with a chunk size
      that is larger than the origin device.
      
      Cc: stable@kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      8e87b9b8
    • dm snapshot: only take lock for statustype info not table · 94e76572
      Committed by Mikulas Patocka
      Take snapshot lock only for STATUSTYPE_INFO, not STATUSTYPE_TABLE.
      
      Commit 4c6fff44
      (dm-snapshot-lock-snapshot-while-supplying-status.patch)
      introduced this use of the lock, but userspace applications using
      libdevmapper have been found to request STATUSTYPE_TABLE while the device
      is suspended and the lock is already held, leading to deadlock.  Since
      the lock is not necessary in this case, don't try to take it.
      
      Cc: stable@kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      94e76572
    • dm: sysfs add empty release function to avoid debug warning · d2bb7df8
      Committed by Milan Broz
      This patch just removes an unnecessary warning:
       kobject: 'dm': does not have a release() function,
       it is broken and must be fixed.
      
      The kobject is embedded in the mapped device structure, so the code
      does not need to release any memory explicitly here.
      
      Cc: stable@kernel.org
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      d2bb7df8
    • dm exception store: free tmp_store on persistent flag error · 613978f8
      Committed by Julia Lawall
      Error handling code following a kmalloc should free the allocated data.
      
      Cc: stable@kernel.org
      Signed-off-by: Julia Lawall <julia@diku.dk>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      613978f8
    • dm: avoid _hash_lock deadlock · 6076905b
      Committed by Mikulas Patocka
      Fix a reported deadlock if there are still unprocessed multipath events
      on a device that is being removed.
      
      _hash_lock is held during dev_remove while trying to send the
      outstanding events.  Sending the events acquires _hash_lock
      again in dm_copy_name_and_uuid.
      
      This patch introduces a separate lock around regions that modify the
      link to the hash table (dm_set_mdptr) or the name or uuid so that
      dm_copy_name_and_uuid no longer needs _hash_lock.
      
      Additionally, dm_copy_name_and_uuid can only be called if md exists
      so we can drop the dm_get() and dm_put() which can lead to a BUG()
      while md is being freed.
      
      The deadlock:
       #0 [ffff8106298dfb48] schedule at ffffffff80063035
       #1 [ffff8106298dfc20] __down_read at ffffffff8006475d
       #2 [ffff8106298dfc60] dm_copy_name_and_uuid at ffffffff8824f740
       #3 [ffff8106298dfc90] dm_send_uevents at ffffffff88252685
       #4 [ffff8106298dfcd0] event_callback at ffffffff8824c678
       #5 [ffff8106298dfd00] dm_table_event at ffffffff8824dd01
       #6 [ffff8106298dfd10] __hash_remove at ffffffff882507ad
       #7 [ffff8106298dfd30] dev_remove at ffffffff88250865
       #8 [ffff8106298dfd60] ctl_ioctl at ffffffff88250d80
       #9 [ffff8106298dfee0] do_ioctl at ffffffff800418c4
      #10 [ffff8106298dff00] vfs_ioctl at ffffffff8002fab9
      #11 [ffff8106298dff40] sys_ioctl at ffffffff8004bdaf
      #12 [ffff8106298dff80] tracesys at ffffffff8005d28d (via system_call)
      
      Cc: stable@kernel.org
      Reported-by: guy keren <choo@actcom.co.il>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      6076905b
  2. 05 Dec 2009, 1 commit
  3. 01 Dec 2009, 1 commit
    • md: revert incorrect fix for read error handling in raid1. · d0e26078
      Committed by NeilBrown
      Commit 4706b349 was a forward port of a fix that was needed
      for SLES10.  But in fact it is not needed in mainline because
      the earlier commit dd00a99e fixes the same problem in a
      better way.
      Further, that commit introduced a bug in the way it interacts with
      automatic read-error correction: if, after a read error is
      successfully corrected, the same disk is chosen for the re-read, the
      re-read won't be attempted and an error will be returned instead.
      
      After reverting that commit, there is the possibility that a
      read error on a read-only array (where read errors cannot
      be corrected as that requires a write) will repeatedly read the same
      device and continue to get an error.
      So in the "Array is readonly" case, fail the drive immediately on
      a read error.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
      d0e26078
  4. 19 Nov 2009, 1 commit
  5. 13 Nov 2009, 3 commits
    • md/raid5: Allow dirty-degraded arrays to be assembled when only parity is degraded. · c148ffdc
      Committed by NeilBrown
      Normally it is not safe to allow a raid5 that is both dirty and
      degraded to be assembled without an explicit request from the admin,
      as it can cause hidden data corruption.
      This is because 'dirty' means that the parity cannot be trusted, and
      'degraded' means that the parity needs to be used.
      
      However, if the device that is missing contains only parity, then
      there is no issue and assembly can continue.
      This particularly applies when a RAID5 is being converted to a RAID6
      and there is an unclean shutdown while the conversion is happening.
      
      So check whether the degraded space contains only parity, and
      in that case allow the assembly.
      Signed-off-by: NeilBrown <neilb@suse.de>
      c148ffdc
    • Don't unconditionally set in_sync on newly added device in raid5_reshape · 7ef90146
      Committed by NeilBrown
      When a reshape finds that it can add spare devices into the array,
      those devices might already be 'in_sync' if they are beyond the old
      size of the array, or they might not if they are within the array.
      
      The first case happens when we change an N-drive RAID5 to an
      N+1-drive RAID5.
      The second happens when we convert an N-drive RAID5 to an
      N+1-drive RAID6.
      
      So set the flag more carefully.
      Also, ->recovery_offset is only meaningful when the flag is clear,
      so only set it in that case.
      
      This change needs the preceding two to ensure that the non-in_sync
      device doesn't get evicted from the array when it is stopped, in the
      case where v0.90 metadata is used.
      Signed-off-by: NeilBrown <neilb@suse.de>
      7ef90146
    • md: allow v0.91 metadata to record devices as being active but not in-sync. · 0261cd9f
      Committed by NeilBrown
      This is a combination that didn't really make sense before.
      However, when a reshape is converting e.g. raid5 -> raid6, the extra
      device is not fully in-sync, but is certainly active and contains
      important data.
      So allow that state to be meaningful and, in particular, take
      the 'recovery_offset' value (which is needed for any non-in-sync
      active device) from the reshape_position.
      Signed-off-by: NeilBrown <neilb@suse.de>
      0261cd9f
  6. 12 Nov 2009, 2 commits
    • sysctl drivers: Remove dead binary sysctl support · 894d2491
      Committed by Eric W. Biederman
      Now that sys_sysctl is a wrapper around /proc/sys, all of the
      binary sysctl support elsewhere in the tree is dead code.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Corey Minyard <minyard@acm.org>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Neil Brown <neilb@suse.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@suse.de>
      Acked-by: Clemens Ladisch <clemens@ladisch.de> for drivers/char/hpet.c
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      894d2491
    • md: factor out updating of 'recovery_offset'. · 5e865106
      Committed by NeilBrown
      Each device has its own 'recovery_offset' showing how far
      recovery has progressed on the device.
      As the only real significance of this is the fact that it can
      be stored in the metadata and recovered at restart, and as
      only 1.x metadata can do this, we were only updating
      'recovery_offset' to 'curr_resync_completed' when updating
      v1.x metadata.
      But this is wrong, and we will shortly make limited use of this
      field in v0.90 metadata.
      
      So move the update into common code.
      Signed-off-by: NeilBrown <neilb@suse.de>
      5e865106
  7. 09 Nov 2009, 1 commit
  8. 06 Nov 2009, 2 commits
    • md/raid5: make sure curr_sync_completes is uptodate when reshape starts · 8dee7211
      Committed by NeilBrown
      This value is visible through sysfs and is used by mdadm
      when it manages a reshape (backing up data that is about to be
      rearranged).  So it is important that it is always correct.
      Currently it does not get updated properly when a reshape
      starts, which can cause problems when assembling an array
      that is in the middle of being reshaped.
      
      This is suitable for 2.6.31.y stable kernels.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      8dee7211
    • md: don't clear endpoint for resync when resync is interrupted. · 24395a85
      Committed by NeilBrown
      If a 'sync_max' has been set (via sysfs), it is wrong to clear it
      until a resync (or reshape or recovery ...) has actually reached
      that point.
      So if a resync is interrupted (e.g. by device failure),
      leave 'resync_max' unchanged.
      
      This is particularly important for 'reshape' operations that do not
      change the size of the array.  For such operations mdadm needs to
      monitor the reshape taking rolling backups of the section being
      reshaped.  If resync_max gets cleared, the reshape can get ahead of
      mdadm and then the backups that mdadm creates are useless.
      
      This is suitable for 2.6.31.y stable kernels.
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      24395a85
  9. 20 Oct 2009, 1 commit
  10. 17 Oct 2009, 10 commits
  11. 16 Oct 2009, 1 commit
    • md/async: don't pass a memory pointer as a page pointer. · 5dd33c9a
      Committed by NeilBrown
      md/raid6 passes a list of 'struct page *' to the async_tx routines,
      which then either DMA map them for offload, or take the page_address
      for CPU based calculations.
      
      For RAID6 we sometimes leave 'blanks' in the list of pages.
      For CPU based calcs, we want to treat these as a page of zeros.
      For offloaded calculations, we simply don't pass a page to the
      hardware.
      
      Currently the 'blanks' are encoded as a pointer to
      raid6_empty_zero_page.  This is a 4096 byte memory region, not a
      'struct page'.  This is mostly handled correctly but is rather ugly.
      
      So change the code to pass and expect a NULL pointer for the blanks.
      When taking the page_address of a page, we need to check for NULL and
      in that case use raid6_empty_zero_page.
      Signed-off-by: NeilBrown <neilb@suse.de>
      5dd33c9a