1. 17 3月, 2009 5 次提交
    • M
      dm crypt: wait for endio to complete before destruction · b35f8caa
      Milan Broz 提交于
      The following oops has been reported when dm-crypt runs over a loop device.
      
      ...
      [   70.381058] Process loop0 (pid: 4268, ti=cf3b2000 task=cf1cc1f0 task.ti=cf3b2000)
      ...
      [   70.381058] Call Trace:
      [   70.381058]  [<d0d76601>] ? crypt_dec_pending+0x5e/0x62 [dm_crypt]
      [   70.381058]  [<d0d767b8>] ? crypt_endio+0xa2/0xaa [dm_crypt]
      [   70.381058]  [<d0d76716>] ? crypt_endio+0x0/0xaa [dm_crypt]
      [   70.381058]  [<c01a2f24>] ? bio_endio+0x2b/0x2e
      [   70.381058]  [<d0806530>] ? dec_pending+0x224/0x23b [dm_mod]
      [   70.381058]  [<d08066e4>] ? clone_endio+0x79/0xa4 [dm_mod]
      [   70.381058]  [<d080666b>] ? clone_endio+0x0/0xa4 [dm_mod]
      [   70.381058]  [<c01a2f24>] ? bio_endio+0x2b/0x2e
      [   70.381058]  [<c02bad86>] ? loop_thread+0x380/0x3b7
      [   70.381058]  [<c02ba8a1>] ? do_lo_send_aops+0x0/0x165
      [   70.381058]  [<c013754f>] ? autoremove_wake_function+0x0/0x33
      [   70.381058]  [<c02baa06>] ? loop_thread+0x0/0x3b7
      
      When a table is being replaced, it waits for I/O to complete
      before destroying the mempool, but the endio function doesn't
      call mempool_free() until after completing the bio.
      
      Fix it by swapping the order of those two operations.
      
      The same problem occurs in dm.c with md referenced after dec_pending.
      Again, we swap the order.
      
      Cc: stable@kernel.org
      Signed-off-by: NMilan Broz <mbroz@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      b35f8caa
    • H
      dm crypt: fix kcryptd_async_done parameter · b2174eeb
      Huang Ying 提交于
      In the async encryption-complete function (kcryptd_async_done), the
      crypto_async_request passed in may be different from the one passed to
      crypto_ablkcipher_encrypt/decrypt.  Only crypto_async_request->data is
      guaranteed to be same as the one passed in.  The current
      kcryptd_async_done uses the passed-in crypto_async_request directly
      which may cause the AES-NI-based AES algorithm implementation to panic.
      
      This patch fixes this bug by only using crypto_async_request->data,
      which points to dm_crypt_request, the crypto_async_request passed in.
      The original data (convert_context) is gotten from dm_crypt_request.
      
      [mbroz@redhat.com: reworked]
      Cc: stable@kernel.org
      Signed-off-by: NHuang Ying <ying.huang@intel.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NMilan Broz <mbroz@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      b2174eeb
    • M
      dm io: respect BIO_MAX_PAGES limit · d659e6cc
      Mikulas Patocka 提交于
      dm-io calls bio_get_nr_vecs to get the maximum number of pages to use
      for a given device.  It allocates one additional bio_vec to use
      internally but failed to respect BIO_MAX_PAGES, so fix this.
      
      This was the likely cause of:
        https://bugzilla.redhat.com/show_bug.cgi?id=173153
      
      Cc: stable@kernel.org
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      d659e6cc
    • M
      dm table: rework reference counting fix · f80a5570
      Mikulas Patocka 提交于
      Fix an error introduced in dm-table-rework-reference-counting.patch.
      
      When there is failure after table initialization, we need to use
      dm_table_destroy, not dm_table_put, to free the table.
      
      dm_table_put may be used only after dm_table_get.
      
      Cc: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: NJonathan Brassow <jbrassow@redhat.com>
      Reviewed-by: NAlasdair G Kergon <agk@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      f80a5570
    • M
      dm ioctl: validate name length when renaming · bc0fd67f
      Milan Broz 提交于
      When renaming a mapped device validate the length of the new name.
      
      The rename ioctl accepted any correctly-terminated string enclosed
      within the data passed from userspace.  The other ioctls enforce a
      size limit of DM_NAME_LEN.  If the name is changed and becomes longer
      than that, the device can no longer be addressed by name.
      
      Fix it by properly checking for device name length (including
      terminating zero).
      
      Cc: stable@kernel.org
      Signed-off-by: NMilan Broz <mbroz@redhat.com>
      Reviewed-by: NJonathan Brassow <jbrassow@redhat.com>
      Reviewed-by: NAlasdair G Kergon <agk@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      bc0fd67f
  2. 04 3月, 2009 1 次提交
    • D
      md: fix deadlock when stopping arrays · 5fd3a17e
      Dan Williams 提交于
      Resolve a deadlock when stopping redundant arrays, i.e. ones that
      require a call to sysfs_remove_group when shutdown.  The deadlock is
      summarized below:
      
      Thread1                Thread2
      -------                -------
      read sysfs attribute   stop array
                             take mddev lock
                             sysfs_remove_group
      sysfs_get_active
      wait for mddev lock
                             wait for active
      
      Sysrq-w:
      --------
      mdmon         S 00000017  2212  4163      1
        f1982ea8 00000046 2dcf6b85 00000017 c0b23100 f2f83ed0 c0b23100 f2f8413c
        c0b23100 c0b23100 c0b1fb98 f2f8413c 00000000 f2f8413c c0b23100 f2291ecc
        00000002 c0b23100 00000000 00000017 f2f83ed0 f1982eac 00000046 c044d9dd
      Call Trace:
        [<c044d9dd>] ? debug_mutex_add_waiter+0x1d/0x58
        [<c06ef451>] __mutex_lock_common+0x1d9/0x338
        [<c06ef451>] ? __mutex_lock_common+0x1d9/0x338
        [<c06ef5e3>] mutex_lock_interruptible_nested+0x33/0x3a
        [<c0634553>] ? mddev_lock+0x14/0x16
        [<c0634553>] mddev_lock+0x14/0x16
        [<c0634eda>] md_attr_show+0x2a/0x49
        [<c04e9997>] sysfs_read_file+0x93/0xf9
      mdadm         D 00000017  2812  4177      1
        f0401d78 00000046 430456f8 00000017 f0401d58 f0401d20 c0b23100 f2da2c4c
        c0b23100 c0b23100 c0b1fb98 f2da2c4c 0a10fc36 00000000 c0b23100 f0401d70
        00000003 c0b23100 00000000 00000017 f2da29e0 00000001 00000002 00000000
      Call Trace:
        [<c06eed1b>] schedule_timeout+0x1b/0x95
        [<c06eed1b>] ? schedule_timeout+0x1b/0x95
        [<c06eeb97>] ? wait_for_common+0x34/0xdc
        [<c044fa8a>] ? trace_hardirqs_on_caller+0x18/0x145
        [<c044fbc2>] ? trace_hardirqs_on+0xb/0xd
        [<c06eec03>] wait_for_common+0xa0/0xdc
        [<c0428c7c>] ? default_wake_function+0x0/0x12
        [<c06eeccc>] wait_for_completion+0x17/0x19
        [<c04ea620>] sysfs_addrm_finish+0x19f/0x1d1
        [<c04e920e>] sysfs_hash_and_remove+0x42/0x55
        [<c04eb4db>] sysfs_remove_group+0x57/0x86
        [<c0638086>] do_md_stop+0x13a/0x499
      
      This has been there for a while, but is easier to trigger now that mdmon
      is closely watching sysfs.
      
      Cc: <stable@kernel.org>
      Reported-by: NJacek Danecki <jacek.danecki@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      5fd3a17e
  3. 25 2月, 2009 3 次提交
    • N
      md: avoid races when stopping resync. · 73d5c38a
      NeilBrown 提交于
      There has been a race in raid10 and raid1 for a long time
      which has only recently started showing up due to a scheduler changed.
      
      When a sync_read request finishes, as soon as reschedule_retry
      is called, another thread can mark the resync request as having
      completed, so md_do_sync can finish, ->stop can be called, and
      ->conf can be freed.  So using conf after reschedule_retry is not
      safe.
      
      Similarly, when finishing a sync_write, calling md_done_sync must be
      the last thing we do, as it allows a chain of events which will free
      conf and other data structures.
      
      The first of these requires action in raid10.c
      The second requires action in raid1.c and raid10.c
      
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      73d5c38a
    • N
      md/raid10: Don't call bitmap_cond_end_sync when we are doing recovery. · 78200d45
      NeilBrown 提交于
      For raid1/4/5/6, resync (fixing inconsistencies between devices) is
      very similar to recovery (rebuilding a failed device onto a spare).
      The both walk through the device addresses in order.
      
      For raid10 it can be quite different.  resync follows the 'array'
      address, and makes sure all copies are the same.  Recover walks
      through 'device' addresses and recreates each missing block.
      
      The 'bitmap_cond_end_sync' function allows the write-intent-bitmap
      (When present) to be updated to reflect a partially completed resync.
      It makes assumptions which mean that it does not work correctly for
      raid10 recovery at all.
      
      In particularly, it can cause bitmap-directed recovery of a raid10 to
      not recovery some of the blocks that need to be recovered.
      
      So move the call to bitmap_cond_end_sync into the resync path, rather
      than being in the common "resync or recovery" path.
      
      
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      78200d45
    • N
      md/raid10: Don't skip more than 1 bitmap-chunk at a time during recovery. · 09b4068a
      NeilBrown 提交于
      When doing recovery on a raid10 with a write-intent bitmap, we only
      need to recovery chunks that are flagged in the bitmap.
      
      However if we choose to skip a chunk as it isn't flag, the code
      currently skips the whole raid10-chunk, thus it might not recovery
      some blocks that need recovering.
      
      This patch fixes it.
      
      In case that is confusing, it might help to understand that there
      is a 'raid10 chunk size' which guides how data is distributed across
      the devices, and a 'bitmap chunk size' which says how much data
      corresponds to a single bit in the bitmap.
      
      This bug only affects cases where the bitmap chunk size is smaller
      than the raid10 chunk size.
      
      
      
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      09b4068a
  4. 18 2月, 2009 1 次提交
  5. 06 2月, 2009 3 次提交
    • N
      md: Ensure an md array never has too many devices. · de01dfad
      NeilBrown 提交于
      Each different metadata format supported by md supports a
      different maximum number of devices.
      We really should be enforcing this maximum in the kernel, but
      we aren't quite doing that properly.
      
      We currently only enforce it at the 'hot_add' point, which is an
      older interface which is not used by current userspace.
      
      We need to also enforce it at 'add_new_disk' time for active arrays
      and at 'do_md_run' time when starting a new array.
      
      So move the test from 'hot_add' into 'bind_rdev_to_array' which is
      called from both 'hot_add' and 'add_new_disk, and add a new
      test in 'analyse_sbs' which is called from 'do_md_run'.
      
      This bug (or missing feature) has been around "forever" and so
      the patch is suitable for any -stable that is currently maintained.
      
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      de01dfad
    • A
      md: Fix a bug in linear.c causing which_dev() to return the wrong device. · 852c8bf4
      Andre Noll 提交于
      ab5bd5cb introduced the following
      bug in linear software raid for large arrays on 32 bit machines:
      
      which_dev() computes the device holding a given sector by shifting
      down the sector number to a 32 bit range, dividing by the array
      spacing and looking up the resulting index in the hash table of
      the array.
      
      Because the computed index might be slightly too small, a loop at
      the end of which_dev() increases the index until the given sector
      actually falls into the range of the device associated with that index.
      
      The changes of the above mentioned commit caused this loop to check
      whether the _index_ rather than the sector number is small enough,
      effectively bypassing the loop and thus possibly returning the wrong
      device.
      
      As reported by Simon Kirby, this leads to errors such as
      
      	linear_make_request: Sector 2340486136 out of bounds on dev sdi: 156301312 sectors, offset 2109870464
      
      Fix this bug by introducing a local variable for the index so that
      the variable containing the passed sector is left unchanged.
      
      Cc: stable@kernel.org
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      852c8bf4
    • N
      md: Allow read error in a single drive raid1 to be passed up. · 4706b349
      NeilBrown 提交于
      If a raid1 only has a single working device and gets a read error, 
      we choose to simply return that error up to the filesystem (or whatever)
      rather than failing the whole array.
      
      However the codes doesn't quite do that.  We attempt a readbalance
      which allocates the same drive, so we retry the read - indefinitely. 
      
      Instead:  If read_balance in the error case chooses the same drive that just
      failed, treat it as a failure and don't retry.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      4706b349
  6. 09 1月, 2009 18 次提交
    • N
      md: don't retry recovery of raid1 that fails due to error on source drive. · 4044ba58
      NeilBrown 提交于
      If a raid1 has only one working drive and it has a sector which
      gives an error on read, then an attempt to recover onto a spare will
      fail, but as the single remaining drive is not removed from the
      array, the recovery will be immediately re-attempted, resulting
      in an infinite recovery loop.
      
      So detect this situation and don't retry recovery once an error
      on the lone remaining drive is detected.
      
      Allow recovery to be retried once every time a spare is added
      in case the problem wasn't actually a media error.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      4044ba58
    • N
      md: Allow md devices to be created by name. · efeb53c0
      NeilBrown 提交于
      Using sequential numbers to identify md devices is somewhat artificial.
      Using names can be a lot more user-friendly.
      
      Also, creating md devices by opening the device special file is a bit
      awkward.
      
      So this patch provides a new option for creating and naming devices.
      
      Writing a name such as "md_home" to
          /sys/modules/md_mod/parameters/new_array
      will cause an array with that name to be created.  It will appear in
      /sys/block/ /proc/partitions and /proc/mdstat as 'md_home'.
      It will have an arbitrary minor number allocated.
      
      md devices that a created by an open are destroyed on the last
      close when the device is inactive.
      For named md devices, they will not be destroyed until the array
      is explicitly stopped, either with the STOP_ARRAY ioctl or by
      writing 'clear' to /sys/block/md_XXXX/md/array_state.
      
      The name of the array must start 'md_' to avoid conflict with
      other devices.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      efeb53c0
    • N
      md: make devices disappear when they are no longer needed. · d3374825
      NeilBrown 提交于
      Currently md devices, once created, never disappear until the module
      is unloaded.  This is essentially because the gendisk holds a
      reference to the mddev, and the mddev holds a reference to the
      gendisk, this a circular reference.
      
      If we drop the reference from mddev to gendisk, then we need to ensure
      that the mddev is destroyed when the gendisk is destroyed.  However it
      is not possible to hook into the gendisk destruction process to enable
      this.
      
      So we drop the reference from the gendisk to the mddev and destroy the
      gendisk when the mddev gets destroyed.  However this has a
      complication.
      Between the call
         __blkdev_get->get_gendisk->kobj_lookup->md_probe
      and the call
         __blkdev_get->md_open
      
      there is no obvious way to hold a reference on the mddev any more, so
      unless something is done, it will disappear and gendisk will be
      destroyed prematurely.
      
      Also, once we decide to destroy the mddev, there will be an unlockable
      moment before the gendisk is unlinked (blk_unregister_region) during
      which a new reference to the gendisk can be created.  We need to
      ensure that this reference can not be used.  i.e. the ->open must
      fail.
      
      So:
       1/  in md_probe we set a flag in the mddev (hold_active) which
           indicates that the array should be treated as active, even
           though there are no references, and no appearance of activity.
           This is cleared by md_release when the device is closed if it
           is no longer needed.
           This ensures that the gendisk will survive between md_probe and
           md_open.
      
       2/  In md_open we check if the mddev we expect to open matches
           the gendisk that we did open.
           If there is a mismatch we return -ERESTARTSYS and modify
           __blkdev_get to retry from the top in that case.
           In the -ERESTARTSYS sys case we make sure to wait until
           the old gendisk (that we succeeded in opening) is really gone so
           we loop at most once.
      
      Some udev configurations will always open an md device when it first
      appears.   If we allow an md device that was just created by an open
      to disappear on an immediate close, then this can race with such udev
      configurations and result in an infinite loop the device being opened
      and closed, then re-open due to the 'ADD' even from the first open,
      and then close and so on.
      So we make sure an md device, once created by an open, remains active
      at least until some md 'ioctl' has been made on it.  This means that
      all normal usage of md devices will allow them to disappear promptly
      when not needed, but the worst that an incorrect usage will do it
      cause an inactive md device to be left in existence (it can easily be
      removed).
      
      As an array can be stopped by writing to a sysfs attribute
        echo clear > /sys/block/mdXXX/md/array_state
      we need to use scheduled work for deleting the gendisk and other
      kobjects.  This allows us to wait for any pending gendisk deletion to
      complete by simply calling flush_scheduled_work().
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d3374825
    • N
      md: centralise all freeing of an 'mddev' in 'md_free' · a21d1504
      NeilBrown 提交于
      md_free is the .release handler for the md kobj_type.
      So it makes sense to release all the objects referenced by
      the mddev in there, rather than just prior to calling kobject_put
      for what we think is the last time.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a21d1504
    • N
      md: move allocation of ->queue from mddev_find to md_probe · 8b765398
      NeilBrown 提交于
      It is more balanced to just do simple initialisation in mddev_find,
      which allocates and links a new md device, and leave all the
      more sophisticated allocation to md_probe (which calls mddev_find).
      md_probe already allocated the gendisk.  It should allocate the
      queue too.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      8b765398
    • C
      md: need another print_sb for mdp_superblock_1 · cd2ac932
      Cheng Renquan 提交于
      md_print_devices is called in two code path: MD_BUG(...), and md_ioctl
      with PRINT_RAID_DEBUG.  it will dump out all in use md devices
      information;
      
      However, it wrongly processed two types of superblock in one:
      
      The header file <linux/raid/md_p.h> has defined two types of superblock,
      struct mdp_superblock_s (typedefed with mdp_super_t) according to md with
      metadata 0.90, and struct mdp_superblock_1 according to md with metadata
      1.0 and later,
      
      These two types of superblock are very different,
      
      The md_print_devices code processed them both in mdp_super_t, that would
      lead to wrong informaton dump like:
      
      	[ 6742.345877]
      	[ 6742.345887] md:	**********************************
      	[ 6742.345890] md:	* <COMPLETE RAID STATE PRINTOUT> *
      	[ 6742.345892] md:	**********************************
      	[ 6742.345896] md1: <ram7><ram6><ram5><ram4>
      	[ 6742.345907] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3
      	[ 6742.345909] md: rdev superblock:
      	[ 6742.345914] md:  SB: (V:0.90.0) ID:<42ef13c7.598c059a.5f9f1645.801e9ee6> CT:4919856d
      	[ 6742.345918] md:     L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536
      	[ 6742.345922] md:     UT:4919856d ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:b7992907 E:00000001
      	[ 6742.345924]      D  0:  DISK<N:0,(1,8),R:0,S:6>
      	[ 6742.345930]      D  1:  DISK<N:1,(1,10),R:1,S:6>
      	[ 6742.345933]      D  2:  DISK<N:2,(1,12),R:2,S:6>
      	[ 6742.345937]      D  3:  DISK<N:3,(1,14),R:3,S:6>
      	[ 6742.345942] md:     THIS:  DISK<N:3,(1,14),R:3,S:6>
      	...
      	[ 6742.346058] md0: <ram3><ram2><ram1><ram0>
      	[ 6742.346067] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3
      	[ 6742.346070] md: rdev superblock:
      	[ 6742.346073] md:  SB: (V:1.0.0) ID:<369aad81.00000000.00000000.00000000> CT:9a322a9c
      	[ 6742.346077] md:     L-1507699579 S976570180 ND:48 RD:0 md0 LO:65536 CS:196610
      	[ 6742.346081] md:     UT:00000018 ST:0 AD:131048 WD:0 FD:8 SD:0 CSUM:00000000 E:00000000
      	[ 6742.346084]      D  0:  DISK<N:-1,(-1,-1),R:-1,S:-1>
      	[ 6742.346089]      D  1:  DISK<N:-1,(-1,-1),R:-1,S:-1>
      	[ 6742.346092]      D  2:  DISK<N:-1,(-1,-1),R:-1,S:-1>
      	[ 6742.346096]      D  3:  DISK<N:-1,(-1,-1),R:-1,S:-1>
      	[ 6742.346102] md:     THIS:  DISK<N:0,(0,0),R:0,S:0>
      	...
      	[ 6742.346219] md:	**********************************
      	[ 6742.346221]
      
      Here md1 is metadata 0.90.0, and md0 is metadata 1.2
      
      After some more code to distinguish these two types of superblock, in this patch,
      
      it will generate dump information like:
      
      	[ 7906.755790]
      	[ 7906.755799] md:	**********************************
      	[ 7906.755802] md:	* <COMPLETE RAID STATE PRINTOUT> *
      	[ 7906.755804] md:	**********************************
      	[ 7906.755808] md1: <ram7><ram6><ram5><ram4>
      	[ 7906.755819] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3
      	[ 7906.755821] md: rdev superblock (MJ:0):
      	[ 7906.755826] md:  SB: (V:0.90.0) ID:<3fca7a0d.a612bfed.5f9f1645.801e9ee6> CT:491989f3
      	[ 7906.755830] md:     L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536
      	[ 7906.755834] md:     UT:491989f3 ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:00fb52ad E:00000001
      	[ 7906.755836]      D  0:  DISK<N:0,(1,8),R:0,S:6>
      	[ 7906.755842]      D  1:  DISK<N:1,(1,10),R:1,S:6>
      	[ 7906.755845]      D  2:  DISK<N:2,(1,12),R:2,S:6>
      	[ 7906.755849]      D  3:  DISK<N:3,(1,14),R:3,S:6>
      	[ 7906.755855] md:     THIS:  DISK<N:3,(1,14),R:3,S:6>
      	...
      	[ 7906.755972] md0: <ram3><ram2><ram1><ram0>
      	[ 7906.755981] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3
      	[ 7906.755984] md: rdev superblock (MJ:1):
      	[ 7906.755989] md:  SB: (V:1) (F:0) Array-ID:<5fbcf158:55aa:5fbe:9a79:1e939880dcbd>
      	[ 7906.755990] md:    Name: "DG5:0" CT:1226410480
      	[ 7906.755998] md:       L5 SZ130944 RD:4 LO:2 CS:128 DO:24 DS:131048 SO:8 RO:0
      	[ 7906.755999] md:     Dev:00000003 UUID: 9194d744:87f7:a448:85f2:7497b84ce30a
      	[ 7906.756001] md:       (F:0) UT:1226410480 Events:0 ResyncOffset:-1 CSUM:0dbcd829
      	[ 7906.756003] md:         (MaxDev:384)
      	...
      	[ 7906.756113] md:	**********************************
      	[ 7906.756116]
      
      this md0 (metadata 1.2) information dumping is exactly according to struct
      mdp_superblock_1.
      Signed-off-by: NCheng Renquan <crquan@gmail.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Dan Williams <dan.j.williams@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      cd2ac932
    • C
      md: use list_for_each_entry macro directly · 159ec1fc
      Cheng Renquan 提交于
      The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
      list_for_each_entry_safe, from <linux/list.h>, it should be defined to
      use list_for_each_entry_safe, instead of reinventing the wheel.
      
      But some calls to each_entry_safe don't really need a safe version,
      just a direct list_for_each_entry is enough, this could save a temp
      variable (tmp) in every function that used rdev_for_each.
      
      In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
      totally save many tmp vars; and only in the other situations that will call
      list_del to delete an entry, the safe version is used.
      Signed-off-by: NCheng Renquan <crquan@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      159ec1fc
    • A
      md: raid0: make hash_spacing and preshift sector-based. · ccacc7d2
      Andre Noll 提交于
      This patch renames the hash_spacing and preshift members of struct
      raid0_private_data to spacing and sector_shift respectively and
      changes the semantics as follows:
      
      We always have spacing = 2 * hash_spacing. In case
      sizeof(sector_t) > sizeof(u32) we also have sector_shift = preshift + 1
      while sector_shift = preshift = 0 otherwise.
      
      Note that the values of nb_zone and zone are unaffected by these changes
      because in the sector_div() preceeding the assignement of these two
      variables both arguments double.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      ccacc7d2
    • A
      md: raid0: Represent the size of strip zones in sectors. · 83838ed8
      Andre Noll 提交于
      This completes the block -> sector conversion of struct strip_zone.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      83838ed8
    • A
      md: raid0 create_strip_zones(): Add KERN_INFO/KERN_ERR to printk's. · 0825b87a
      Andre Noll 提交于
      This patch consists only of these trivial changes.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      0825b87a
    • A
      md: raid0 create_strip_zones(): Make two local variables sector-based. · 6b8796cc
      Andre Noll 提交于
      current_offset and curr_zone_offset stored the corresponding offsets
      as 1K quantities. Rename them to current_start and curr_zone_start
      to match the naming of struct strip_zone and store the offsets as
      sector counts.
      
      Also, add KERN_INFO to the printk() affected by this change to make
      checkpatch happy.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      6b8796cc
    • A
      md: raid0: Represent zone->zone_offset in sectors. · 6199d3db
      Andre Noll 提交于
      For the same reason as in the previous patch, rename it from zone_offset
      to zone_start.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      6199d3db
    • A
      md: raid0: Represent device offset in sectors. · 019c4e2f
      Andre Noll 提交于
      Rename zone->dev_offset to zone->dev_start to make sure all users
      have been converted.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      019c4e2f
    • A
      md: raid0_make_request(): Replace local variable block by sector. · e0f06868
      Andre Noll 提交于
      This change already simplifies the code a bit.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      e0f06868
    • A
      md: raid0_make_request(): Remove local variable chunk_size. · a4712005
      Andre Noll 提交于
      We might as well use chunk_sects instead.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a4712005
    • A
      md: raid0_make_request(): Replace chunksize_bits by chunksect_bits. · 1b7fdf8f
      Andre Noll 提交于
      As ffz(~(2 * x)) = ffz(~x) + 1, we have
      
      	chunksect_bits = chunksize_bits + 1.
      
      Fixup all users accordingly.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      1b7fdf8f
    • N
      md: use sysfs_notify_dirent to notify changes to md/sync_action. · 0c3573f1
      NeilBrown 提交于
      There is no compelling need for this, but sysfs_notify_dirent is a
      nicer interface and the change is good for consistency.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      0c3573f1
    • N
      md: fix bitmap-on-external-file bug. · 53845270
      NeilBrown 提交于
      commit a2ed9615
      fixed a bug with 'internal' bitmaps, but in the process broke
      'in a file' bitmaps.  So they are broken in 2.6.28
      
      This fixes it, and needs to go in 2.6.28-stable.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
      53845270
  7. 06 1月, 2009 9 次提交
    • J
      dm snapshot: extend exception store functions · a159c1ac
      Jonathan Brassow 提交于
      Supply dm_add_exception as a callback to the read_metadata function.
      Add a status function ready for a later patch and name the functions
      consistently.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      a159c1ac
    • A
      dm snapshot: split out exception store implementations · 4db6bfe0
      Alasdair G Kergon 提交于
      Move the existing snapshot exception store implementations out into
      separate files.  Later patches will place these behind a new
      interface in preparation for alternative implementations.
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      4db6bfe0
    • J
      dm snapshot: rename struct exception_store · 1ae25f9c
      Jonathan Brassow 提交于
      Rename struct exception_store to dm_exception_store.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      1ae25f9c
    • J
      dm snapshot: separate out exception store interface · aea53d92
      Jonathan Brassow 提交于
      Pull structures that bridge the gap between snapshot and
      exception store out of dm-snap.h and put them in a new
      .h file - dm-exception-store.h.  This file will define the
      API for new exception stores.
      
      Ultimately, dm-snap.h is unnecessary, since only dm-snap.c
      should be using it.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      aea53d92
    • A
      dm mpath: move trigger_event to system workqueue · fe9cf30e
      Alasdair G Kergon 提交于
      The same workqueue is used both for sending uevents and processing queued I/O.
      Deadlock has been reported in RHEL5 when sending a uevent was blocked waiting
      for the queued I/O to be processed.  Use scheduled_work() for the asynchronous
      uevents instead.
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      fe9cf30e
    • M
      dm: add name and uuid to sysfs · 784aae73
      Milan Broz 提交于
      Implement simple read-only sysfs entry for device-mapper block device.
      
      This patch adds a simple sysfs directory named "dm" under block device
      properties and implements
      	- name attribute (string containing mapped device name)
      	- uuid attribute (string containing UUID, or empty string if not set)
      
      The kobject is embedded in mapped_device struct, so no additional
      memory allocation is needed for initializing sysfs entry.
      
      During the processing of sysfs attribute we need to lock mapped device
      which is done by a new function dm_get_from_kobj, which returns the md
      associated with kobject and increases the usage count.
      
      Each 'show attribute' function is responsible for its own locking.
      Signed-off-by: NMilan Broz <mbroz@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      784aae73
    • M
      dm table: rework reference counting · d5816876
      Mikulas Patocka 提交于
      Rework table reference counting.
      
      The existing code uses a reference counter. When the last reference is
      dropped and the counter reaches zero, the table destructor is called.
      Table reference counters are acquired/released from upcalls from other
      kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all).
      If the reference counter reaches zero in one of the upcalls, the table
      destructor is called from almost random kernel code.
      
      This leads to various problems:
      * dm_any_congested being called under a spinlock, which calls the
        destructor, which calls some sleeping function.
      * the destructor attempting to take a lock that is already taken by the
        same process.
      * stale reference from some other kernel code keeps the table
        constructed, which keeps some devices open, even after successful
        return from "dmsetup remove". This can confuse lvm and prevent closing
        of underlying devices or reusing device minor numbers.
      
      The patch changes reference counting so that the table destructor can be
      called only at predetermined places.
      
      The table has always exactly one reference from either mapped_device->map
      or hash_cell->new_map. After this patch, this reference is not counted
      in table->holders.  A pair of dm_create_table/dm_destroy_table functions
      is used for table creation/destruction.
      
      Temporary references from the other code increase table->holders. A pair
      of dm_table_get/dm_table_put functions is used to manipulate it.
      
      When the table is about to be destroyed, we wait for table->holders to
      reach 0. Then, we call the table destructor.  We use active waiting with
      msleep(1), because the situation happens rarely (to one user in 5 years)
      and removing the device isn't performance-critical task: the user doesn't
      care if it takes one tick more or not.
      
      This way, the destructor is called only at specific points
      (dm_table_destroy function) and the above problems associated with lazy
      destruction can't happen.
      
      Finally remove the temporary protection added to dm_any_congested().
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      d5816876
    • A
      dm: support barriers on simple devices · ab4c1424
      Andi Kleen 提交于
      Implement barrier support for single device DM devices
      
      This patch implements barrier support in DM for the common case of dm linear
      just remapping a single underlying device. In this case we can safely
      pass the barrier through because there can be no reordering between
      devices.
      
       NB. Any DM device might cease to support barriers if it gets
           reconfigured so code must continue to allow for a possible
           -EOPNOTSUPP on every barrier bio submitted.  - agk
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      ab4c1424
    • K
      dm request: add caches · 8fbf26ad
      Kiyoshi Ueda 提交于
      This patch prepares some kmem_caches for request-based dm.
      Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      8fbf26ad