1. 22 Dec, 2012 14 commits
    • dm raid: round region_size to power of two · 3a0f9aae
      Jonathan Brassow authored
      If the user does not supply a bitmap region_size to the dm raid target,
      a reasonable size is computed automatically.  If this is not a power of 2,
      the md code will report an error later.
      
      This patch catches the problem early and rounds the region_size to the
      next power of two.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
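A minimal user-space sketch of the rounding described above, not the dm-raid code itself. The helper names (`is_power_of_two`, `round_up_pow2`, `pick_region_size`) are assumptions made for illustration, and the loop assumes the input is at most 2^63.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when exactly one bit is set. */
static bool is_power_of_two(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Round n up to the next power of two. A value that already is a
 * power of two is returned unchanged. Assumes n <= 2^63 so the
 * shift cannot overflow.
 */
static uint64_t round_up_pow2(uint64_t n)
{
	uint64_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Hypothetical helper: fix up an automatically computed region size. */
static uint64_t pick_region_size(uint64_t computed)
{
	return is_power_of_two(computed) ? computed : round_up_pow2(computed);
}
```

Rounding up rather than down keeps the region at least as large as the computed value, which matches the "next power of two" wording in the commit message.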
    • dm thin: cleanup dead code · 2aab3850
      Joe Thornber authored
      Remove unused @data_block parameter from cell_defer.
      Change thin_bio_map to use many returns rather than setting a variable.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: rename cell_defer_except to cell_defer_no_holder · f286ba0e
      Joe Thornber authored
      Rename cell_defer_except() to cell_defer_no_holder() which describes
      its function more clearly.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm snapshot: optimize track_chunk · 9aa0c0e6
      Mikulas Patocka authored
      track_chunk is always called with interrupts enabled. Consequently, we
      do not need to save and restore interrupt state in "flags" variable.
      This patch changes spin_lock_irqsave to spin_lock_irq and
      spin_unlock_irqrestore to spin_unlock_irq.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm raid: use DM_ENDIO_INCOMPLETE · 19cbbc60
      Mikulas Patocka authored
      Use a defined macro DM_ENDIO_INCOMPLETE instead of a numeric constant.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm raid1: remove impossible mempool_alloc error test · 7c27213b
      Mikulas Patocka authored
      mempool_alloc can't fail if __GFP_WAIT is specified, so the condition
      that tests if read_record is non-NULL is always true.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: emit ignore_discard in status when discards disabled · 018debea
      Mike Snitzer authored
      If "ignore_discard" is specified when creating the thin pool device then
      discard support is disabled for that device.  The pool device's status
      should reflect this fact rather than stating "no_discard_passdown"
      (which implies discards are enabled but passdown is disabled).
      Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
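The distinction between the two status words can be sketched as below. This is an illustrative model only: `struct pool_features`, its two flags, and `discard_status()` are hypothetical stand-ins, and the empty string returned in the fully enabled case is an assumption rather than what the thin-pool target actually prints.

```c
#include <stdbool.h>

/* Hypothetical model of the two discard-related pool settings. */
struct pool_features {
	bool discard_enabled;   /* false when "ignore_discard" was given */
	bool discard_passdown;  /* false when passdown was switched off */
};

/*
 * Pick the status word for the discard settings. Disabled discards
 * take priority: reporting "no_discard_passdown" in that case would
 * wrongly imply that discards are enabled.
 */
static const char *discard_status(const struct pool_features *pf)
{
	if (!pf->discard_enabled)
		return "ignore_discard";
	if (!pf->discard_passdown)
		return "no_discard_passdown";
	return "";  /* assumption: nothing to report when both are on */
}
```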
    • dm persistent data: fix nested btree deletion · e3cbf945
      Joe Thornber authored
      When deleting nested btrees, the code forgets to delete the innermost
      btree.  The thin-metadata code serendipitously compensates for this by
      claiming there is one extra layer in the tree.
      
      This patch corrects both problems.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: wake worker when discard is prepared · 563af186
      Joe Thornber authored
      When discards are prepared it is best to directly wake the worker that
      will process them.  The worker will be woken anyway, via periodic
      commit, but there is no reason to not wake_worker here.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: fix race between simultaneous io and discards to same block · e8088073
      Joe Thornber authored
      There is a race when discard bios and non-discard bios are issued
      simultaneously to the same block.
      
      Discard support is expensive for all thin devices precisely because you
      have to be careful to quiesce the area you're discarding.  DM thin must
      handle this conflicting IO pattern (simultaneous non-discard vs discard)
      even though a sane application shouldn't be issuing such IO.
      
      The race manifests as follows:
      
      1. A non-discard bio is mapped in thin_bio_map.
         This doesn't lock out parallel activity to the same block.
      
      2. A discard bio is issued to the same block as the non-discard bio.
      
      3. The discard bio is locked in a dm_bio_prison_cell in process_discard
         to lock out parallel activity against the same block.
      
      4. The non-discard bio's mapping continues and its all_io_entry is
         incremented so the bio is accounted for in the thin pool's all_io_ds
         which is a dm_deferred_set used to track time locality of non-discard IO.
      
      5. The non-discard bio is finally locked in a dm_bio_prison_cell in
         process_bio.
      
      The race can result in deadlock, leaving the block layer hanging waiting
      for completion of a discard bio that never completes, e.g.:
      
      INFO: task ruby:15354 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      ruby            D ffffffff8160f0e0     0 15354  15314 0x00000000
       ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900
       ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900
       ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0
      Call Trace:
       [<ffffffff814e5a19>] schedule+0x29/0x70
       [<ffffffff814e3d85>] schedule_timeout+0x195/0x220
       [<ffffffffa06b9bc1>] ? _dm_request+0x111/0x160 [dm_mod]
       [<ffffffff814e589e>] wait_for_common+0x11e/0x190
       [<ffffffff8107a170>] ? try_to_wake_up+0x2b0/0x2b0
       [<ffffffff814e59ed>] wait_for_completion+0x1d/0x20
       [<ffffffff81233289>] blkdev_issue_discard+0x219/0x260
       [<ffffffff81233e79>] blkdev_ioctl+0x6e9/0x7b0
       [<ffffffff8119a65c>] block_ioctl+0x3c/0x40
       [<ffffffff8117539c>] do_vfs_ioctl+0x8c/0x340
       [<ffffffff8119a547>] ? block_llseek+0x67/0xb0
       [<ffffffff811756f1>] sys_ioctl+0xa1/0xb0
       [<ffffffff810561f6>] ? sys_rt_sigprocmask+0x86/0xd0
       [<ffffffff814ef099>] system_call_fastpath+0x16/0x1b
      
      The thinp-test-suite's test_discard_random_sectors reliably hits this
      deadlock on fast SSD storage.
      
      The fix for this race is that the all_io_entry for a bio must be
      incremented whilst the dm_bio_prison_cell is held for the bio's
      associated virtual and physical blocks.  That cell locking wasn't
      occurring early enough in thin_bio_map.  This patch fixes this.
      
      Care is taken to always call the new function inc_all_io_entry() with
      the relevant cells locked, but they are generally unlocked before
      calling issue() to try to avoid holding the cells locked across
      generic_submit_request.
      
      Also, now that thin_bio_map may lock bios in a cell, process_bio() is no
      longer the only thread that will do so.  Because of this we must be sure
      to use cell_defer_except() to release all non-holder entries, that
      were added by the other thread, because they must be deferred.
      
      This patch depends on "dm thin: replace dm_cell_release_singleton with
      cell_defer_except".
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@vger.kernel.org
    • dm thin: replace dm_cell_release_singleton with cell_defer_except · b7ca9c92
      Joe Thornber authored
      Change existing users of the function dm_cell_release_singleton to share
      cell_defer_except instead, and then remove the now-unused function.
      
      Everywhere that calls dm_cell_release_singleton, the bio in question
      is the holder of the cell.
      
      If there are no non-holder entries in the cell then cell_defer_except
      behaves exactly like dm_cell_release_singleton.  Conversely, if there
      *are* non-holder entries then dm_cell_release_singleton must not be used
      because those entries would need to be deferred.
      
      Consequently, it is safe to replace use of dm_cell_release_singleton
      with cell_defer_except.
      
      This patch is a pre-requisite for "dm thin: fix race between
      simultaneous io and discards to same block".
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: disable WRITE SAME · c1a94672
      Mike Snitzer authored
      WRITE SAME bios are not yet handled correctly by device-mapper so
      disable their use on device-mapper devices by setting
      max_write_same_sectors to zero.
      
      As an example, a ciphertext device is incompatible because the data
      gets changed according to the location at which it is written and so the
      dm crypt target cannot support it.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      Cc: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm ioctl: prevent unsafe change to dm_ioctl data_size · e910d7eb
      Alasdair G Kergon authored
      Abort dm ioctl processing if userspace changes the data_size parameter
      after we validated it but before we finished copying the data buffer
      from userspace.
      
      The dm ioctl parameters are processed in the following sequence:
       1. ctl_ioctl() calls copy_params();
       2. copy_params() makes a first copy of the fixed-sized portion of the
          userspace parameters into the local variable "tmp";
       3. copy_params() then validates tmp.data_size and allocates a new
          structure big enough to hold the complete data and copies the whole
          userspace buffer there;
       4. ctl_ioctl() reads userspace data the second time and copies the whole
          buffer into the pointer "param";
       5. ctl_ioctl() reads param->data_size without any validation and stores it
          in the variable "input_param_size";
       6. "input_param_size" is further used as the authoritative size of the
          kernel buffer.
      
      The problem is that userspace code could change the contents of user
      memory between steps 2 and 4.  In particular, the data_size parameter
      can be changed to an invalid value after the kernel has validated it.
      This lets userspace force the kernel to access invalid kernel memory.
      
      The fix is to ensure that the size has not changed at step 4.
      
      This patch shouldn't have a security impact because CAP_SYS_ADMIN is
      required to run this code, but it should be fixed anyway.
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
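The double-read pattern and the check that closes it can be sketched in user space as follows. This is only a model under stated assumptions: `struct params_hdr` stands in for the fixed-size part of the ioctl parameters, `copy_params_checked()` is a hypothetical name, and plain `memcpy` stands in for the copy from user memory.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the fixed-size portion of the ioctl parameters. */
struct params_hdr {
	size_t data_size;  /* total size claimed by the caller, header included */
};

/*
 * Copy a caller-owned buffer whose contents may change concurrently.
 * Returns a private copy, or NULL if the claimed size is invalid or
 * differs between the first and the second read.
 */
static void *copy_params_checked(const void *user, size_t max_size)
{
	struct params_hdr tmp;
	void *copy;

	/* First read: fixed-size header only, then validate the size. */
	memcpy(&tmp, user, sizeof(tmp));
	if (tmp.data_size < sizeof(tmp) || tmp.data_size > max_size)
		return NULL;

	copy = malloc(tmp.data_size);
	if (!copy)
		return NULL;

	/* Second read: the whole buffer, using the validated size. */
	memcpy(copy, user, tmp.data_size);

	/*
	 * The header has now been read twice. If the size in the second
	 * copy differs from the validated one, the caller changed it in
	 * between, so the copy must not be trusted.
	 */
	if (((struct params_hdr *)copy)->data_size != tmp.data_size) {
		free(copy);
		return NULL;
	}
	return copy;
}
```

After the check, later code can treat the size stored in the private copy as authoritative, because it is known to equal the value that was validated.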
    • dm persistent data: rename node to btree_node · 550929fa
      Mikulas Patocka authored
      This patch fixes a compilation failure on sparc32 by renaming struct node.
      
      struct node is already defined in include/linux/node.h. On sparc32, it
      happens to be included through other dependencies and persistent-data
      doesn't compile because of conflicting declarations.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
  2. 27 Nov, 2012 1 commit
    • md/raid1{,0}: fix deadlock in bitmap_unplug. · 874807a8
      NeilBrown authored
      If the raid1 or raid10 unplug function gets called
      from a make_request function (which is very possible) when
      there are bios on the current->bio_list list, then it will not
      be able to successfully call bitmap_unplug() and it could
      need to submit more bios and wait for them to complete.
      But they won't complete while current->bio_list is non-empty.
      
      So detect that case and hand the unplugging off to another thread
      just like we already do when called from within the scheduler.
      
      RAID1 version of bug was introduced in 3.6, so that part of fix is
      suitable for 3.6.y.  RAID10 part won't apply.
      
      Cc: stable@vger.kernel.org
      Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
      Reported-by: Peter Maloney <peter.maloney@brockmann-consult.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
  3. 23 Nov, 2012 1 commit
  4. 22 Nov, 2012 4 commits
    • md/raid10: decrement correct pending counter when writing to replacement. · 884162df
      NeilBrown authored
      When a write to a replacement device completes, we carefully
      and correctly found the rdev that the write actually went to
      and then blithely called rdev_dec_pending on the primary rdev,
      even if this write was to the replacement.
      
      This means that any writes to an array while a replacement
      was ongoing would cause the nr_pending count for the primary
      device to go negative, so it could never be removed.
      
      This bug has been present since replacement was introduced in
      3.3, so it is suitable for any -stable kernel since then.
      Reported-by: "George Spelvin" <linux@horizon.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: close race that loses writes when replacement completes. · e7c0c3fa
      NeilBrown authored
      When a replacement operation completes there is a small window
      when the original device is marked 'faulty' and the replacement
      still looks like a replacement.  The faulty device should be removed and
      the replacement moved into place very quickly, but it isn't instant.
      
      So the code that writes out to the array must handle the possibility that
      the only working device for some slot is the replacement - but it
      doesn't.  If the primary device is faulty it just gives up.  This
      can lead to corruption.
      
      So make the code more robust: if either the primary or the
      replacement is present and working, write to them.  Only when
      neither is present do we give up.
      
      This bug has been present since replacement was introduced in
      3.3, so it is suitable for any -stable kernel since then.
      Reported-by: "George Spelvin" <linux@horizon.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
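The more robust selection rule described in this commit can be modelled as below. This is a hedged sketch, not the raid10 write path: `struct dev`, `struct slot`, `usable()` and `pick_write_targets()` are hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a member device. */
struct dev {
	bool faulty;
};

/* One slot of the array: either pointer may be NULL. */
struct slot {
	struct dev *primary;
	struct dev *replacement;
};

static bool usable(const struct dev *d)
{
	return d != NULL && !d->faulty;
}

/*
 * Decide which devices of a slot should receive a write. Each device
 * is judged on its own, so a faulty primary no longer hides a working
 * replacement. Returns the number of targets; 0 means neither device
 * is usable and the write to this slot has to be given up.
 */
static int pick_write_targets(const struct slot *s,
			      bool *to_primary, bool *to_replacement)
{
	*to_primary = usable(s->primary);
	*to_replacement = usable(s->replacement);
	return (int)*to_primary + (int)*to_replacement;
}
```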
    • md/raid5: Make sure we clear R5_Discard when discard is finished. · ca64cae9
      NeilBrown authored
      Commit 9e444768
          MD: raid5 avoid unnecessary zero page for trim
      
      changed raid5 to clear R5_Discard when the complete request is
      handled rather than when submitting the per-device discard request.
      However it did not clear R5_Discard for the parity device.
      
      This means that if the stripe_head was reused before it expired from
      the cache, the setting would be wrong and a hang would result.
      
      Also if the R5_Uptodate bit happens to be set, R5_Discard again
      won't be cleared.  But R5_Uptodate really should be clear at this point.
      
      So make sure R5_Discard is cleared in all cases, and clear
      R5_Uptodate when a 'discard' completes.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: move resolving of reconstruct_state earlier in stripe_handle. · ef5b7c69
      NeilBrown authored
      
      The chunk of code in handle_stripe which responds to a
      *_result value in reconstruct_state is really the completion
      of some processing that happened outside of handle_stripe
      (possibly asynchronously) and so should be one of the first
      things done in handle_stripe().
      
      After the next patch it will be important that it happens before
      handle_stripe_clean_event(), as that will clear some dev->flags
      bit that this code tests.
      Signed-off-by: NeilBrown <neilb@suse.de>
  5. 20 Nov, 2012 4 commits
  6. 31 Oct, 2012 2 commits
    • MD RAID10: Fix oops when creating RAID10 arrays via dm-raid.c · ed30be07
      Jonathan Brassow authored
      Commit 2863b9eb didn't take into account the changes to add TRIM support to
      RAID10 (commit 532a2a3f).  That is, when using dm-raid.c to create the
      RAID10 arrays, there is no mddev->gendisk or mddev->queue.  The code added
      to support TRIM simply assumes that mddev->queue is available without
      checking.  The result is an oops any time dm-raid.c attempts to create a
      RAID10 device.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1: Fix assembling of arrays containing Replacements. · 02b898f2
      NeilBrown authored
      setup_conf in raid1.c uses conf->raid_disks before assigning
      a value.  It is used when including 'Replacement' devices.
      
      The consequence is that assembling an array which contains a
      replacement will misbehave and either not include the replacement, or
      not include the device being replaced.
      
      Though this doesn't lead directly to data corruption, it could lead to
      reduced data safety.
      
      So use mddev->raid_disks, which is initialised, instead.
      
      Bug was introduced by commit c19d5798
            md/raid1: recognise replacements when assembling arrays.
      
      in 3.3, so fix is suitable for 3.3.y thru 3.6.y.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
  7. 22 Oct, 2012 1 commit
    • md faulty: use disk_stack_limits() · 0be1fecd
      Eric Sandeen authored
      Since commit fe86cdce ("block: do not artificially constrain max_sectors
      for stacking drivers"), max_sectors defaults to UINT_MAX.  md faulty wasn't
      using disk_stack_limits(), so it inherited this large value as well.
      This triggered a bug in XFS when stressed over md_faulty, when
      a very large bio_alloc() failed.
      
      That was on an older kernel, and I can't reproduce exactly the
      same thing upstream, but I think the fix is appropriate in any
      case.
      
      Thanks to Mike Snitzer for pointing out the problem.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  8. 13 Oct, 2012 4 commits
  9. 12 Oct, 2012 4 commits
  10. 11 Oct, 2012 5 commits
    • md: refine reporting of resync/reshape delays. · 72f36d59
      NeilBrown authored
      If 'resync_max' is set to 0 (as is often done when starting a
      reshape, so that mdadm can remain in control during a sensitive
      period), and if the reshape request is initially delayed because
      another array using the same device is resyncing or reshaping etc,
      then user-space cannot easily tell when the delay changes from being
      due to a conflicting reshape, to being due to resync_max = 0.
      
      So introduce a new state: (curr_resync == 3) to reflect this, make
      sure it is visible both via /proc/mdstat and via the "sync_completed"
      sysfs attribute, and ensure that the event transition from one delay
      state to the other is properly notified.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: be careful not to resize_stripes too big. · e56108d6
      NeilBrown authored
      When a RAID5 is reshaping, conf->raid_disks is increased
      before mddev->delta_disks becomes zero.
      This can result in check_reshape calling resize_stripes with a
      number that is too large.  This particularly happens
      when md_check_recovery calls ->check_reshape().
      
      If we use ->previous_raid_disks, we don't risk this.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make sure manual changes to recovery checkpoint are saved. · db07d85e
      NeilBrown authored
      If you make an array bigger but suppress resync of the new region with
        mdadm --grow /dev/mdX --size=max --assume-clean
      
      then stop the array before anything is written to it, the effect of
      the "--assume-clean" is lost and the array will resync the new space
      when restarted.
      So ensure that we update the metadata in this case.
      Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: use correct limit variable · 91502f09
      Dan Carpenter authored
      Clang complains that we are assigning a variable to itself.  This should
      be using bad_sectors like the similar earlier check does.
      
      Bug has been present since 3.1-rc1.  It is minor but could
      conceivably cause corruption or other bad behaviour.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
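The class of bug Clang flagged here, a limit check whose body assigns a variable to itself, can be illustrated as follows. Only `bad_sectors` is named in the commit message; `max_sync`, the `sector_t` typedef and the function name `limit_to_bad_range()` are assumptions made for this sketch.

```c
/* Assumed sector count type for this sketch. */
typedef unsigned long long sector_t;

/*
 * Clamp a sync length to the number of sectors covered by a bad
 * range. The buggy form of this check had the body
 *     max_sync = max_sync;
 * which is a no-op, so the limit was never applied.
 */
static sector_t limit_to_bad_range(sector_t max_sync, sector_t bad_sectors)
{
	if (max_sync > bad_sectors)
		max_sync = bad_sectors;
	return max_sync;
}
```

With the self-assignment, a request longer than the bad range would pass through unclamped, which is how a harmless-looking typo could conceivably lead to the bad behaviour the message mentions.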
    • md: writing to sync_action should clear the read-auto state. · 48c26ddc
      NeilBrown authored
      In some cases arrays are started in 'read-auto' state, wherein
      nothing gets written to any device until the array is written
      to.  The purpose of this is to make accidental auto-assembly
      of the wrong arrays less of a risk, and to allow arrays to be
      started to read suspend-to-disk images without actually changing
      anything (as might happen if the array were dirty and a
      resync seemed necessary).
      
      Explicitly writing the 'sync_action' for a read-auto array currently
      doesn't clear the read-auto state, so the sync action doesn't
      happen, which can be confusing.
      
      So allow any successful write to sync_action to clear any read-auto
      state.
      Reported-by: Alexander Kühn <alexander.kuehn@nagilum.de>
      Signed-off-by: NeilBrown <neilb@suse.de>