1. 16 Feb, 2011 1 commit
    • md: don't set_capacity before array is active. · cbe6ef1d
      Committed by NeilBrown
      If the desired size of an array is set (via sysfs) before the array is
      active (which is the normal sequence), we currently call set_capacity
      immediately.
      This means that a subsequent 'open' (as can be caused by some
      udev-triggers program) will notice the new size and try to probe for
      partitions.  However as the array isn't quite ready yet the read will
      fail.  Then when the array is ready, as the size doesn't change again
      we don't try to re-probe.
      
      So when setting array size via sysfs, only call set_capacity if the
      array is already active.
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 08 Feb, 2011 1 commit
    • md_make_request: don't touch the bio after calling make_request · e91ece55
      Committed by Chris Mason
      md_make_request was calling bio_sectors() for part_stat_add
      after it was calling the make_request function.  This is
      bad because the make_request function can free the bio and
      because the bi_size field can change around.
      
      The fix here was suggested by Jens Axboe.  It saves the
      sector count before the make_request call.  I hit this
      with CONFIG_DEBUG_PAGEALLOC turned on while trying to break
      his pretty fusionio card.
      
      Cc: <stable@kernel.org>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
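The ordering issue described in this commit can be sketched in plain user-space C. This is only an illustration, not the kernel code: `struct fake_bio`, `consume_request()` and `submit_and_account()` are hypothetical names, and `free()` stands in for whatever a real make_request function may do to the bio.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a bio; only the size field matters here. */
struct fake_bio {
    unsigned int size_bytes;
};

/* Hypothetical consumer that takes ownership of the request and frees it,
 * as a real make_request function is allowed to do. */
void consume_request(struct fake_bio *bio)
{
    free(bio);
}

/* Returns the sector count to account for, read BEFORE the hand-off. */
unsigned int submit_and_account(struct fake_bio *bio)
{
    unsigned int sectors = bio->size_bytes >> 9; /* 512-byte sectors */

    consume_request(bio); /* bio must not be dereferenced after this call */
    return sectors;
}
```

Reading `bio->size_bytes` after `consume_request()` would be a use-after-free, which is the kind of access CONFIG_DEBUG_PAGEALLOC is reported to have caught.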
  3. 02 Feb, 2011 1 commit
  4. 31 Jan, 2011 4 commits
    • md: don't clear curr_resync_completed at end of resync. · 7281f812
      Committed by NeilBrown
      There is no need to set this to zero at this point.  It will be
      set to zero by remove_and_add_spares or at the start of
      md_do_sync at the latest.
      And setting it to zero before MD_RECOVERY_RUNNING is cleared can
      make a 'zero' appear briefly in the 'sync_completed' sysfs attribute
      just as resync is finishing.
      
      So simply remove this setting to zero.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Don't use remove_and_add_spares to remove failed devices from a read-only array · a8c42c7f
      Committed by NeilBrown
      remove_and_add_spares is called in two places where the needs really
      are very different.
      remove_and_add_spares should not be called on an array which is about
      to be reshaped as some extra devices might have been manually added
      and that would remove them.  However if the array is 'read-auto',
      that will currently happen, which is bad.
      
      So in the 'ro != 0' case don't call remove_and_add_spares but simply
      remove the failed devices as the comment suggests is needed.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Remove the AllReserved flag for component devices. · f21e9ff7
      Committed by NeilBrown
      This flag is not needed and is used badly.
      
      Devices that are included in a native-metadata array are reserved
      exclusively for that array - and currently have AllReserved set.
      They all are bd_claimed for the rdev and so cannot be shared.
      
      Devices that are included in external-metadata arrays can be shared
      among multiple arrays - providing there is no overlap.
      These are bd_claimed for md in general - not for a particular rdev.
      
      When changing the amount of a device that is used in an array we need
      to check for overlap.  This currently includes a check on AllReserved.
      So even without overlap, sharing with an AllReserved device is not
      allowed.
      However the bd_claim usage already precludes sharing with these
      devices, so the test on AllReserved is not needed.  And in fact it is
      wrong.
      
      As this is the only use of AllReserved, simply remove all usage and
      definition of AllReserved.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: revert change to raid_disks on failure. · de171cb9
      Committed by NeilBrown
      If we try to update_raid_disks and it fails, we should put
      'delta_disks' back to zero.  This is important because some code,
      such as slot_store, assumes that delta_disks has been validated.
      Signed-off-by: NeilBrown <neilb@suse.de>
  5. 15 Jan, 2011 1 commit
    • block: restore multiple bd_link_disk_holder() support · 49731baa
      Committed by Tejun Heo
      Commit e09b457b (block: simplify holder symlink handling) incorrectly
      assumed that there is only one link at maximum.  dm may use multiple
      links and expects block layer to track reference count for each link,
      which is different from and unrelated to the exclusive device holder
      identified by @holder when the device is opened.
      
      Remove the single holder assumption and automatic removal of the link
      and revive the per-link reference count tracking.  The code
      essentially behaves the same as before commit e09b457b sans the
      unnecessary kobject reference count dancing.
      
      While at it, note that this facility should not be used by anyone other
      than its current users.  Sysfs symlinks shouldn't be abused like this
      and the whole thing doesn't belong in the block layer at all.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Milan Broz <mbroz@redhat.com>
      Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
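Per-link reference counting of the kind this commit revives can be sketched as follows. This is a simplified user-space model under stated assumptions: the fixed-size table, the type names and `link_holder()`/`unlink_holder()` are hypothetical and are not the block layer's actual data structures.

```c
#include <stddef.h>

#define MAX_HOLDERS 8

/* One entry per distinct holder disk; refcnt counts how often it was linked. */
struct holder_link {
    const void *disk;
    int refcnt;
};

struct holder_table {
    struct holder_link links[MAX_HOLDERS];
};

static struct holder_link *find_link(struct holder_table *t, const void *disk)
{
    for (size_t i = 0; i < MAX_HOLDERS; i++)
        if (t->links[i].refcnt > 0 && t->links[i].disk == disk)
            return &t->links[i];
    return NULL;
}

/* Add a link, or bump the count if this disk is already linked.
 * Returns 0 on success, -1 if the table is full. */
int link_holder(struct holder_table *t, const void *disk)
{
    struct holder_link *l = find_link(t, disk);

    if (l) {
        l->refcnt++;
        return 0;
    }
    for (size_t i = 0; i < MAX_HOLDERS; i++) {
        if (t->links[i].refcnt == 0) {
            t->links[i].disk = disk;
            t->links[i].refcnt = 1;
            return 0;
        }
    }
    return -1;
}

/* Drop one reference. Returns the remaining count (0 means the link is
 * gone), or -1 if the disk was not linked at all. */
int unlink_holder(struct holder_table *t, const void *disk)
{
    struct holder_link *l = find_link(t, disk);

    if (!l)
        return -1;
    return --l->refcnt;
}
```

With this shape, a caller that links the same disk twice must also unlink it twice before the entry disappears, which is the behaviour a single-holder assumption cannot provide.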
  6. 14 Jan, 2011 12 commits
    • md: Fix removal of extra drives when converting RAID6 to RAID5 · bf2cb0da
      Committed by NeilBrown
      When a RAID6 is converted to a RAID5, the extra drive should
      be discarded.  However it isn't due to a typo in a comparison.
      
      This bug was introduced in commit e93f68a1 in 2.6.35-rc4
      and is suitable for any -stable since then.
      
      As the extra drive is not removed, the 'degraded' counter is wrong and
      so the RAID5 will not respond correctly to a subsequent failure.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: range check slot number when manually adding a spare. · ba1b41b6
      Committed by NeilBrown
      When adding a spare to an active array, we should check the slot
      number, but allow it to be larger than raid_disks if a reshape
      is being prepared.
      
      Apply the same test when adding a device to an
      array-under-construction.  It already had most of the test in place,
      but not quite all.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: fix sync_completed reporting for very large drives (>2TB) · 13ae864b
      Committed by Rémi Rérolle
      The values exported in the sync_completed file are unsigned long, which
      overflows with very large drives, resulting in wrong values reported.
      
      Since sync_completed uses sectors as unit, we'll start getting wrong
      values with components larger than 2TB.
      
      This patch simply replaces the use of unsigned long by unsigned long long.
      Signed-off-by: Rémi Rérolle <rrerolle@lacie.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
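The 2TB threshold follows from the arithmetic: with 512-byte sectors, a 32-bit counter can hold at most 2^32 sectors, which is 2 TiB. The sketch below assumes a platform where unsigned long is 32 bits wide and models it with `uint32_t`; the function names are hypothetical.

```c
#include <stdint.h>

#define SECTOR_SIZE 512ULL

/* Models a 32-bit unsigned long: the sector count wraps modulo 2^32. */
uint32_t sectors_as_u32(uint64_t bytes)
{
    return (uint32_t)(bytes / SECTOR_SIZE);
}

/* Models unsigned long long: wide enough for any realistic device size. */
uint64_t sectors_as_u64(uint64_t bytes)
{
    return bytes / SECTOR_SIZE;
}
```

For a 1 TiB component both functions agree, but for a 3 TiB component the 32-bit version silently drops the high bit of the sector count.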
    • md: allow suspend_lo and suspend_hi to decrease as well as increase. · 23ddff37
      Committed by NeilBrown
      The sysfs attributes 'suspend_lo' and 'suspend_hi' describe a region
      to which reads/writes are suspended so that the underlying data can be
      manipulated without user-space noticing.
      Currently the window they describe can only move forwards along the
      device.  However this is an unnecessary restriction which will cause
      problems with planned developments.
      So relax this restriction and allow these endpoints to move
      arbitrarily.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Don't let implementation detail of curr_resync leak out through sysfs. · 75d3da43
      Committed by NeilBrown
      mddev->curr_resync has artificial values of '1' and '2' which are used
      by the code which ensures only one resync is happening at a time on
      any given device.
      
      These values are internal and should never be exposed to user-space
      (except when translated appropriately as in the 'pending' status in
      /proc/mdstat).
      
      Unfortunately they are exposed, as ->curr_resync is assigned to
      ->curr_resync_completed and that value is directly visible through
      sysfs.

      So change the assignments to ->curr_resync_completed to get the same
      value from elsewhere in a form that doesn't have the magic '1' or '2'
      values.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: separate meta and data devs · a6ff7e08
      Committed by Jonathan Brassow
      Allow the metadata to be on a separate device from the
      data.
      
      This doesn't mean the data and metadata will be on separate
      physical devices - it simply gives device-mapper and userspace
      tools more flexibility.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md-new-param-to_sync_page_io · ccebd4c4
      Committed by Jonathan Brassow
      Add new parameter to 'sync_page_io'.
      
      The new parameter allows us to distinguish between metadata and data
      operations.  This becomes important later when we add the ability to
      use separate devices for data and metadata.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
    • md-new-param-to-calc_dev_sboffset · 57b2caa3
      Committed by Jonathan Brassow
      When we allow for separate devices for data and metadata
      in a later patch, we will need to be able to calculate
      the superblock offset based on more than the bdev.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
    • md: Be more careful about clearing flags bit in ->recovery · 7ebc0be7
      Committed by NeilBrown
      Setting ->recovery to 0 is generally not a good idea as it could clear
      bits that shouldn't be cleared.  In particular, MD_RECOVERY_FROZEN
      should only be cleared on explicit request from user-space.
      
      So when we need to clear things, just clear the bits that need
      clearing.
      
      As there are a few different places which reap a resync process - and
      some do an incomplete job - factor out the code for doing this from
      md_check_recovery and call that function instead of open coding part
      of it.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reported-by: Jonathan Brassow <jbrassow@redhat.com>
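The difference between zeroing a flags word and clearing selected bits can be sketched like this. The bit numbers and names below are hypothetical placeholders chosen for the illustration; they are not the kernel's MD_RECOVERY_* definitions.

```c
/* Hypothetical bit positions, for illustration only. */
enum {
    REC_RUNNING = 0,
    REC_SYNC    = 1,
    REC_FROZEN  = 2, /* must survive: only user-space may clear it */
    REC_DONE    = 3
};

/* Clear only the bits that belong to a finished resync and leave every
 * other bit (in particular REC_FROZEN) untouched. */
unsigned long reap_flags(unsigned long recovery)
{
    unsigned long mask = (1UL << REC_RUNNING) |
                         (1UL << REC_SYNC) |
                         (1UL << REC_DONE);

    return recovery & ~mask;
}
```

Assigning 0 instead would also drop the frozen bit, which is the side effect the commit message warns about.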
    • md: md_stop_writes requires mddev_lock. · defad61a
      Committed by NeilBrown
      As md_stop_writes manipulates the sync_thread and calls md_update_sb,
      it needs to be called with mddev_lock held.

      In all internal cases it is, but the symbol is exported for dm-raid to
      call and in that case the lock won't be held.
      So make an exported version which takes the lock, and an internal
      version which does not.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Ensure no IO request to get md device before it is properly initialised. · 0ca69886
      Committed by NeilBrown
      When an md device is in the process of coming on line it is possible
      for an IO request (typically a partition table probe) to get through
      before the array is fully initialised, which can cause unexpected
      behaviour (e.g. a crash).
      
      So explicitly record when the array is ready for IO and don't allow IO
      through until then.
      
      There is no possibility for a similar problem when the array is going
      off-line as there must only be one 'open' at that time, and it is busy
      off-lining the array and so cannot send IO requests.  So no memory
      barrier is needed in md_stop().
      
      This has been a bug since commit 409c57f3 in 2.6.30 which
      introduced md_make_request.  Before then, each personality would
      register its own make_request_fn when it was ready.
      This is suitable for any stable kernel from 2.6.30.y onwards.
      
      Cc: <stable@kernel.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
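The idea of explicitly recording readiness and refusing IO until then can be modelled with C11 atomics. This is a user-space sketch under assumptions: `struct array_dev`, `array_init()`, `mark_ready()` and `submit_io()` are hypothetical, and the acquire/release pair merely stands in for whatever barriers the kernel code uses.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical device model: 'ready' gates all IO. */
struct array_dev {
    atomic_bool ready;
    int handled; /* number of requests that were let through */
};

void array_init(struct array_dev *d)
{
    atomic_init(&d->ready, false);
    d->handled = 0;
}

/* Publish readiness. The release store makes all earlier initialisation
 * visible to any thread that later observes ready == true. */
void mark_ready(struct array_dev *d)
{
    atomic_store_explicit(&d->ready, true, memory_order_release);
}

/* Returns 0 if the request was accepted, -1 if the device is not ready. */
int submit_io(struct array_dev *d)
{
    if (!atomic_load_explicit(&d->ready, memory_order_acquire))
        return -1; /* fail early instead of touching half-built state */
    d->handled++;
    return 0;
}
```

An early partition-table probe would take the `-1` path here rather than reaching a personality that has not finished setting up.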
    • md: fix regression resulting in delays in clearing bits in a bitmap · 6c987910
      Committed by NeilBrown
      Commit 589a594b (2.6.37-rc4) fixed a problem where md_thread would
      sometimes call the ->run function at a bad time.
      
      If an error is detected during array start up after the md_thread has
      been started, the md_thread is killed.  This resulted in the ->run
      function being called once.  However the array may not be in a state
      that it is safe to call ->run.
      
      However the fix imposed meant that ->run was not called on a timeout.
      This means that when an array goes idle, bitmap bits do not get
      cleared promptly.  While the array is busy the bits will still be
      cleared when appropriate so this is not very serious.  There is no
      risk to data.
      
      Change the test so that we only avoid calling ->run when the thread
      is being stopped.  This more explicitly addresses the problem situation.
      
      This is suitable for 2.6.37-stable and any -stable kernel to which
      589a594b was applied.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
  7. 12 Jan, 2011 1 commit
    • md: fix regression with re-adding devices to arrays with no metadata · bf572541
      Committed by NeilBrown
      Commit 1a855a06 (2.6.37-rc4) fixed a problem where devices were
      re-added when they shouldn't be but caused a regression in a less
      common case that means sometimes devices cannot be re-added when they
      should be.
      
      In particular, when re-adding a device to an array without metadata
      we should always accept the device, but after the above commit we
      didn't.
      
      This patch sets the In_sync flag in that case so that the re-add
      succeeds.
      
      This patch is suitable for any -stable kernel to which 1a855a06 was
      applied.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
  8. 17 Dec, 2010 1 commit
    • block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead · e692cb66
      Committed by Martin K. Petersen
      When stacking devices, a request_queue is not always available. This
      forced us to have a no_cluster flag in the queue_limits that could be
      used as a carrier until the request_queue had been set up for a
      metadevice.
      
      There were several problems with that approach. First of all it was up
      to the stacking device to remember to set the queue flag after stacking had
      completed. Also, the queue flag and the queue limits had to be kept in
      sync at all times. We got that wrong, which could lead to us issuing
      commands that went beyond the max scatterlist limit set by the driver.
      
      The proper fix is to avoid having two flags for tracking the same thing.
      We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
      block layer merging functions. The queue_limit 'no_cluster' is turned
      into 'cluster' to avoid double negatives and to ease stacking.
      Clustering defaults to being enabled as before. The queue flag logic is
      removed from the stacking function, and explicitly setting the cluster
      flag is no longer necessary in DM and MD.
      Reported-by: Ed Lin <ed.lin@promise.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
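Why a positive 'cluster' limit is easier to stack than a negative 'no_cluster' one can be sketched as below. The struct and function names are hypothetical, and combining the values with a bitwise AND is an assumption made for this illustration; the commit message itself only states that the rename eases stacking and that clustering defaults to enabled.

```c
/* Hypothetical, reduced view of a queue-limits structure. */
struct limits_sketch {
    unsigned char cluster; /* 1 = segments may be merged, 0 = they may not */
};

/* Clustering is enabled by default. */
void limits_set_default(struct limits_sketch *l)
{
    l->cluster = 1;
}

/* Stack a lower device under 'top': the result allows clustering only if
 * every device in the stack allows it. */
void limits_stack(struct limits_sketch *top, const struct limits_sketch *bottom)
{
    top->cluster &= bottom->cluster;
}
```

Once any lower device disables clustering, the stacked limit stays at 0 no matter how many further devices are added, and there is no separate queue flag left to fall out of sync.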
  9. 09 Dec, 2010 5 commits
    • md: protect against NULL reference when waiting to start a raid10. · 589a594b
      Committed by NeilBrown
      When we fail to start a raid10 for some reason, we call
      md_unregister_thread to kill the thread that was created.
      
      Unfortunately md_thread() will then make one call into the handler
      (raid10d) even though md_wakeup_thread has not been called.  This is
      not safe and as md_unregister_thread is called after mddev->private
      has been set to NULL, it will definitely cause a NULL dereference.
      
      So fix this at both ends:
       - md_thread should only call the handler if THREAD_WAKEUP has been
         set.
       - raid10 should call md_unregister_thread before setting things
         to NULL just like all the other raid modules do.
      
      This is applicable to 2.6.35 and later.
      
      Cc: stable@kernel.org
      Reported-by: "Citizen" <citizen_lee@thecus.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
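The first half of this fix, calling the handler only when a wakeup was actually requested, can be sketched as a single loop iteration. This is a simplified single-threaded model: `struct worker`, `count_run()` and `worker_iteration()` are hypothetical names, and the plain `bool` stands in for the THREAD_WAKEUP flag.

```c
#include <stdbool.h>

struct worker {
    bool wakeup; /* stands in for THREAD_WAKEUP */
    int runs;    /* how many times the handler has been invoked */
    void (*run)(struct worker *);
};

/* Example handler that only counts its invocations. */
void count_run(struct worker *w)
{
    w->runs++;
}

/* One pass of the thread loop: invoke the handler only if a wakeup was
 * requested, and consume the request. Returns true if the handler ran. */
bool worker_iteration(struct worker *w)
{
    if (!w->wakeup)
        return false;
    w->wakeup = false;
    w->run(w);
    return true;
}
```

A thread that is being torn down without a pending wakeup therefore never enters the handler, so the handler cannot observe state that has already been set to NULL.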
    • md: fix bug with re-adding of partially recovered device. · 1a855a06
      Committed by NeilBrown
      With v0.90 metadata, a hot-spare does not become a full member of the
      array until recovery is complete.  So if we re-add such a device to
      the array, we know that all of it is as up-to-date as the event count
      would suggest, and so a bitmap-based recovery is possible.

      However with v1.x metadata, the hot-spare immediately becomes a full
      member of the array, but it records how much of the device has been
      recovered.  If the array is stopped and re-assembled recovery starts
      from this point.

      When such a device is hot-added to an array we currently lose the 'how
      much is recovered' information and incorrectly include it as a full
      in-sync member (after bitmap-based fixup).
      This is wrong and unsafe and could corrupt data.
      
      So be more careful about setting saved_raid_disk - which is what
      guides the re-adding of devices back into an array.
      The new code matches the code in slot_store which does a similar
      thing, which is encouraging.
      
      This is suitable for any -stable kernel.
      Reported-by: "Dailey, Nate" <Nate.Dailey@stratus.com>
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: fix possible deadlock in handling flush requests. · a035fc3e
      Committed by NeilBrown
      As recorded in
          https://bugzilla.kernel.org/show_bug.cgi?id=24012
      
      it is possible for a flush request through md to hang.  This is due to
      an interaction between the recursion avoidance in
      generic_make_request, the insistence in md of only having one flush
      active at a time, and the possibility of dm (or md) submitting two
      flush requests to a device from the one generic_make_request.
      
      If a generic_make_request call into dm causes two flush requests to be
      queued (as happens if the dm table has two targets - they get one
      each), these two will be queued inside generic_make_request.
      
      Assume they are for the same md device.
      The first is processed and causes 1 or more flush requests to be sent
      to lower devices.  These get queued within generic_make_request too.
      Then the second flush to the md device gets handled and it blocks
      waiting for the first flush to complete.  But it won't complete until
      the two lower-device requests complete, and they haven't even been
      submitted yet as they are on the generic_make_request queue.
      
      The deadlock can be broken by using a separate thread to submit the
      requests to lower devices.  md has such a thread readily available:
      md_wq.
      
      So use it to submit these requests.
      Reported-by: Giacomo Catenazzi <cate@cateee.net>
      Tested-by: Giacomo Catenazzi <cate@cateee.net>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: move code in to submit_flushes. · a7a07e69
      Committed by NeilBrown
      submit_flushes is called from exactly one place.
      Move the code that is before and after that call into
      submit_flushes.
      
      This has no functional change, but will make the next patch
      smaller and easier to follow.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: remove handling of flush_pending in md_submit_flush_data · 2b74e12e
      Committed by NeilBrown
      None of the functions called between setting flush_pending to 1, and
      atomic_dec_and_test can change flush_pending, nor will anything
      running in any other thread (as ->flush_bio is not NULL).  So the
      atomic_dec_and_test will always succeed.
      So remove the atomic_set and the atomic_dec_and_test.
      Signed-off-by: NeilBrown <neilb@suse.de>
  10. 24 Nov, 2010 2 commits
    • md: Call blk_queue_flush() to establish flush/fua support · be20e6c6
      Committed by Darrick J. Wong
      Before 2.6.37, the md layer had a mechanism for catching I/Os with the
      barrier flag set, and translating the barrier into barriers for all
      the underlying devices.  With 2.6.37, I/O barriers have become plain
      old flushes, and the md code was updated to reflect this.  However,
      one piece was left out -- the md layer does not tell the block layer
      that it supports flushes or FUA access at all, which results in md
      silently dropping flush requests.
      
      Since the support already seems to be there, just add this one piece
      of bookkeeping.
      Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: fix return value of rdev_size_change() · c26a44ed
      Committed by Justin Maggard
      When trying to grow an array by enlarging component devices,
      rdev_size_store() expects the return value of rdev_size_change() to be
      in sectors, but the actual value is returned in KBs.
      
      This functionality was broken by commit
           dd8ac336
      so this patch is suitable for any kernel since 2.6.30.
      
      Cc: stable@kernel.org
      Signed-off-by: Justin Maggard <jmaggard10@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
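The unit mismatch behind this bug comes down to a factor of two: with 512-byte sectors, 1 KB corresponds to 2 sectors. The helpers below are hypothetical and only illustrate the conversion a caller expecting sectors depends on.

```c
typedef unsigned long long sector_count_t;

/* 1 KB = 1024 bytes = 2 sectors of 512 bytes. */
sector_count_t kb_to_sectors(unsigned long long kb)
{
    return kb * 2;
}

/* Inverse conversion; an odd trailing sector is rounded down. */
unsigned long long sectors_to_kb(sector_count_t sectors)
{
    return sectors / 2;
}
```

A caller that treats a value in KB as if it were in sectors sees a device half as large as it really is, for example 1024 instead of 2048 sectors for a 1 MB region.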
  11. 13 Nov, 2010 3 commits
    • block: clean up blkdev_get() wrappers and their users · d4d77629
      Committed by Tejun Heo
      After recent blkdev_get() modifications, open_by_devnum() and
      open_bdev_exclusive() are simple wrappers around blkdev_get().
      Replace them with blkdev_get_by_dev() and blkdev_get_by_path().
      
      blkdev_get_by_dev() is identical to open_by_devnum().
      blkdev_get_by_path() is slightly different in that it doesn't
      automatically add %FMODE_EXCL to @mode.
      
      All users are converted.  Most conversions are mechanical and don't
      introduce any behavior difference.  There are several exceptions.
      
      * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
        reason to OR it explicitly on blkdev_put().
      
      * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
        sb->s_mode.
      
      * With the above changes, sb->s_mode now always should contain
        FMODE_EXCL.  WARN_ON_ONCE() added to kill_block_super() to detect
        errors.
      
      The new blkdev_get_*() functions come with proper docbook comments.
      While at it, add function description to blkdev_get() too.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Joern Engel <joern@lazybastard.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: xfs-masters@oss.sgi.com
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    • block: make blkdev_get/put() handle exclusive access · e525fd89
      Committed by Tejun Heo
      Over time, block layer has accumulated a set of APIs dealing with bdev
      open, close, claim and release.
      
      * blkdev_get/put() are the primary open and close functions.
      
      * bd_claim/release() deal with exclusive open.
      
      * open/close_bdev_exclusive() are a combination of open and claim and
        the other way around, respectively.
      
      * bd_link/unlink_disk_holder() to create and remove holder/slave
        symlinks.
      
      * open_by_devnum() wraps bdget() + blkdev_get().
      
      The interface is a bit confusing and the decoupling of open and claim
      makes it impossible to properly guarantee exclusive access as
      in-kernel open + claim sequence can disturb the existing exclusive
      open even before the block layer knows the current open is for another
      exclusive access.  Reorganize the interface such that,
      
      * blkdev_get() is extended to include exclusive access management.
        @holder argument is added and, if @FMODE_EXCL is specified, it will
        gain exclusive access atomically w.r.t. other exclusive accesses.
      
      * blkdev_put() is similarly extended.  It now takes @mode argument and
        if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
        the last exclusive claim is released, the holder/slave symlinks are
        removed automatically.
      
      * bd_claim/release() and close_bdev_exclusive() are no longer
        necessary and either made static or removed.
      
      * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
        is no longer necessary and removed.
      
      * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
        and blkdev_get().  It also has an unexpected extra bdev_read_only()
        test which probably should be moved into blkdev_get().
      
      * open_by_devnum() is modified to take @holder argument and pass it to
        blkdev_get().
      
      Most of bdev open/close operations are unified into blkdev_get/put()
      and most exclusive accesses are tested atomically at the open time (as
      it should).  This cleans up code and removes some, both valid and
      invalid, but unnecessary all the same, corner cases.
      
      open_bdev_exclusive() and open_by_devnum() can use further cleanup -
      rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
      special features.  Well, let's leave them for another day.
      
      Most conversions are straight-forward.  drbd conversion is a bit more
      involved as there was some reordering, but the logic should stay the
      same.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Brown <neilb@suse.de>
      Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: dm-devel@redhat.com
      Cc: drbd-dev@lists.linbit.com
      Cc: Leo Chen <leochen@broadcom.com>
      Cc: Scott Branden <sbranden@broadcom.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    • block: simplify holder symlink handling · e09b457b
      Committed by Tejun Heo
      Code to manage symlinks in /sys/block/*/{holders|slaves} is overly
      complex with multiple holder considerations, redundant extra
      references to all involved kobjects, unused generic kobject holder
      support and unnecessary mixup with bd_claim/release functionalities.
      
      Strip it down to what's necessary (single gendisk holder) and make it
      use a separate interface.  This is a step for cleaning up
      bd_claim/release.  This patch makes dm-table slightly more complex but
      it will be simplified again with further changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Brown <neilb@suse.de>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
  12. 10 Nov, 2010 1 commit
  13. 28 Oct, 2010 5 commits
    • md: use separate bio pool for each md device. · a167f663
      Committed by NeilBrown
      bio_clone and bio_alloc allocate from a common bio pool.
      If an md device is stacked with other devices that use this pool, or under
      something like swap which uses the pool, then the multiple calls on
      the pool can cause deadlocks.
      
      So allocate a local bio pool for each md array and use that rather
      than the common pool.
      
      This pool is used both for regular IO and metadata updates.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: change type of first arg to sync_page_io. · 2b193363
      Committed by NeilBrown
      Currently sync_page_io takes a 'bdev'.
      Every caller passes 'rdev->bdev'.
      We will soon want another field out of the rdev in sync_page_io,
      so just pass the rdev instead of its bdev.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: fix and update workqueue usage · e804ac78
      Committed by Tejun Heo
      Workqueue usage in md has two problems.
      
      * Flush can be used during or depended upon by memory reclaim, but md
        uses the system workqueue for flush_work which may lead to deadlock.
      
      * md depends on flush_scheduled_work() to achieve exclusion against
        completion of removal of previous instances.  flush_scheduled_work()
        may incur an unexpected amount of delay and is scheduled to be removed.
      
      This patch adds two workqueues to md - md_wq and md_misc_wq.  The
      former is guaranteed to make forward progress under memory pressure
      and serves flush_work.  The latter serves as the flush domain for
      other works.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: remove md_mutex locking. · 4b532c9b
      Committed by NeilBrown
      lock_kernel calls were recently pushed down into open/release
      functions.
      md doesn't need that protection.
      Then the BKL calls were changed to md_mutex.  We don't need those
      either.
      So remove it all.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Fix regression with raid1 arrays without persistent metadata. · d97a41dc
      Committed by NeilBrown
      A RAID1 which has no persistent metadata, whether internal or
      external, will hang on the first write.
      This is caused by commit 070dc6dd.
      In that case, MD_CHANGE_PENDING never gets cleared.

      So during md_update_sb, if the metadata is neither persistent nor
      external, clear MD_CHANGE_PENDING.
      
      This is suitable for 2.6.36-stable.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
  14. 05 Oct, 2010 1 commit
    • block: autoconvert trivial BKL users to private mutex · 2a48fc0a
      Committed by Arnd Bergmann
      The block device drivers have all gained new lock_kernel
      calls from a recent pushdown, and some of the drivers
      were already using the BKL before.
      
      This turns the BKL into a set of per-driver mutexes.
      Still need to check whether this is safe to do.
      
      file=$1
      name=$2
      if grep -q lock_kernel ${file} ; then
          if grep -q 'include.*linux.mutex.h' ${file} ; then
                  sed -i '/include.*<linux\/smp_lock.h>/d' ${file}
          else
                  sed -i 's/include.*<linux\/smp_lock.h>.*$/include <linux\/mutex.h>/g' ${file}
          fi
          sed -i ${file} \
              -e "/^#include.*linux.mutex.h/,$ {
                      1,/^\(static\|int\|long\)/ {
                           /^\(static\|int\|long\)/istatic DEFINE_MUTEX(${name}_mutex);
      
      } }"  \
          -e "s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g" \
          -e '/[      ]*cycle_kernel_lock();/d'
      else
          sed -i -e '/include.*\<smp_lock.h\>/d' ${file}  \
                      -e '/cycle_kernel_lock()/d'
      fi
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  15. 17 Sep, 2010 1 commit
    • md: fix v1.x metadata update when a disk is missing. · ddcf3522
      Committed by NeilBrown
      If an array with 1.x metadata is assembled with the last disk missing,
      md doesn't properly record the fact that the disk was missing.
      
      This is unlikely to cause a real problem as the event count will be
      different to the count on the missing disk so it won't be included in
      the array.  However it could still cause confusion.
      
      So make sure we clear all the relevant slots, not just the early ones.
      Signed-off-by: NeilBrown <neilb@suse.de>