1. 14 1月, 2011 1 次提交
    • N
      md: fix regression resulting in delays in clearing bits in a bitmap · 6c987910
      NeilBrown 提交于
      commit 589a594b (2.6.37-rc4) fixed a problem were md_thread would
      sometimes call the ->run function at a bad time.
      
      If an error is detected during array start up after the md_thread has
      been started, the md_thread is killed.  This resulted in the ->run
      function being called once.  However the array may not be in a state
      that it is safe to call ->run.
      
      However the fix imposed meant that  ->run was not called on a timeout.
      This means that when an array goes idle, bitmap bits do not get
      cleared promptly.  While the array is busy the bits will still be
      cleared when appropriate so this is not very serious.  There is no
      risk to data.
      
      Change the test so that we only avoid calling ->run when the thread
      is being stopped.  This more explicitly addresses the problem situation.
      
      This is suitable for 2.6.37-stable and any -stable kernel to which
      589a594b was applied.
      
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      6c987910
  2. 12 1月, 2011 1 次提交
    • N
      md: fix regression with re-adding devices to arrays with no metadata · bf572541
      NeilBrown 提交于
      Commit 1a855a06 (2.6.37-rc4) fixed a problem where devices were
      re-added when they shouldn't be but caused a regression in a less
      common case that means sometimes devices cannot be re-added when they
      should be.
      
      In particular, when re-adding a device to an array without metadata
      we should always access the device, but after the above commit we
      didn't.
      
      This patch sets the In_sync flag in that case so that the re-add
      succeeds.
      
      This patch is suitable for any -stable kernel to which 1a855a06 was
      applied.
      
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      bf572541
  3. 17 12月, 2010 2 次提交
    • M
      block: max hardware sectors limit wrapper · 72d4cd9f
      Mike Snitzer 提交于
      Implement blk_limits_max_hw_sectors() and make
      blk_queue_max_hw_sectors() a wrapper around it.
      
      DM needs this to avoid setting queue_limits' max_hw_sectors and
      max_sectors directly.  dm_set_device_limits() now leverages
      blk_limits_max_hw_sectors() logic to establish the appropriate
      max_hw_sectors minimum (PAGE_SIZE).  Fixes issue where DM was
      incorrectly setting max_sectors rather than max_hw_sectors (which
      caused dm_merge_bvec()'s max_hw_sectors check to be ineffective).
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@kernel.org
      Acked-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      72d4cd9f
    • M
      block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead · e692cb66
      Martin K. Petersen 提交于
      When stacking devices, a request_queue is not always available. This
      forced us to have a no_cluster flag in the queue_limits that could be
      used as a carrier until the request_queue had been set up for a
      metadevice.
      
      There were several problems with that approach. First of all it was up
      to the stacking device to remember to set queue flag after stacking had
      completed. Also, the queue flag and the queue limits had to be kept in
      sync at all times. We got that wrong, which could lead to us issuing
      commands that went beyond the max scatterlist limit set by the driver.
      
      The proper fix is to avoid having two flags for tracking the same thing.
      We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
      block layer merging functions. The queue_limit 'no_cluster' is turned
      into 'cluster' to avoid double negatives and to ease stacking.
      Clustering defaults to being enabled as before. The queue flag logic is
      removed from the stacking function, and explicitly setting the cluster
      flag is no longer necessary in DM and MD.
      Reported-by: NEd Lin <ed.lin@promise.com>
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      e692cb66
  4. 09 12月, 2010 5 次提交
    • N
      md: protect against NULL reference when waiting to start a raid10. · 589a594b
      NeilBrown 提交于
      When we fail to start a raid10 for some reason, we call
      md_unregister_thread to kill the thread that was created.
      
      Unfortunately md_thread() will then make one call into the handler
      (raid10d) even though md_wakeup_thread has not been called.  This is
      not safe and as md_unregister_thread is called after mddev->private
      has been set to NULL, it will definitely cause a NULL dereference.
      
      So fix this at both ends:
       - md_thread should only call the handler if THREAD_WAKEUP has been
         set.
       - raid10 should call md_unregister_thread before setting things
         to NULL just like all the other raid modules do.
      
      This is applicable to 2.6.35 and later.
      
      Cc: stable@kernel.org
      Reported-by: N"Citizen" <citizen_lee@thecus.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      589a594b
    • N
      md: fix bug with re-adding of partially recovered device. · 1a855a06
      NeilBrown 提交于
      With v0.90 metadata, a hot-spare does not become a full member of the
      array until recovery is complete.  So if we re-add such a device to
      the array, we know that all of it is as up-to-date as the event count
      would suggest, and so it a bitmap-based recovery is possible.
      
      However with v1.x metadata, the hot-spare immediately becomes a full
      member of the array, but it record how much of the device has been
      recovered.  If the array is stopped and re-assembled recovery starts
      from this point.
      
      When such a device is hot-added to an array we currently lose the 'how
      much is recovered' information and incorrectly included it as a full
      in-sync member (after bitmap-based fixup).
      This is wrong and unsafe and could corrupt data.
      
      So be more careful about setting saved_raid_disk - which is what
      guides the re-adding of devices back into an array.
      The new code matches the code in slot_store which does a similar
      thing, which is encouraging.
      
      This is suitable for any -stable kernel.
      Reported-by: N"Dailey, Nate" <Nate.Dailey@stratus.com>
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      1a855a06
    • N
      md: fix possible deadlock in handling flush requests. · a035fc3e
      NeilBrown 提交于
      As recorded in
          https://bugzilla.kernel.org/show_bug.cgi?id=24012
      
      it is possible for a flush request through md to hang.  This is due to
      an interaction between the recursion avoidance in
      generic_make_request, the insistence in md of only having one flush
      active at a time, and the possibility of dm (or md) submitting two
      flush requests to a device from the one generic_make_request.
      
      If a generic_make_request call into dm causes two flush requests to be
      queued (as happens if the dm table has two targets - they get one
      each), these two will be queued inside generic_make_request.
      
      Assume they are for the same md device.
      The first is processed and causes 1 or more flush requests to be sent
      to lower devices.  These get queued within generic_make_request too.
      Then the second flush to the md device gets handled and it blocks
      waiting for the first flush to complete.  But it won't complete until
      the two lower-device requests complete, and they haven't even been
      submitted yet as they are on the generic_make_request queue.
      
      The deadlock can be broken by using a separate thread to submit the
      requests to lower devices.  md has such a thread readily available:
      md_wq.
      
      So use it to submit these requests.
      Reported-by: NGiacomo Catenazzi <cate@cateee.net>
      Tested-by: NGiacomo Catenazzi <cate@cateee.net>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a035fc3e
    • N
      md: move code in to submit_flushes. · a7a07e69
      NeilBrown 提交于
      submit_flushes is called from exactly one place.
      Move the code that is before and after that call into
      submit_flushes.
      
      This has not functional change, but will make the next patch
      smaller and easier to follow.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a7a07e69
    • N
      md: remove handling of flush_pending in md_submit_flush_data · 2b74e12e
      NeilBrown 提交于
      None of the functions called between setting flush_pending to 1, and
      atomic_dec_and_test can change flush_pending, or will anything
      running in any other thread (as ->flush_bio is not NULL).  So the
      atomic_dec_and_test will always succeed.
      So remove the atomic_sec and the atomic_dec_and_test.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2b74e12e
  5. 24 11月, 2010 3 次提交
    • D
      md: Call blk_queue_flush() to establish flush/fua support · be20e6c6
      Darrick J. Wong 提交于
      Before 2.6.37, the md layer had a mechanism for catching I/Os with the
      barrier flag set, and translating the barrier into barriers for all
      the underlying devices.  With 2.6.37, I/O barriers have become plain
      old flushes, and the md code was updated to reflect this.  However,
      one piece was left out -- the md layer does not tell the block layer
      that it supports flushes or FUA access at all, which results in md
      silently dropping flush requests.
      
      Since the support already seems there, just add this one piece of
      bookkeeping.
      Signed-off-by: NDarrick J. Wong <djwong@us.ibm.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      be20e6c6
    • N
      md/raid1: really fix recovery looping when single good device fails. · 8f9e0ee3
      NeilBrown 提交于
      Commit 4044ba58 supposedly fixed a
      problem where if a raid1 with just one good device gets a read-error
      during recovery, the recovery would abort and immediately restart in
      an infinite loop.
      
      However it depended on raid1_remove_disk removing the spare device
      from the array.  But that does not happen in this case.  So add a test
      so that in the 'recovery_disabled' case, the device will be removed.
      
      This suitable for any kernel since 2.6.29 which is when
      recovery_disabled was introduced.
      
      Cc: stable@kernel.org
      Reported-by: NSebastian Färber <faerber@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      8f9e0ee3
    • J
      md: fix return value of rdev_size_change() · c26a44ed
      Justin Maggard 提交于
      When trying to grow an array by enlarging component devices,
      rdev_size_store() expects the return value of rdev_size_change() to be
      in sectors, but the actual value is returned in KBs.
      
      This functionality was broken by commit
           dd8ac336
      so this patch is suitable for any kernel since 2.6.30.
      
      Cc: stable@kernel.org
      Signed-off-by: NJustin Maggard <jmaggard10@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c26a44ed
  6. 10 11月, 2010 1 次提交
  7. 29 10月, 2010 4 次提交
  8. 28 10月, 2010 9 次提交
    • N
      md: use separate bio pool for each md device. · a167f663
      NeilBrown 提交于
      bio_clone and bio_alloc allocate from a common bio pool.
      If an md device is stacked with other devices that use this pool, or under
      something like swap which uses the pool, then the multiple calls on
      the pool can cause deadlocks.
      
      So allocate a local bio pool for each md array and use that rather
      than the common pool.
      
      This pool is used both for regular IO and metadata updates.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a167f663
    • N
      md: change type of first arg to sync_page_io. · 2b193363
      NeilBrown 提交于
      Currently sync_page_io takes a 'bdev'.
      Every caller passes 'rdev->bdev'.
      We will soon want another field out of the rdev in sync_page_io,
      So just pass the rdev instead of the bdev out of it.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2b193363
    • N
      md/raid1: perform mem allocation before disabling writes during resync. · 1c4588e9
      NeilBrown 提交于
      Though this mem alloc is GFP_NOIO an so will not deadlock, it seems
      better to do the allocation before 'raise_barrier' which stops any IO
      requests while the resync proceeds.
      
      raid10 always uses this order, so it is at least consistent to do the
      same in raid1.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      1c4588e9
    • N
      md: use bio_kmalloc rather than bio_alloc when failure is acceptable. · 6746557f
      NeilBrown 提交于
      bio_alloc can never fail (as it uses a mempool) but an block
      indefinitely, especially if the caller is holding a reference to a
      previously allocated bio.
      
      So these to places which both handle failure and hold multiple bios
      should not use bio_alloc, they should use bio_kmalloc.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      6746557f
    • N
      md: Fix possible deadlock with multiple mempool allocations. · 4e78064f
      NeilBrown 提交于
      It is not safe to allocate from a mempool while holding an item
      previously allocated from that mempool as that can deadlock when the
      mempool is close to exhaustion.
      
      So don't use a bio list to collect the bios to write to multiple
      devices in raid1 and raid10.
      Instead queue each bio as it becomes available so an unplug will
      activate all previously allocated bios and so a new bio has a chance
      of being allocated.
      
      This means we must set the 'remaining' count to '1' before submitting
      any requests, then when all are submitted, decrement 'remaining' and
      possible handle the write completion at that point.
      Reported-by: NTorsten Kaiser <just.for.lkml@googlemail.com>
      Tested-by: NTorsten Kaiser <just.for.lkml@googlemail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      4e78064f
    • T
      md: fix and update workqueue usage · e804ac78
      Tejun Heo 提交于
      Workqueue usage in md has two problems.
      
      * Flush can be used during or depended upon by memory reclaim, but md
        uses the system workqueue for flush_work which may lead to deadlock.
      
      * md depends on flush_scheduled_work() to achieve exclusion against
        completion of removal of previous instances.  flush_scheduled_work()
        may incur unexpected amount of delay and is scheduled to be removed.
      
      This patch adds two workqueues to md - md_wq and md_misc_wq.  The
      former is guaranteed to make forward progress under memory pressure
      and serves flush_work.  The latter serves as the flush domain for
      other works.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      e804ac78
    • N
      md: use sector_t in bitmap_get_counter · 57dab0bd
      NeilBrown 提交于
      bitmap_get_counter returns the number of sectors covered
      by the counter in a pass-by-reference variable.
      In some cases this can be very large, so make it a sector_t
      for safety.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      57dab0bd
    • N
      md: remove md_mutex locking. · 4b532c9b
      NeilBrown 提交于
      lock_kernel calls were recently pushed down into open/release
      functions.
      md doesn't need that protection.
      Then the BKL calls were change to md_mutex.  We don't need those
      either.
      So remove it all.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      4b532c9b
    • N
      md: Fix regression with raid1 arrays without persistent metadata. · d97a41dc
      NeilBrown 提交于
      A RAID1 which has no persistent metadata, whether internal or
      external, will hang on the first write.
      This is caused by commit  070dc6dd
      In that case, MD_CHANGE_PENDING never gets cleared.
      
      So during md_update_sb, is neither persistent or external,
      clear MD_CHANGE_PENDING.
      
      This is suitable for 2.6.36-stable.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
      d97a41dc
  9. 27 10月, 2010 1 次提交
  10. 15 10月, 2010 1 次提交
    • A
      llseek: automatically add .llseek fop · 6038f373
      Arnd Bergmann 提交于
      All file_operations should get a .llseek operation so we can make
      nonseekable_open the default for future file operations without a
      .llseek pointer.
      
      The three cases that we can automatically detect are no_llseek, seq_lseek
      and default_llseek. For cases where we can we can automatically prove that
      the file offset is always ignored, we use noop_llseek, which maintains
      the current behavior of not returning an error from a seek.
      
      New drivers should normally not use noop_llseek but instead use no_llseek
      and call nonseekable_open at open time.  Existing drivers can be converted
      to do the same when the maintainer knows for certain that no user code
      relies on calling seek on the device file.
      
      The generated code is often incorrectly indented and right now contains
      comments that clarify for each added line why a specific variant was
      chosen. In the version that gets submitted upstream, the comments will
      be gone and I will manually fix the indentation, because there does not
      seem to be a way to do that using coccinelle.
      
      Some amount of new code is currently sitting in linux-next that should get
      the same modifications, which I will do at the end of the merge window.
      
      Many thanks to Julia Lawall for helping me learn to write a semantic
      patch that does all this.
      
      ===== begin semantic patch =====
      // This adds an llseek= method to all file operations,
      // as a preparation for making no_llseek the default.
      //
      // The rules are
      // - use no_llseek explicitly if we do nonseekable_open
      // - use seq_lseek for sequential files
      // - use default_llseek if we know we access f_pos
      // - use noop_llseek if we know we don't access f_pos,
      //   but we still want to allow users to call lseek
      //
      @ open1 exists @
      identifier nested_open;
      @@
      nested_open(...)
      {
      <+...
      nonseekable_open(...)
      ...+>
      }
      
      @ open exists@
      identifier open_f;
      identifier i, f;
      identifier open1.nested_open;
      @@
      int open_f(struct inode *i, struct file *f)
      {
      <+...
      (
      nonseekable_open(...)
      |
      nested_open(...)
      )
      ...+>
      }
      
      @ read disable optional_qualifier exists @
      identifier read_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      expression E;
      identifier func;
      @@
      ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
      {
      <+...
      (
         *off = E
      |
         *off += E
      |
         func(..., off, ...)
      |
         E = *off
      )
      ...+>
      }
      
      @ read_no_fpos disable optional_qualifier exists @
      identifier read_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      @@
      ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
      {
      ... when != off
      }
      
      @ write @
      identifier write_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      expression E;
      identifier func;
      @@
      ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
      {
      <+...
      (
        *off = E
      |
        *off += E
      |
        func(..., off, ...)
      |
        E = *off
      )
      ...+>
      }
      
      @ write_no_fpos @
      identifier write_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      @@
      ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
      {
      ... when != off
      }
      
      @ fops0 @
      identifier fops;
      @@
      struct file_operations fops = {
       ...
      };
      
      @ has_llseek depends on fops0 @
      identifier fops0.fops;
      identifier llseek_f;
      @@
      struct file_operations fops = {
      ...
       .llseek = llseek_f,
      ...
      };
      
      @ has_read depends on fops0 @
      identifier fops0.fops;
      identifier read_f;
      @@
      struct file_operations fops = {
      ...
       .read = read_f,
      ...
      };
      
      @ has_write depends on fops0 @
      identifier fops0.fops;
      identifier write_f;
      @@
      struct file_operations fops = {
      ...
       .write = write_f,
      ...
      };
      
      @ has_open depends on fops0 @
      identifier fops0.fops;
      identifier open_f;
      @@
      struct file_operations fops = {
      ...
       .open = open_f,
      ...
      };
      
      // use no_llseek if we call nonseekable_open
      ////////////////////////////////////////////
      @ nonseekable1 depends on !has_llseek && has_open @
      identifier fops0.fops;
      identifier nso ~= "nonseekable_open";
      @@
      struct file_operations fops = {
      ...  .open = nso, ...
      +.llseek = no_llseek, /* nonseekable */
      };
      
      @ nonseekable2 depends on !has_llseek @
      identifier fops0.fops;
      identifier open.open_f;
      @@
      struct file_operations fops = {
      ...  .open = open_f, ...
      +.llseek = no_llseek, /* open uses nonseekable */
      };
      
      // use seq_lseek for sequential files
      /////////////////////////////////////
      @ seq depends on !has_llseek @
      identifier fops0.fops;
      identifier sr ~= "seq_read";
      @@
      struct file_operations fops = {
      ...  .read = sr, ...
      +.llseek = seq_lseek, /* we have seq_read */
      };
      
      // use default_llseek if there is a readdir
      ///////////////////////////////////////////
      @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier readdir_e;
      @@
      // any other fop is used that changes pos
      struct file_operations fops = {
      ... .readdir = readdir_e, ...
      +.llseek = default_llseek, /* readdir is present */
      };
      
      // use default_llseek if at least one of read/write touches f_pos
      /////////////////////////////////////////////////////////////////
      @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read.read_f;
      @@
      // read fops use offset
      struct file_operations fops = {
      ... .read = read_f, ...
      +.llseek = default_llseek, /* read accesses f_pos */
      };
      
      @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier write.write_f;
      @@
      // write fops use offset
      struct file_operations fops = {
      ... .write = write_f, ...
      +	.llseek = default_llseek, /* write accesses f_pos */
      };
      
      // Use noop_llseek if neither read nor write accesses f_pos
      ///////////////////////////////////////////////////////////
      
      @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read_no_fpos.read_f;
      identifier write_no_fpos.write_f;
      @@
      // write fops use offset
      struct file_operations fops = {
      ...
       .write = write_f,
       .read = read_f,
      ...
      +.llseek = noop_llseek, /* read and write both use no f_pos */
      };
      
      @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier write_no_fpos.write_f;
      @@
      struct file_operations fops = {
      ... .write = write_f, ...
      +.llseek = noop_llseek, /* write uses no f_pos */
      };
      
      @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read_no_fpos.read_f;
      @@
      struct file_operations fops = {
      ... .read = read_f, ...
      +.llseek = noop_llseek, /* read uses no f_pos */
      };
      
      @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      @@
      struct file_operations fops = {
      ...
      +.llseek = noop_llseek, /* no read or write fn */
      };
      ===== End semantic patch =====
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Julia Lawall <julia@diku.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      6038f373
  11. 07 10月, 2010 3 次提交
  12. 05 10月, 2010 1 次提交
    • A
      block: autoconvert trivial BKL users to private mutex · 2a48fc0a
      Arnd Bergmann 提交于
      The block device drivers have all gained new lock_kernel
      calls from a recent pushdown, and some of the drivers
      were already using the BKL before.
      
      This turns the BKL into a set of per-driver mutexes.
      Still need to check whether this is safe to do.
      
      file=$1
      name=$2
      if grep -q lock_kernel ${file} ; then
          if grep -q 'include.*linux.mutex.h' ${file} ; then
                  sed -i '/include.*<linux\/smp_lock.h>/d' ${file}
          else
                  sed -i 's/include.*<linux\/smp_lock.h>.*$/include <linux\/mutex.h>/g' ${file}
          fi
          sed -i ${file} \
              -e "/^#include.*linux.mutex.h/,$ {
                      1,/^\(static\|int\|long\)/ {
                           /^\(static\|int\|long\)/istatic DEFINE_MUTEX(${name}_mutex);
      
      } }"  \
          -e "s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g" \
          -e '/[      ]*cycle_kernel_lock();/d'
      else
          sed -i -e '/include.*\<smp_lock.h\>/d' ${file}  \
                      -e '/cycle_kernel_lock()/d'
      fi
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      2a48fc0a
  13. 17 9月, 2010 2 次提交
    • N
      md: fix v1.x metadata update when a disk is missing. · ddcf3522
      NeilBrown 提交于
      If an array with 1.x metadata is assembled with the last disk missing,
      md doesn't properly record the fact that the disk was missing.
      
      This is unlikely to cause a real problem as the event count will be
      different to the count on the missing disk so it won't be included in
      the array.  However it could still cause confusion.
      
      So make sure we clear all the relevant slots, not just the early ones.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      ddcf3522
    • N
      md: call md_update_sb even for 'external' metadata arrays. · 126925c0
      NeilBrown 提交于
      Now that we depend on md_update_sb to clear variable bits in
      mddev->flags (rather than trying not to set them) it is important to
      always call md_update_sb when appropriate.
      
      md_check_recovery has this job but explicitly avoids it for ->external
      metadata arrays.  This is not longer appropraite, or needed.
      
      However we do want to avoid taking the mddev lock if only
      MD_CHANGE_PENDING is set as that is not cleared by md_update_sb for
      external-metadata arrays.
      Reported-by: N"Kwolek, Adam" <adam.kwolek@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      126925c0
  14. 11 9月, 2010 1 次提交
  15. 10 9月, 2010 5 次提交
    • M
      dm: convey that all flushes are processed as empty · b372d360
      Mike Snitzer 提交于
      Rename __clone_and_map_flush to __clone_and_map_empty_flush for added
      clarity.
      
      Simplify logic associated with REQ_FLUSH conditionals.
      
      Introduce a BUG_ON() and add a few more helpful comments to the code
      so that it is clear that all flushes are empty.
      
      Cleanup __split_and_process_bio() so that an empty flush isn't processed
      by a 'sector_count' focused while loop.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      b372d360
    • K
      dm: fix locking context in queue_io() · 05447420
      Kiyoshi Ueda 提交于
      Now queue_io() is called from dec_pending(), which may be called with
      interrupts disabled, so queue_io() must not enable interrupts
      unconditionally and must save/restore the current interrupts status.
      Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      05447420
    • T
      dm: relax ordering of bio-based flush implementation · 6a8736d1
      Tejun Heo 提交于
      Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
      against other bio's.  This patch relaxes ordering around flushes.
      
      * A flush bio is no longer deferred to workqueue directly.  It's
        processed like other bio's but __split_and_process_bio() uses
        md->flush_bio as the clone source.  md->flush_bio is initialized to
        empty flush during md initialization and shared for all flushes.
      
      * As a flush bio now travels through the same execution path as other
        bio's, there's no need for dedicated error handling path either.  It
        can use the same error handling path in dec_pending().  Dedicated
        error handling removed along with md->flush_error.
      
      * When dec_pending() detects that a flush has completed, it checks
        whether the original bio has data.  If so, the bio is queued to the
        deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.
      
      * As flush sequencing is handled in the usual issue/completion path,
        dm_wq_work() no longer needs to handle flushes differently.  Now its
        only responsibility is re-issuing deferred bio's the same way as
        _dm_request() would.  REQ_FLUSH handling logic including
        process_flush() is dropped.
      
      * There's no reason for queue_io() and dm_wq_work() write lock
        dm->io_lock.  queue_io() now only uses md->deferred_lock and
        dm_wq_work() read locks dm->io_lock.
      
      * bio's no longer need to be queued on the deferred list while a flush
        is in progress making DMF_QUEUE_IO_TO_THREAD unncessary.  Drop it.
      
      This avoids stalling the device during flushes and simplifies the
      implementation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      6a8736d1
    • T
      dm: implement REQ_FLUSH/FUA support for request-based dm · 29e4013d
      Tejun Heo 提交于
      This patch converts request-based dm to support the new REQ_FLUSH/FUA.
      
      The original request-based flush implementation depended on
      request_queue blocking other requests while a barrier sequence is in
      progress, which is no longer true for the new REQ_FLUSH/FUA.
      
      In general, request-based dm doesn't have infrastructure for cloning
      one source request to multiple targets, but the original flush
      implementation had a special mostly independent path which can issue
      flushes to multiple targets and sequence them.  However, the
      capability isn't currently in use and adds a lot of complexity.
      Moreoever, it's unlikely to be useful in its current form as it
      doesn't make sense to be able to send out flushes to multiple targets
      when write requests can't be.
      
      This patch rips out special flush code path and deals handles
      REQ_FLUSH/FUA requests the same way as other requests.  The only
      special treatment is that REQ_FLUSH requests use the block address 0
      when finding target, which is enough for now.
      
      * added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
        suggested by Mike Snitzer
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Tested-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      29e4013d
    • T
      dm: implement REQ_FLUSH/FUA support for bio-based dm · d87f4c14
      Tejun Heo 提交于
      This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
      now deprecated REQ_HARDBARRIER.
      
      * -EOPNOTSUPP handling logic dropped.
      
      * Preflush is handled as before but postflush is dropped and replaced
        with passing down REQ_FUA to member request_queues.  This replaces
        one array wide cache flush w/ member specific FUA writes.
      
      * __split_and_process_bio() now calls __clone_and_map_flush() directly
        for flushes and guarantees all FLUSH bio's going to targets are zero
      `  length.
      
      * It's now guaranteed that all FLUSH bio's which are passed onto dm
        targets are zero length.  bio_empty_barrier() tests are replaced
        with REQ_FLUSH tests.
      
      * Empty WRITE_BARRIERs are replaced with WRITE_FLUSHes.
      
      * Dropped unlikely() around REQ_FLUSH tests.  Flushes are not unlikely
        enough to be marked with unlikely().
      
      * Block layer now filters out REQ_FLUSH/FUA bio's if the request_queue
        doesn't support cache flushing.  Advertise REQ_FLUSH | REQ_FUA
        capability.
      
      * Request based dm isn't converted yet.  dm_init_request_based_queue()
        resets flush support to 0 for now.  To avoid disturbing request
        based dm code, dm->flush_error is added for bio based dm while
        requested based dm continues to use dm->barrier_error.
      
      Lightly tested linear, stripe, raid1, snap and crypt targets.  Please
      proceed with caution as I'm not familiar with the code base.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: dm-devel@redhat.com
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      d87f4c14