1. 06 Nov 2008 (3 commits)
    • md: linear: Fix a division by zero bug for very small arrays. · f1cd14ae
      Authored by Andre Noll
      We currently oops with a divide error on starting a linear software
      raid array consisting of at least two very small (< 500K) devices.
      
      The bug is caused by the calculation of the hash table size which
      tries to compute sector_div(sz, base) with "base" being zero due to
      the small size of the component devices of the array.
      
      Fix this by requiring the hash spacing to be at least one, which
      implies that "base" is also non-zero.
      
      This bug has existed since about 2.6.14.
      
      Cc: stable@kernel.org
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: fix bug in raid10 recovery. · a53a6c85
      Authored by NeilBrown
      Adding a spare to a raid10 doesn't cause recovery to start.
      This is due to a silly typo in
        commit 6c2fce2e
      and so is a bug in 2.6.27 and .28-rc.
      
      Thanks to Thomas Backlund for bisecting to find this.
      
      Cc: Thomas Backlund <tmb@mandriva.org>
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: revert the recent addition of a call to the BLKRRPART ioctl. · cb3ac42b
      Authored by NeilBrown
      It turns out that it is only safe to call blkdev_ioctl when the device
      is actually open (as ->bd_disk is set to NULL on last close).  And it
      is quite possible for do_md_stop to be called when the device is not
      open.  So discard the call to blkdev_ioctl(BLKRRPART) which was
      added in
         commit 934d9c23
      
      It is just as easy to call this ioctl from userspace when needed (on
      mdadm -S), so leave it out of the kernel.
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 30 Oct 2008 (3 commits)
    • dm snapshot: wait for chunks in destructor · 879129d2
      Authored by Mikulas Patocka
      If there are several snapshots sharing an origin and one is removed
      while the origin is being written to, the snapshot's mempool may get
      deleted while elements are still referenced.
      
      Prior to dm-snapshot-use-per-device-mempools.patch the pending
      exceptions may still have been referenced after the snapshot was
      destroyed, but this was not a problem because the shared mempool
      was still there.
      
      This patch fixes the problem by tracking the number of mempool elements
      in use.
      
      The scenario:
      - You have an origin and two snapshots 1 and 2.
      - Someone writes to the origin.
      - It creates two exceptions in the snapshots, snapshot 1 will be primary
      exception, snapshot 2's pending_exception->primary_pe will point to the
      exception in snapshot 1.
      - The exceptions are being relocated; relocation of exception 1 finishes
      (but its pending_exception is still allocated, because it is referenced
      by an exception from snapshot 2).
      - The user lvremoves snapshot 1 --- this just calls suspend (which does
      nothing) and the destructor. md->pending is zero (there is no I/O
      submitted to the snapshot by the md layer), so it won't help us.
      - The destructor waits for kcopyd jobs to finish on snapshot 1 --- but
      there are none.
      - The destructor on snapshot 1 cleans up everything.
      - The relocation of the exception on snapshot 2 finishes and drops its
      reference on primary_pe, freeing it. primary_pe points to the pending
      exception created for snapshot 1, so it frees memory into a
      no-longer-existing mempool.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm snapshot: fix register_snapshot deadlock · 60c856c8
      Authored by Mikulas Patocka
      register_snapshot() performs a GFP_KERNEL allocation while holding
      _origins_lock for write, but that could write out dirty pages onto a
      device that attempts to acquire _origins_lock for read, resulting in
      deadlock.
      
      So move the allocation up before taking the lock.
      
      This path is not performance-critical, so it doesn't matter that we
      allocate memory and free it if we find that we won't need it.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm raid1: fix do_failures · b34578a4
      Authored by Ilpo Jarvinen
      Missing braces.  Commit 1f965b19 (dm raid1: separate region_hash interface
      part1) broke it.
      Signed-off-by: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: Heinz Mauelshagen <hjm@redhat.com>
  3. 28 Oct 2008 (1 commit)
    • md: destroy partitions and notify udev when md array is stopped. · 934d9c23
      Authored by NeilBrown
      md arrays are not currently destroyed when they are stopped - they
      remain in /sys/block.  Last time I tried this I tripped over locking
      too much.
      
      A consequence of this is that udev doesn't remove anything from /dev.
      This is rather ugly.
      
      As an interim measure until proper device removal can be achieved,
      make sure all partitions are removed using the BLKRRPART ioctl, and
      send a KOBJ_CHANGE uevent when an md array is stopped.
      Signed-off-by: NeilBrown <neilb@suse.de>
  4. 23 Oct 2008 (1 commit)
  5. 22 Oct 2008 (14 commits)
    • dm: tidy local_init · 51157b4a
      Authored by Kiyoshi Ueda
      This patch tidies local_init() in preparation for request-based dm.
      No functional change.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: remove unused flush_all · f431d966
      Authored by Kiyoshi Ueda
      This patch removes the unnecessary DM_WQ_FLUSH_ALL state.
      
      The dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL) in dm_suspend()
      is never invoked because:
        - 'goto flush_and_out' is the same as 'goto out' because
          the 'goto flush_and_out' is called only when '!noflush'
        - If r is non-zero, then the code above will invoke 'goto out'
          and skip this code.
      
      No functional change.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm raid1: separate region_hash interface part1 · 1f965b19
      Authored by Heinz Mauelshagen
      Separate the region hash code from raid1 so it can be shared by forthcoming
      targets.  Use BUG_ON() for failed async dm_io() calls.
      Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: mark split bio as cloned · f3e1d26e
      Authored by Martin K. Petersen
      When a bio gets split, mark its fragments with the BIO_CLONED flag.
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm crypt: remove waitqueue · 0a4a1047
      Authored by Milan Broz
      Remove the waitqueue, which is no longer needed with the async crypto
      interface.
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm crypt: fix async split · 393b47ef
      Authored by Milan Broz
      When writing I/O, dm-crypt has to allocate a new cloned bio
      and encrypt the data into newly-allocated pages attached to this bio.
      In rare cases, because of hardware restrictions (e.g. a physical
      segment limit) or memory pressure, more than one cloned bio has to be
      used, each processing a different fragment of the original.
      
      Currently there is one waitqueue which waits for one fragment to finish
      and continues processing the next fragment.
      
      But when using asynchronous crypto this doesn't work, because several
      fragments may be processed asynchronously or in parallel and there is
      only one crypt context that cannot be shared between the bio fragments.
      The result may be corruption of the data contained in the encrypted bio.
      
      The patch fixes this by allocating new dm_crypt_io structs (with new
      crypto contexts) and running them independently.
      
      Each fragment contains a pointer to the base dm_crypt_io struct for
      reference counting, so the base struct is properly deallocated only
      after all the fragments have finished.
      
      In a low-memory situation, this only uses one additional object from
      the mempool.  If the mempool is empty, the next allocation simply
      waits for previous fragments to complete.
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm crypt: tidy sector · b635b00e
      Authored by Milan Broz
      Prepare a local sector variable (offset) for a later patch.
      Do not update io->sector for still-running I/O.
      
      No functional change.
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: remove dm header from targets · 586e80e6
      Authored by Mikulas Patocka
      Change #include "dm.h" to #include <linux/device-mapper.h> in all targets.
      Targets should not need direct access to internal DM structures.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: publish array_too_big · d63a5ce3
      Authored by Mikulas Patocka
      Move array_too_big to include/linux/device-mapper.h because it is
      used by targets.
      
      Remove the test from dm-raid1 as the number of mirror legs is limited
      such that it can never fail.  (Even for stripes it seems rather
      unlikely.)
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm exception store: fix misordered writes · 7acedc5b
      Authored by Mikulas Patocka
      We must zero the next chunk on disk *before* writing out the current chunk, not
      after.  Otherwise if the machine crashes at the wrong time, the "end of metadata"
      marker may be missing.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
    • dm exception store: refactor zero_area · 7c9e6c17
      Authored by Alasdair G Kergon
      Use a separate buffer for writing zeroes to the on-disk snapshot
      exception store, make the updating of ps->current_area explicit and
      refactor the code in preparation for the fix in the next patch.
      
      No functional change.
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
    • dm snapshot: drop unused last_percent · f68d4f3d
      Authored by Mikulas Patocka
      The last_percent field is unused - remove it.
      (It dates from when events were triggered as each X% filled up.)
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm snapshot: fix primary_pe race · 7c5f78b9
      Authored by Mikulas Patocka
      Fix a race condition with primary_pe ref_count handling.
      
      put_pending_exception runs under dm_snapshot->lock: it does
      atomic_dec_and_test on primary_pe->ref_count, and later does an
      atomic_read of primary_pe->ref_count.
      
      __origin_write does atomic_dec_and_test on primary_pe->ref_count
      without holding dm_snapshot->lock.
      
      This opens the following race condition:
      Assume two CPUs, CPU1 is executing put_pending_exception (and holding
      dm_snapshot->lock). CPU2 is executing __origin_write in parallel.
      primary_pe->ref_count == 2.
      
      CPU1:
      if (primary_pe && atomic_dec_and_test(&primary_pe->ref_count))
      	origin_bios = bio_list_get(&primary_pe->origin_bios);
      ... decrements primary_pe->ref_count to 1, but doesn't load origin_bios.
      
      CPU2:
      if (first && atomic_dec_and_test(&primary_pe->ref_count)) {
      	flush_bios(bio_list_get(&primary_pe->origin_bios));
      	free_pending_exception(primary_pe);
      	/* If we got here, pe_queue is necessarily empty. */
      	return r;
      }
      ... decrements primary_pe->ref_count to 0, submits pending bios, frees
      primary_pe.
      
      CPU1:
      if (!primary_pe || primary_pe != pe)
      	free_pending_exception(pe);
      ... this has no effect.
      if (primary_pe && !atomic_read(&primary_pe->ref_count))
      	free_pending_exception(primary_pe);
      ... sees ref_count == 0 (written by CPU 2) and performs a double free!
      
      This bug can happen only if someone is simultaneously writing to both the
      origin and the snapshot.
      
      If someone is writing only to the origin, __origin_write will submit
      the kcopyd request after it decrements primary_pe->ref_count (so the
      finished copy cannot race with the ref_count decrement).
      
      If someone is writing only to the snapshot, __origin_write isn't invoked at all
      and the race can't happen.
      
      The race happens when someone writes to the snapshot --- this creates
      pending_exception with primary_pe == NULL and starts copying. Then, someone
      writes to the same chunk in the snapshot, and __origin_write races with
      termination of already submitted request in pending_complete (that calls
      put_pending_exception).
      
      This race may be the reason for these bugs:
        http://bugzilla.kernel.org/show_bug.cgi?id=11636
        https://bugzilla.redhat.com/show_bug.cgi?id=465825
      
      The patch fixes the code to make sure that:
      1. If atomic_dec_and_test(&primary_pe->ref_count) returns false, the process
      must no longer dereference primary_pe (because someone else may free it under
      us).
      2. If atomic_dec_and_test(&primary_pe->ref_count) returns true, the process
      is responsible for freeing primary_pe.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
    • dm kcopyd: avoid queue shuffle · b673c3a8
      Authored by Kazuo Ito
      Write throughput to an LVM snapshot origin volume is an order
      of magnitude slower than to an LV without snapshots or to
      snapshot target volumes, especially in the case of sequential
      writes with O_SYNC on.
      
      The following patch, originally written by Kevin Jamieson and
      Jan Blunck and slightly modified for the current RCs by myself,
      tries to improve the performance by modifying the behaviour
      of kcopyd so that it pushes an I/O job back to the head of
      the job queue, instead of the tail as process_jobs() currently
      does, when it has to wait for free pages. This way, write
      requests aren't shuffled to cause extra seeks.
      
      I tested the patch against 2.6.27-rc5 and got the following results.
      The test is a dd command writing to snapshot origin followed by fsync
      to the file just created/updated.  A couple of filesystem benchmarks
      gave me similar results in case of sequential writes, while random
      writes didn't suffer much.
      
      dd if=/dev/zero of=<somewhere on snapshot origin> bs=4096 count=...
         [conv=notrunc when updating]
      
      1) linux 2.6.27-rc5 without the patch, write to snapshot origin,
      average throughput (MB/s)
                           10M     100M    1000M
      create,dd         511.46   610.72    11.81
      create,dd+fsync     7.10     6.77     8.13
      update,dd         431.63   917.41    12.75
      update,dd+fsync     7.79     7.43     8.12
      
      compared with write throughput to LV without any snapshots,
      all dd+fsync and 1000 MiB writes perform very poorly.
      
                           10M     100M    1000M
      create,dd         555.03   608.98   123.29
      create,dd+fsync   114.27    72.78    76.65
      update,dd         152.34  1267.27   124.04
      update,dd+fsync   130.56    77.81    77.84
      
      2) linux 2.6.27-rc5 with the patch, write to snapshot origin,
      average throughput (MB/s)
      
                           10M     100M    1000M
      create,dd         537.06   589.44    46.21
      create,dd+fsync    31.63    29.19    29.23
      update,dd         487.59   897.65    37.76
      update,dd+fsync    34.12    30.07    26.85
      
      Although still not on par with plain LV performance -
      which cannot be avoided because it is copy-on-write anyway -
      this simple patch successfully improves the throughput
      of dd+fsync while not affecting the rest.
      Signed-off-by: Jan Blunck <jblunck@suse.de>
      Signed-off-by: Kazuo Ito <ito.kazuo@oss.ntt.co.jp>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
  6. 21 Oct 2008 (10 commits)
  7. 16 Oct 2008 (3 commits)
  8. 15 Oct 2008 (1 commit)
    • md: build failure due to missing delay.h · 25570727
      Authored by Stephen Rothwell
      Today's linux-next build (powerpc ppc64_defconfig) failed like this:
      
      drivers/md/raid1.c: In function 'sync_request':
      drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible'
      make[3]: *** [drivers/md/raid1.o] Error 1
      make[3]: *** Waiting for unfinished jobs....
      drivers/md/raid10.c: In function 'sync_request':
      drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible'
      make[3]: *** [drivers/md/raid10.o] Error 1
      drivers/md/md.c: In function 'md_do_sync':
      drivers/md/md.c:5915: error: implicit declaration of function 'msleep'
      
      Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove
      unnecessary #includes, #defines, and function declarations").  I added
      the following patch.
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NeilBrown <neilb@suse.de>
  9. 13 Oct 2008 (4 commits)
    • [SCSI] block: separate failfast into multiple bits. · 6000a368
      Authored by Mike Christie
      Multipath is best at handling transport errors. If it gets a device
      error then there is not much the multipath layer can do. It will just
      access the same device but from a different path.
      
      This patch breaks up failfast into device, transport and driver errors.
      The multipath layers (md and dm multipath) only ask the lower levels to
      fast-fail transport errors. The user of failfast, readahead, will ask
      to fast-fail on all errors.
      
      Note that blk_noretry_request will return true if any failfast bit
      is set. This allows drivers that do not support the multipath failfast
      bits to continue to fail on any failfast error like before. Drivers
      like scsi that are able to fail fast specific errors can check
      for the specific fail fast type. In the next patch I will convert
      scsi.
      Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
    • md: Relax minimum size restrictions on chunk_size. · 4bbf3771
      Authored by NeilBrown
      Currently, the 'chunk_size' of an array must be at least PAGE_SIZE.
      
      This means that moving an array to a machine with a larger PAGE_SIZE,
      or changing the kernel to use a larger PAGE_SIZE, can stop the array
      from working.
      
      For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync
      process works on whole pages at a time, and assumes them to be wholly
      within a stripe.  For other raid personalities, this restriction is
      not needed at all and can be dropped.
      
      So remove the test on chunk_size from common code, and add it in just
      the places where it is needed: raid10 and raid4/5/6.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: remove space after function name in declaration and call. · d710e138
      Authored by NeilBrown
      Having
         function (args)
      instead of
         function(args)
      
      makes it harder to search for calls of particular functions.
      So remove all those spaces.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Remove unnecessary #includes, #defines, and function declarations. · fb4d8c76
      Authored by NeilBrown
      A lot of cruft has gathered over the years.  Time to remove it.
      Signed-off-by: NeilBrown <neilb@suse.de>