1. 06 1月, 2009 7 次提交
  2. 29 12月, 2008 1 次提交
    • J
      bio: allow individual slabs in the bio_set · bb799ca0
      Jens Axboe 提交于
      Instead of having a global bio slab cache, add a reference to one
      in each bio_set that is created. This allows for personalized slabs
      in each bio_set, so that they can have bios of different sizes.
      
      This means we can personalize the bios we return. File systems may
      want to embed the bio inside another structure, to avoid allocation
      more items (and stuffing them in ->bi_private) after the get a bio.
      Or we may want to embed a number of bio_vecs directly at the end
      of a bio, to avoid doing two allocations to return a bio. This is now
      possible.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      bb799ca0
  3. 19 12月, 2008 1 次提交
    • N
      md: Don't read past end of bitmap when reading bitmap. · a2ed9615
      NeilBrown 提交于
      When we read the write-intent-bitmap off the device, we currently
      read a whole number of pages.
      When PAGE_SIZE is 4K, this works due to the alignment we enforce
      on the superblock and bitmap.
      When PAGE_SIZE is 64K, this case read past the end-of-device
      which causes an error.
      
      When we write the superblock, we ensure to clip the last page
      to just be the required size.  Copy that code into the read path
      to just read the required number of sectors.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Cc: stable@kernel.org
      a2ed9615
  4. 03 12月, 2008 1 次提交
    • M
      block: fix setting of max_segment_size and seg_boundary mask · 0e435ac2
      Milan Broz 提交于
      Fix setting of max_segment_size and seg_boundary mask for stacked md/dm
      devices.
      
      When stacking devices (LVM over MD over SCSI) some of the request queue
      parameters are not set up correctly in some cases by default, namely
      max_segment_size and and seg_boundary mask.
      
      If you create MD device over SCSI, these attributes are zeroed.
      
      Problem become when there is over this mapping next device-mapper mapping
      - queue attributes are set in DM this way:
      
      request_queue   max_segment_size  seg_boundary_mask
      SCSI                65536             0xffffffff
      MD RAID1                0                      0
      LVM                 65536                 -1 (64bit)
      
      Unfortunately bio_add_page (resp.  bio_phys_segments) calculates number of
      physical segments according to these parameters.
      
      During the generic_make_request() is segment cout recalculated and can
      increase bio->bi_phys_segments count over the allowed limit.  (After
      bio_clone() in stack operation.)
      
      Thi is specially problem in CCISS driver, where it produce OOPS here
      
          BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);
      
      (MAXSEGENTRIES is 31 by default.)
      
      Sometimes even this command is enough to cause oops:
      
        dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10
      
      This command generates bios with 250 sectors, allocated in 32 4k-pages
      (last page uses only 1024 bytes).
      
      For LVM layer, it allocates bio with 31 segments (still OK for CCISS),
      unfortunatelly on lower layer it is recalculated to 32 segments and this
      violates CCISS restriction and triggers BUG_ON().
      
      The patch tries to fix it by:
      
       * initializing attributes above in queue request constructor
         blk_queue_make_request()
      
       * make sure that blk_queue_stack_limits() inherits setting
      
       (DM uses its own function to set the limits because it
       blk_queue_stack_limits() was introduced later.  It should probably switch
       to use generic stack limit function too.)
      
       * sets the default seg_boundary value in one place (blkdev.h)
      
       * use this mask as default in DM (instead of -1, which differs in 64bit)
      
      Bugs related to this:
      https://bugzilla.redhat.com/show_bug.cgi?id=471639
      http://bugzilla.kernel.org/show_bug.cgi?id=8672Signed-off-by: NMilan Broz <mbroz@redhat.com>
      Reviewed-by: NAlasdair G Kergon <agk@redhat.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Mike Miller <mike.miller@hp.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      0e435ac2
  5. 26 11月, 2008 2 次提交
  6. 14 11月, 2008 6 次提交
  7. 06 11月, 2008 3 次提交
    • A
      md: linear: Fix a division by zero bug for very small arrays. · f1cd14ae
      Andre Noll 提交于
      We currently oops with a divide error on starting a linear software
      raid array consisting of at least two very small (< 500K) devices.
      
      The bug is caused by the calculation of the hash table size which
      tries to compute sector_div(sz, base) with "base" being zero due to
      the small size of the component devices of the array.
      
      Fix this by requiring the hash spacing to be at least one which
      implies that also "base" is non-zero.
      
      This bug has existed since about 2.6.14.
      
      Cc: stable@kernel.org
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      f1cd14ae
    • N
      md: fix bug in raid10 recovery. · a53a6c85
      NeilBrown 提交于
      Adding a spare to a raid10 doesn't cause recovery to start.
      This is due to an silly type in
        commit 6c2fce2e
      and so is a bug in 2.6.27 and .28-rc.
      
      Thanks to Thomas Backlund for bisecting to find this.
      
      Cc: Thomas Backlund <tmb@mandriva.org>
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a53a6c85
    • N
      md: revert the recent addition of a call to the BLKRRPART ioctl. · cb3ac42b
      NeilBrown 提交于
      It turns out that it is only safe to call blkdev_ioctl when the device
      is actually open (as ->bd_disk is set to NULL on last close).  And it
      is quite possible for do_md_stop to be called when the device is not
      open.  So discard the call to blkdev_ioctl(BLKRRPART) which was
      added in
         commit 934d9c23
      
      It is just as easy to call this ioctl from userspace when needed (on
      mdadm -S) so leave it out of the kernel
      Signed-off-by: NNeilBrown <neilb@suse.de>
      cb3ac42b
  8. 30 10月, 2008 3 次提交
    • M
      dm snapshot: wait for chunks in destructor · 879129d2
      Mikulas Patocka 提交于
      If there are several snapshots sharing an origin and one is removed
      while the origin is being written to, the snapshot's mempool may get
      deleted while elements are still referenced.
      
      Prior to dm-snapshot-use-per-device-mempools.patch the pending
      exceptions may still have been referenced after the snapshot was
      destroyed, but this was not a problem because the shared mempool
      was still there.
      
      This patch fixes the problem by tracking the number of mempool elements
      in use.
      
      The scenario:
      - You have an origin and two snapshots 1 and 2.
      - Someone writes to the origin.
      - It creates two exceptions in the snapshots, snapshot 1 will be primary
      exception, snapshot 2's pending_exception->primary_pe will point to the
      exception in snapshot 1.
      - The exceptions are being relocated, relocation of exception 1 finishes
      (but it's pending_exception is still allocated, because it is referenced
      by an exception from snapshot 2)
      - The user lvremoves snapshot 1 --- it calls just suspend (does nothing)
      and destructor. md->pending is zero (there is no I/O submitted to the
      snapshot by md layer), so it won't help us.
      - The destructor waits for kcopyd jobs to finish on snapshot 1 --- but
      there are none.
      - The destructor on snapshot 1 cleans up everything.
      - The relocation of exception on snapshot 2 finishes, it drops reference
      on primary_pe. This frees its primary_pe pointer. Primary_pe points to
      pending exception created for snapshot 1. So it frees memory into
      non-existing mempool.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      879129d2
    • M
      dm snapshot: fix register_snapshot deadlock · 60c856c8
      Mikulas Patocka 提交于
      register_snapshot() performs a GFP_KERNEL allocation while holding
      _origins_lock for write, but that could write out dirty pages onto a
      device that attempts to acquire _origins_lock for read, resulting in
      deadlock.
      
      So move the allocation up before taking the lock.
      
      This path is not performance-critical, so it doesn't matter that we
      allocate memory and free it if we find that we won't need it.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      60c856c8
    • I
      dm raid1: fix do_failures · b34578a4
      Ilpo Jarvinen 提交于
      Missing braces.  Commit 1f965b19 (dm raid1: separate region_hash interface
      part1) broke it.
      Signed-off-by: NIlpo Jarvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      Cc: Heinz Mauelshagen <hjm@redhat.com>
      b34578a4
  9. 28 10月, 2008 1 次提交
    • N
      md: destroy partitions and notify udev when md array is stopped. · 934d9c23
      NeilBrown 提交于
      md arrays are not currently destroyed when they are stopped - they
      remain in /sys/block.  Last time I tried this I tripped over locking
      too much.
      
      A consequence of this is that udev doesn't remove anything from /dev.
      This is rather ugly.
      
      As an interim measure until proper device removal can be achieved,
      make sure all partitions are removed using the BLKRRPART ioctl, and
      send a KOBJ_CHANGE when an md array is stopped.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      934d9c23
  10. 23 10月, 2008 1 次提交
  11. 22 10月, 2008 14 次提交
    • K
      dm: tidy local_init · 51157b4a
      Kiyoshi Ueda 提交于
      This patch tidies local_init() in preparation for request-based dm.
      No functional change.
      Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      51157b4a
    • K
      dm: remove unused flush_all · f431d966
      Kiyoshi Ueda 提交于
      This patch removes the DM_WQ_FLUSH_ALL state that is unnecessary.
      
      The dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL) in dm_suspend()
      is never invoked because:
        - 'goto flush_and_out' is the same as 'goto out' because
          the 'goto flush_and_out' is called only when '!noflush'
        - If r is non-zero, then the code above will invoke 'goto out'
          and skip this code.
      
      No functional change.
      Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: NMilan Broz <mbroz@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      f431d966
    • H
      dm raid1: separate region_hash interface part1 · 1f965b19
      Heinz Mauelshagen 提交于
      Separate the region hash code from raid1 so it can be shared by forthcoming
      targets.  Use BUG_ON() for failed async dm_io() calls.
      Signed-off-by: NHeinz Mauelshagen <hjm@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      1f965b19
    • M
      dm: mark split bio as cloned · f3e1d26e
      Martin K. Petersen 提交于
      When a bio gets split, mark its fragments with the BIO_CLONED flag.
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      f3e1d26e
    • M
      dm crypt: remove waitqueue · 0a4a1047
      Milan Broz 提交于
      Remove waitqueue no longer needed with the async crypto interface.
      Signed-off-by: NMilan Broz <mbroz@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      0a4a1047
    • M
      dm crypt: fix async split · 393b47ef
      Milan Broz 提交于
      When writing io, dm-crypt has to allocate a new cloned bio
      and encrypt the data into newly-allocated pages attached to this bio.
      In rare cases, because of hw restrictions (e.g. physical segment limit)
      or memory pressure, sometimes more than one cloned bio has to be used,
      each processing a different fragment of the original.
      
      Currently there is one waitqueue which waits for one fragment to finish
      and continues processing the next fragment.
      
      But when using asynchronous crypto this doesn't work, because several
      fragments may be processed asynchronously or in parallel and there is
      only one crypt context that cannot be shared between the bio fragments.
      The result may be corruption of the data contained in the encrypted bio.
      
      The patch fixes this by allocating new dm_crypt_io structs (with new
      crypto contexts) and running them independently.
      
      The fragments contains a pointer to the base dm_crypt_io struct to
      handle reference counting, so the base one is properly deallocated
      after all the fragments are finished.
      
      In a low memory situation, this only uses one additional object from the
      mempool.  If the mempool is empty, the next allocation simple waits for
      previous fragments to complete.
      Signed-off-by: NMilan Broz <mbroz@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      393b47ef
    • M
      dm crypt: tidy sector · b635b00e
      Milan Broz 提交于
      Prepare local sector variable (offset) for later patch.
      Do not update io->sector for still-running I/O.
      
      No functional change.
      Signed-off-by: NMilan Broz <mbroz@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      b635b00e
    • M
      dm: remove dm header from targets · 586e80e6
      Mikulas Patocka 提交于
      Change #include "dm.h" to #include <linux/device-mapper.h> in all targets.
      Targets should not need direct access to internal DM structures.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      586e80e6
    • M
      dm: publish array_too_big · d63a5ce3
      Mikulas Patocka 提交于
      Move array_too_big to include/linux/device-mapper.h because it is
      used by targets.
      
      Remove the test from dm-raid1 as the number of mirror legs is limited
      such that it can never fail.  (Even for stripes it seems rather
      unlikely.)
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      d63a5ce3
    • M
      dm exception store: fix misordered writes · 7acedc5b
      Mikulas Patocka 提交于
      We must zero the next chunk on disk *before* writing out the current chunk, not
      after.  Otherwise if the machine crashes at the wrong time, the "end of metadata"
      marker may be missing.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      7acedc5b
    • A
      dm exception store: refactor zero_area · 7c9e6c17
      Alasdair G Kergon 提交于
      Use a separate buffer for writing zeroes to the on-disk snapshot
      exception store, make the updating of ps->current_area explicit and
      refactor the code in preparation for the fix in the next patch.
      
      No functional change.
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
      7c9e6c17
    • M
      dm snapshot: drop unused last_percent · f68d4f3d
      Mikulas Patocka 提交于
      The last_percent field is unused - remove it.
      (It dates from when events were triggered as each X% filled up.)
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      f68d4f3d
    • M
      dm snapshot: fix primary_pe race · 7c5f78b9
      Mikulas Patocka 提交于
      Fix a race condition with primary_pe ref_count handling.
      
      put_pending_exception runs under dm_snapshot->lock, it does atomic_dec_and_test
      on primary_pe->ref_count, and later does atomic_read primary_pe->ref_count.
      
      __origin_write does atomic_dec_and_test on primary_pe->ref_count without holding
      dm_snapshot->lock.
      
      This opens the following race condition:
      Assume two CPUs, CPU1 is executing put_pending_exception (and holding
      dm_snapshot->lock). CPU2 is executing __origin_write in parallel.
      primary_pe->ref_count == 2.
      
      CPU1:
      if (primary_pe && atomic_dec_and_test(&primary_pe->ref_count))
      	origin_bios = bio_list_get(&primary_pe->origin_bios);
      ... decrements primary_pe->ref_count to 1. Doesn't load origin_bios
      
      CPU2:
      if (first && atomic_dec_and_test(&primary_pe->ref_count)) {
      	flush_bios(bio_list_get(&primary_pe->origin_bios));
      	free_pending_exception(primary_pe);
      	/* If we got here, pe_queue is necessarily empty. */
      	return r;
      }
      ... decrements primary_pe->ref_count to 0, submits pending bios, frees
      primary_pe.
      
      CPU1:
      if (!primary_pe || primary_pe != pe)
      	free_pending_exception(pe);
      ... this has no effect.
      if (primary_pe && !atomic_read(&primary_pe->ref_count))
      	free_pending_exception(primary_pe);
      ... sees ref_count == 0 (written by CPU 2), does double free !!
      
      This bug can happen only if someone is simultaneously writing to both the
      origin and the snapshot.
      
      If someone is writing only to the origin, __origin_write will submit kcopyd
      request after it decrements primary_pe->ref_count (so it can't happen that the
      finished copy races with primary_pe->ref_count decrementation).
      
      If someone is writing only to the snapshot, __origin_write isn't invoked at all
      and the race can't happen.
      
      The race happens when someone writes to the snapshot --- this creates
      pending_exception with primary_pe == NULL and starts copying. Then, someone
      writes to the same chunk in the snapshot, and __origin_write races with
      termination of already submitted request in pending_complete (that calls
      put_pending_exception).
      
      This race may be reason for bugs:
        http://bugzilla.kernel.org/show_bug.cgi?id=11636
        https://bugzilla.redhat.com/show_bug.cgi?id=465825
      
      The patch fixes the code to make sure that:
      1. If atomic_dec_and_test(&primary_pe->ref_count) returns false, the process
      must no longer dereference primary_pe (because someone else may free it under
      us).
      2. If atomic_dec_and_test(&primary_pe->ref_count) returns true, the process
      is responsible for freeing primary_pe.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      7c5f78b9
    • K
      dm kcopyd: avoid queue shuffle · b673c3a8
      Kazuo Ito 提交于
      Write throughput to LVM snapshot origin volume is an order
      of magnitude slower than those to LV without snapshots or
      snapshot target volumes, especially in the case of sequential
      writes with O_SYNC on.
      
      The following patch originally written by Kevin Jamieson and
      Jan Blunck and slightly modified for the current RCs by myself
      tries to improve the performance by modifying the behaviour
      of kcopyd, so that it pushes back an I/O job to the head of
      the job queue instead of the tail as process_jobs() currently
      does when it has to wait for free pages. This way, write
      requests aren't shuffled to cause extra seeks.
      
      I tested the patch against 2.6.27-rc5 and got the following results.
      The test is a dd command writing to snapshot origin followed by fsync
      to the file just created/updated.  A couple of filesystem benchmarks
      gave me similar results in case of sequential writes, while random
      writes didn't suffer much.
      
      dd if=/dev/zero of=<somewhere on snapshot origin> bs=4096 count=...
         [conv=notrunc when updating]
      
      1) linux 2.6.27-rc5 without the patch, write to snapshot origin,
      average throughput (MB/s)
                           10M     100M    1000M
      create,dd         511.46   610.72    11.81
      create,dd+fsync     7.10     6.77     8.13
      update,dd         431.63   917.41    12.75
      update,dd+fsync     7.79     7.43     8.12
      
      compared with write throughput to LV without any snapshots,
      all dd+fsync and 1000 MiB writes perform very poorly.
      
                           10M     100M    1000M
      create,dd         555.03   608.98   123.29
      create,dd+fsync   114.27    72.78    76.65
      update,dd         152.34  1267.27   124.04
      update,dd+fsync   130.56    77.81    77.84
      
      2) linux 2.6.27-rc5 with the patch, write to snapshot origin,
      average throughput (MB/s)
      
                           10M     100M    1000M
      create,dd         537.06   589.44    46.21
      create,dd+fsync    31.63    29.19    29.23
      update,dd         487.59   897.65    37.76
      update,dd+fsync    34.12    30.07    26.85
      
      Although still not on par with plain LV performance -
      cannot be avoided because it's copy on write anyway -
      this simple patch successfully improves throughtput
      of dd+fsync while not affecting the rest.
      Signed-off-by: NJan Blunck <jblunck@suse.de>
      Signed-off-by: NKazuo Ito <ito.kazuo@oss.ntt.co.jp>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      b673c3a8