- 06 1月, 2009 7 次提交
-
-
由 Milan Broz 提交于
Move log size validation from mirror target to log constructor. Removed PAGE_SIZE restriction we no longer think necessary. Signed-off-by: NMilan Broz <mbroz@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Takahiro Yasui 提交于
rw_header function updates three members of io_req data every time when I/O is processed. bi_rw and notify.fn are never modified once they get initialized, and so they can be set in advance. header_to_disk() can also be pulled out of write_header() since only one caller needs it and write_header() can be replaced by rw_header() directly. Signed-off-by: NTakahiro Yasui <tyasui@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
Change dm_unregister_target to return void and use BUG() for error reporting. dm_unregister_target can only fail because of programming bug in the target driver. It can't fail because of user's behavior or disk errors. This patch changes unregister_target to return void and use BUG if someone tries to unregister non-registered target or unregister target that is in use. This patch removes code duplication (testing of error codes in all dm targets) and reports bugs in just one place, in dm_unregister_target. In some target drivers, these return codes were ignored, which could lead to a situation where bugs could be missed. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Jonathan Brassow 提交于
Always increase the error count when I/O on a leg of a mirror fails. The error count is used to decide whether to select an alternative mirror leg. If the target doesn't use the "handle_errors" feature, the error count is not updated and the bio can get requeued forever by the read callback. Fix it by increasing error_count before the handle_errors feature checking. Cc: stable@kernel.org Signed-off-by: NMilan Broz <mbroz@redhat.com> Signed-off-by: NJonathan Brassow <jbrassow@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Takahiro Yasui 提交于
In create_log_context function, dm_io_client_destroy function needs to be called, when memory allocation of disk_header, sync_bits and recovering_bits failed, but dm_io_client_destroy is not called. Cc: stable@kernel.org Signed-off-by: NTakahiro Yasui <tyasui@redhat.com> Acked-by: NJonathan Brassow <jbrassow@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
Change yield() to msleep(1). If the thread had realtime priority, yield() doesn't really yield, so the yielding process would loop indefinitely and cause machine lockup. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
Move one dm_table_put() so that the last reference in the thread gets dropped in __unbind(). This is required for a following patch, dm-table-rework-reference-counting.patch, which will change the logic in such a way that table destructor is called only at specific points in the code. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
- 29 12月, 2008 1 次提交
-
-
由 Jens Axboe 提交于
Instead of having a global bio slab cache, add a reference to one in each bio_set that is created. This allows for personalized slabs in each bio_set, so that they can have bios of different sizes. This means we can personalize the bios we return. File systems may want to embed the bio inside another structure, to avoid allocation more items (and stuffing them in ->bi_private) after the get a bio. Or we may want to embed a number of bio_vecs directly at the end of a bio, to avoid doing two allocations to return a bio. This is now possible. Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
- 19 12月, 2008 1 次提交
-
-
由 NeilBrown 提交于
When we read the write-intent-bitmap off the device, we currently read a whole number of pages. When PAGE_SIZE is 4K, this works due to the alignment we enforce on the superblock and bitmap. When PAGE_SIZE is 64K, this case read past the end-of-device which causes an error. When we write the superblock, we ensure to clip the last page to just be the required size. Copy that code into the read path to just read the required number of sectors. Signed-off-by: NNeil Brown <neilb@suse.de> Cc: stable@kernel.org
-
- 03 12月, 2008 1 次提交
-
-
由 Milan Broz 提交于
Fix setting of max_segment_size and seg_boundary mask for stacked md/dm devices. When stacking devices (LVM over MD over SCSI) some of the request queue parameters are not set up correctly in some cases by default, namely max_segment_size and and seg_boundary mask. If you create MD device over SCSI, these attributes are zeroed. Problem become when there is over this mapping next device-mapper mapping - queue attributes are set in DM this way: request_queue max_segment_size seg_boundary_mask SCSI 65536 0xffffffff MD RAID1 0 0 LVM 65536 -1 (64bit) Unfortunately bio_add_page (resp. bio_phys_segments) calculates number of physical segments according to these parameters. During the generic_make_request() is segment cout recalculated and can increase bio->bi_phys_segments count over the allowed limit. (After bio_clone() in stack operation.) Thi is specially problem in CCISS driver, where it produce OOPS here BUG_ON(creq->nr_phys_segments > MAXSGENTRIES); (MAXSEGENTRIES is 31 by default.) Sometimes even this command is enough to cause oops: dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10 This command generates bios with 250 sectors, allocated in 32 4k-pages (last page uses only 1024 bytes). For LVM layer, it allocates bio with 31 segments (still OK for CCISS), unfortunatelly on lower layer it is recalculated to 32 segments and this violates CCISS restriction and triggers BUG_ON(). The patch tries to fix it by: * initializing attributes above in queue request constructor blk_queue_make_request() * make sure that blk_queue_stack_limits() inherits setting (DM uses its own function to set the limits because it blk_queue_stack_limits() was introduced later. It should probably switch to use generic stack limit function too.) * sets the default seg_boundary value in one place (blkdev.h) * use this mask as default in DM (instead of -1, which differs in 64bit) Bugs related to this: https://bugzilla.redhat.com/show_bug.cgi?id=471639 http://bugzilla.kernel.org/show_bug.cgi?id=8672Signed-off-by: NMilan Broz <mbroz@redhat.com> Reviewed-by: NAlasdair G Kergon <agk@redhat.com> Cc: Neil Brown <neilb@suse.de> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Tejun Heo <htejun@gmail.com> Cc: Mike Miller <mike.miller@hp.com> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
- 26 11月, 2008 2 次提交
-
-
由 Ingo Molnar 提交于
Port to the new tracepoints API: split DEFINE_TRACE() and DECLARE_TRACE() sites. Spread them out to the usage sites, as suggested by Mathieu Desnoyers. Signed-off-by: NIngo Molnar <mingo@elte.hu> Acked-by: NMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
-
由 Arnaldo Carvalho de Melo 提交于
This was a forward port of work done by Mathieu Desnoyers, I changed it to encode the 'what' parameter on the tracepoint name, so that one can register interest in specific events and not on classes of events to then check the 'what' parameter. Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: NJens Axboe <jens.axboe@oracle.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 14 11月, 2008 6 次提交
-
-
由 Chandra Seetharaman 提交于
dm_any_congested() just checks for the DMF_BLOCK_IO and has no code to make sure that suspend waits for dm_any_congested() to complete. This patch adds such a check. Without it, a race can occur with dm_table_put() attempting to destroying the table in the wrong thread, the one running dm_any_congested() which is meant to be quick and return immediately. Two examples of problems: 1. Sleeping functions called from congested code, the caller of which holds a spin lock. 2. An ABBA deadlock between pdflush and multipathd. The two locks in contention are inode lock and kernel lock. Signed-off-by: NChandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
This doesn't fix any bug, just moves wake_up immediately after decrementing md->pending, for better code readability. It must be clear to anyone manipulating md->pending to wake up the queue if md->pending reaches zero, so move the wakeup as close to the decrementing as possible. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Chandra Seetharaman 提交于
Currently dm ignores the parameters provided to hardware handlers without providing any notifications to the user. This patch just prints a warning message so that the user knows that the arguments are ignored. Signed-off-by: NChandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Chandra Seetharaman 提交于
Path activation code is called even when the pgpath is NULL. This could lead to a panic in activate_path(). Such a panic is seen in -rt kernel. This problem has been there before the pg_init() was moved to a workqueue. Signed-off-by: NChandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Heinz Mauelshagen 提交于
Don't proceed if dm_stripe_init() fails to register itself as a dm target. Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
We queue work on keventd queue --- so this queue must be flushed in the destructor. Otherwise, keventd could access mirror_set after it was freed. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org
-
- 06 11月, 2008 3 次提交
-
-
由 Andre Noll 提交于
We currently oops with a divide error on starting a linear software raid array consisting of at least two very small (< 500K) devices. The bug is caused by the calculation of the hash table size which tries to compute sector_div(sz, base) with "base" being zero due to the small size of the component devices of the array. Fix this by requiring the hash spacing to be at least one which implies that also "base" is non-zero. This bug has existed since about 2.6.14. Cc: stable@kernel.org Signed-off-by: NAndre Noll <maan@systemlinux.org> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Adding a spare to a raid10 doesn't cause recovery to start. This is due to an silly type in commit 6c2fce2e and so is a bug in 2.6.27 and .28-rc. Thanks to Thomas Backlund for bisecting to find this. Cc: Thomas Backlund <tmb@mandriva.org> Cc: stable@kernel.org Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
It turns out that it is only safe to call blkdev_ioctl when the device is actually open (as ->bd_disk is set to NULL on last close). And it is quite possible for do_md_stop to be called when the device is not open. So discard the call to blkdev_ioctl(BLKRRPART) which was added in commit 934d9c23 It is just as easy to call this ioctl from userspace when needed (on mdadm -S) so leave it out of the kernel Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 30 10月, 2008 3 次提交
-
-
由 Mikulas Patocka 提交于
If there are several snapshots sharing an origin and one is removed while the origin is being written to, the snapshot's mempool may get deleted while elements are still referenced. Prior to dm-snapshot-use-per-device-mempools.patch the pending exceptions may still have been referenced after the snapshot was destroyed, but this was not a problem because the shared mempool was still there. This patch fixes the problem by tracking the number of mempool elements in use. The scenario: - You have an origin and two snapshots 1 and 2. - Someone writes to the origin. - It creates two exceptions in the snapshots, snapshot 1 will be primary exception, snapshot 2's pending_exception->primary_pe will point to the exception in snapshot 1. - The exceptions are being relocated, relocation of exception 1 finishes (but it's pending_exception is still allocated, because it is referenced by an exception from snapshot 2) - The user lvremoves snapshot 1 --- it calls just suspend (does nothing) and destructor. md->pending is zero (there is no I/O submitted to the snapshot by md layer), so it won't help us. - The destructor waits for kcopyd jobs to finish on snapshot 1 --- but there are none. - The destructor on snapshot 1 cleans up everything. - The relocation of exception on snapshot 2 finishes, it drops reference on primary_pe. This frees its primary_pe pointer. Primary_pe points to pending exception created for snapshot 1. So it frees memory into non-existing mempool. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
register_snapshot() performs a GFP_KERNEL allocation while holding _origins_lock for write, but that could write out dirty pages onto a device that attempts to acquire _origins_lock for read, resulting in deadlock. So move the allocation up before taking the lock. This path is not performance-critical, so it doesn't matter that we allocate memory and free it if we find that we won't need it. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Ilpo Jarvinen 提交于
Missing braces. Commit 1f965b19 (dm raid1: separate region_hash interface part1) broke it. Signed-off-by: NIlpo Jarvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: NAlasdair G Kergon <agk@redhat.com> Cc: Heinz Mauelshagen <hjm@redhat.com>
-
- 28 10月, 2008 1 次提交
-
-
由 NeilBrown 提交于
md arrays are not currently destroyed when they are stopped - they remain in /sys/block. Last time I tried this I tripped over locking too much. A consequence of this is that udev doesn't remove anything from /dev. This is rather ugly. As an interim measure until proper device removal can be achieved, make sure all partitions are removed using the BLKRRPART ioctl, and send a KOBJ_CHANGE when an md array is stopped. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 23 10月, 2008 1 次提交
-
-
由 Christoph Hellwig 提交于
Now that lookup_bdev is exported and used by dm just use it directly instead of through a trivial wrapper. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 22 10月, 2008 14 次提交
-
-
由 Kiyoshi Ueda 提交于
This patch tidies local_init() in preparation for request-based dm. No functional change. Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Kiyoshi Ueda 提交于
This patch removes the DM_WQ_FLUSH_ALL state that is unnecessary. The dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL) in dm_suspend() is never invoked because: - 'goto flush_and_out' is the same as 'goto out' because the 'goto flush_and_out' is called only when '!noflush' - If r is non-zero, then the code above will invoke 'goto out' and skip this code. No functional change. Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: NMilan Broz <mbroz@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Heinz Mauelshagen 提交于
Separate the region hash code from raid1 so it can be shared by forthcoming targets. Use BUG_ON() for failed async dm_io() calls. Signed-off-by: NHeinz Mauelshagen <hjm@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Martin K. Petersen 提交于
When a bio gets split, mark its fragments with the BIO_CLONED flag. Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Milan Broz 提交于
Remove waitqueue no longer needed with the async crypto interface. Signed-off-by: NMilan Broz <mbroz@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Milan Broz 提交于
When writing io, dm-crypt has to allocate a new cloned bio and encrypt the data into newly-allocated pages attached to this bio. In rare cases, because of hw restrictions (e.g. physical segment limit) or memory pressure, sometimes more than one cloned bio has to be used, each processing a different fragment of the original. Currently there is one waitqueue which waits for one fragment to finish and continues processing the next fragment. But when using asynchronous crypto this doesn't work, because several fragments may be processed asynchronously or in parallel and there is only one crypt context that cannot be shared between the bio fragments. The result may be corruption of the data contained in the encrypted bio. The patch fixes this by allocating new dm_crypt_io structs (with new crypto contexts) and running them independently. The fragments contains a pointer to the base dm_crypt_io struct to handle reference counting, so the base one is properly deallocated after all the fragments are finished. In a low memory situation, this only uses one additional object from the mempool. If the mempool is empty, the next allocation simple waits for previous fragments to complete. Signed-off-by: NMilan Broz <mbroz@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Milan Broz 提交于
Prepare local sector variable (offset) for later patch. Do not update io->sector for still-running I/O. No functional change. Signed-off-by: NMilan Broz <mbroz@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
Change #include "dm.h" to #include <linux/device-mapper.h> in all targets. Targets should not need direct access to internal DM structures. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
Move array_too_big to include/linux/device-mapper.h because it is used by targets. Remove the test from dm-raid1 as the number of mirror legs is limited such that it can never fail. (Even for stripes it seems rather unlikely.) Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
We must zero the next chunk on disk *before* writing out the current chunk, not after. Otherwise if the machine crashes at the wrong time, the "end of metadata" marker may be missing. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org
-
由 Alasdair G Kergon 提交于
Use a separate buffer for writing zeroes to the on-disk snapshot exception store, make the updating of ps->current_area explicit and refactor the code in preparation for the fix in the next patch. No functional change. Signed-off-by: NAlasdair G Kergon <agk@redhat.com> Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Cc: stable@kernel.org
-
由 Mikulas Patocka 提交于
The last_percent field is unused - remove it. (It dates from when events were triggered as each X% filled up.) Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
Fix a race condition with primary_pe ref_count handling. put_pending_exception runs under dm_snapshot->lock, it does atomic_dec_and_test on primary_pe->ref_count, and later does atomic_read primary_pe->ref_count. __origin_write does atomic_dec_and_test on primary_pe->ref_count without holding dm_snapshot->lock. This opens the following race condition: Assume two CPUs, CPU1 is executing put_pending_exception (and holding dm_snapshot->lock). CPU2 is executing __origin_write in parallel. primary_pe->ref_count == 2. CPU1: if (primary_pe && atomic_dec_and_test(&primary_pe->ref_count)) origin_bios = bio_list_get(&primary_pe->origin_bios); ... decrements primary_pe->ref_count to 1. Doesn't load origin_bios CPU2: if (first && atomic_dec_and_test(&primary_pe->ref_count)) { flush_bios(bio_list_get(&primary_pe->origin_bios)); free_pending_exception(primary_pe); /* If we got here, pe_queue is necessarily empty. */ return r; } ... decrements primary_pe->ref_count to 0, submits pending bios, frees primary_pe. CPU1: if (!primary_pe || primary_pe != pe) free_pending_exception(pe); ... this has no effect. if (primary_pe && !atomic_read(&primary_pe->ref_count)) free_pending_exception(primary_pe); ... sees ref_count == 0 (written by CPU 2), does double free !! This bug can happen only if someone is simultaneously writing to both the origin and the snapshot. If someone is writing only to the origin, __origin_write will submit kcopyd request after it decrements primary_pe->ref_count (so it can't happen that the finished copy races with primary_pe->ref_count decrementation). If someone is writing only to the snapshot, __origin_write isn't invoked at all and the race can't happen. The race happens when someone writes to the snapshot --- this creates pending_exception with primary_pe == NULL and starts copying. Then, someone writes to the same chunk in the snapshot, and __origin_write races with termination of already submitted request in pending_complete (that calls put_pending_exception). This race may be reason for bugs: http://bugzilla.kernel.org/show_bug.cgi?id=11636 https://bugzilla.redhat.com/show_bug.cgi?id=465825 The patch fixes the code to make sure that: 1. If atomic_dec_and_test(&primary_pe->ref_count) returns false, the process must no longer dereference primary_pe (because someone else may free it under us). 2. If atomic_dec_and_test(&primary_pe->ref_count) returns true, the process is responsible for freeing primary_pe. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org
-
由 Kazuo Ito 提交于
Write throughput to LVM snapshot origin volume is an order of magnitude slower than those to LV without snapshots or snapshot target volumes, especially in the case of sequential writes with O_SYNC on. The following patch originally written by Kevin Jamieson and Jan Blunck and slightly modified for the current RCs by myself tries to improve the performance by modifying the behaviour of kcopyd, so that it pushes back an I/O job to the head of the job queue instead of the tail as process_jobs() currently does when it has to wait for free pages. This way, write requests aren't shuffled to cause extra seeks. I tested the patch against 2.6.27-rc5 and got the following results. The test is a dd command writing to snapshot origin followed by fsync to the file just created/updated. A couple of filesystem benchmarks gave me similar results in case of sequential writes, while random writes didn't suffer much. dd if=/dev/zero of=<somewhere on snapshot origin> bs=4096 count=... [conv=notrunc when updating] 1) linux 2.6.27-rc5 without the patch, write to snapshot origin, average throughput (MB/s) 10M 100M 1000M create,dd 511.46 610.72 11.81 create,dd+fsync 7.10 6.77 8.13 update,dd 431.63 917.41 12.75 update,dd+fsync 7.79 7.43 8.12 compared with write throughput to LV without any snapshots, all dd+fsync and 1000 MiB writes perform very poorly. 10M 100M 1000M create,dd 555.03 608.98 123.29 create,dd+fsync 114.27 72.78 76.65 update,dd 152.34 1267.27 124.04 update,dd+fsync 130.56 77.81 77.84 2) linux 2.6.27-rc5 with the patch, write to snapshot origin, average throughput (MB/s) 10M 100M 1000M create,dd 537.06 589.44 46.21 create,dd+fsync 31.63 29.19 29.23 update,dd 487.59 897.65 37.76 update,dd+fsync 34.12 30.07 26.85 Although still not on par with plain LV performance - cannot be avoided because it's copy on write anyway - this simple patch successfully improves throughtput of dd+fsync while not affecting the rest. Signed-off-by: NJan Blunck <jblunck@suse.de> Signed-off-by: NKazuo Ito <ito.kazuo@oss.ntt.co.jp> Signed-off-by: NAlasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org
-