- 19 3月, 2014 16 次提交
-
-
由 Kent Overstreet 提交于
This changes the bucket allocation reserves to use _real_ reserves - separate freelists - instead of watermarks, which if nothing else makes the current code saner to reason about and is going to be important in the future when we add support for multiple btrees. It also adds btree_check_reserve(), which checks (and locks) the reserves for both bucket allocation and memory allocation for btree nodes; the old code just kinda sorta assumed that since (e.g. for btree node splits) it had the root locked and that meant no other threads could try to make use of the same reserve; this technically should have been ok for memory allocation (we should always have a reserve for memory allocation (the btree node cache is used as a reserve and we preallocate it)), but multiple btrees will mean that locking the root won't be sufficient anymore, and for the bucket allocation reserve it was technically possible for the old code to deadlock. Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Kent Overstreet 提交于
With the locking rework in the last patch, this shouldn't be needed anymore - btree_node_write_work() only takes b->write_lock which is never held for very long. Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Kent Overstreet 提交于
Add a new lock, b->write_lock, which is required to actually modify - or write - a btree node; this lock is only held for short durations. This means we can write out a btree node without taking b->lock, which _is_ held for long durations - solving a deadlock when btree_flush_write() (from the journalling code) is called with a btree node locked. Right now just occurs in bch_btree_set_root(), but with an upcoming journalling rework is going to happen a lot more. This also turns b->lock is now more of a read/intent lock instead of a read/write lock - but not completely, since it still blocks readers. May turn it into a real intent lock at some point in the future. Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Kent Overstreet 提交于
This isn't a bulletproof fix; btree_node_free() -> bch_bucket_free() puts the bucket on the unused freelist, where it can be reused right away without any ordering requirements. It would be better to wait on at least a journal write to go down before reusing the bucket. bch_btree_set_root() does this, and inserting into non leaf nodes is completely synchronous so we should be ok, but future patches are just going to get rid of the unused freelist - it was needed in the past for various reasons but shouldn't be anymore. Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Kent Overstreet 提交于
This means the garbage collection code can better check for data and metadata pointers to the same buckets. Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Kent Overstreet 提交于
This will potentially save us an allocation when we've got inode/dirent bkeys that don't fit in the keylist's inline keys. Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Kent Overstreet 提交于
Break down data into clean data/dirty data/metadata. Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Kent Overstreet 提交于
Change the invalidate tracepoint to indicate how much data we're invalidating, and change the alloc tracepoints to indicate what offset they're for. Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Kent Overstreet 提交于
This hasn't been used or even enabled in ages. Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Nicholas Swenson 提交于
Signed-off-by: NNicholas Swenson <nks@daterainc.com>
-
由 Kent Overstreet 提交于
Avoid a potential null pointer deref (e.g. from check keys for cache misses) Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Nicholas Swenson 提交于
Deadlock happened because a foreground write slept, waiting for a bucket to be allocated. Normally the gc would mark buckets available for invalidation. But the moving_gc was stuck waiting for outstanding writes to complete. These writes used the bcache_wq, the same queue foreground writes used. This fix gives moving_gc its own work queue, so it was still finish moving even if foreground writes are stuck waiting for allocation. It also makes work queue a parameter to the data_insert path, so moving_gc can use its workqueue for writes. Signed-off-by: NNicholas Swenson <nks@daterainc.com> Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Kent Overstreet 提交于
blk_stack_limits() doesn't like a discard granularity of 0. Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Kent Overstreet 提交于
The on disk bucket gens are allowed to be out of date, when we reuse buckets that didn't have any live data in them. To deal with this, the initial gc has to update the bucket gen when we find a pointer gen newer than the bucket's gen. Unfortunately we weren't doing this for pointers in the journal that we're about to replay. Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Kent Overstreet 提交于
The code to fixup incorrect bucket prios incorrectly did not skip btree node freeing keys Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Kent Overstreet 提交于
On recovery we weren't correctly keeping track of what journal buckets had open journal entries, thus it was possible for them to be overwritten until we'd written all new journal entries. Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
- 18 3月, 2014 2 次提交
-
-
由 Kent Overstreet 提交于
Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Kent Overstreet 提交于
Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
- 26 2月, 2014 2 次提交
-
-
由 Kent Overstreet 提交于
Shutdown wasn't cancelling/waiting on journal_write_work() Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Kent Overstreet 提交于
The code was using sectors to count the number of sectors it was zeroing... but then it passed it to bio_advance()... after it had been set to 0. Amusing... Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
- 19 2月, 2014 1 次提交
-
-
由 Kent Overstreet 提交于
Use a bigger hammer this time Signed-off-by: NKent Overstreet <kmo@daterainc.com> Cc: linux-stable <stable@vger.kernel.org>
-
- 13 2月, 2014 1 次提交
-
-
由 Oleg Nesterov 提交于
Subsystems that want to register CPU hotplug callbacks, as well as perform initialization for the CPUs that are already online, often do it as shown below: get_online_cpus(); for_each_online_cpu(cpu) init_cpu(cpu); register_cpu_notifier(&foobar_cpu_notifier); put_online_cpus(); This is wrong, since it is prone to ABBA deadlocks involving the cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently with CPU hotplug operations). Interestingly, the raid5 code can actually prevent double initialization and hence can use the following simplified form of callback registration: register_cpu_notifier(&foobar_cpu_notifier); get_online_cpus(); for_each_online_cpu(cpu) init_cpu(cpu); put_online_cpus(); A hotplug operation that occurs between registering the notifier and calling get_online_cpus(), won't disrupt anything, because the code takes care to perform the memory allocations only once. So reorganize the code in raid5 this way to fix the deadlock with callback registration. Cc: linux-raid@vger.kernel.org Cc: stable@vger.kernel.org (v2.6.32+) Fixes: 36d1c647Signed-off-by: NOleg Nesterov <oleg@redhat.com> [Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the free_scratch_buffer() helper to condense code further and wrote the changelog.] Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 11 2月, 2014 1 次提交
-
-
由 Geert Uytterhoeven 提交于
drivers/md/bcache/extents.c: In function `btree_ptr_bad_expensive': drivers/md/bcache/extents.c:196: warning: format `%li' expects type `long int', but argument 4 has type `size_t' Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 2月, 2014 1 次提交
-
-
由 NeilBrown 提交于
commit 30bc9b53 md/raid1: fix bio handling problems in process_checks() Move the bio_reset() to a point before where BIO_UPTODATE is checked, so that check now always report that the bio is uptodate, even if it is not. This causes process_check() to sometimes treat read-errors as successful matches so the good data isn't written out. This patch preserves the flag until it is needed. Bug was introduced in 3.11, but backported to 3.10-stable (as it fixed an even worse bug). So suitable for any -stable since 3.10. Reported-and-tested-by: NMichael Tokarev <mjt@tls.msk.ru> Cc: stable@vger.kernel.org (3.10+) Fixed: 30bc9b53Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 30 1月, 2014 3 次提交
-
-
由 Nicholas Swenson 提交于
Signed-off-by: NNicholas Swenson <nks@daterainc.com>
-
由 Kent Overstreet 提交于
Signed-off-by: NKent Overstreet <kmo@daterainc.com>
-
由 Darrick J. Wong 提交于
The BUG_ON at the end of __bch_btree_mark_key can be triggered due to an integer overflow error: BITMASK(GC_SECTORS_USED, struct bucket, gc_mark, 2, 13); ... SET_GC_SECTORS_USED(g, min_t(unsigned, GC_SECTORS_USED(g) + KEY_SIZE(k), (1 << 14) - 1)); BUG_ON(!GC_SECTORS_USED(g)); In bcache.h, the SECTORS_USED bitfield is defined to be 13 bits wide. While the SET_ code tries to ensure that the field doesn't overflow by clamping it to (1<<14)-1 == 16383, this is incorrect because 16383 requires 14 bits. Therefore, if GC_SECTORS_USED() + KEY_SIZE() = 8192, the SET_ statement tries to store 8192 into a 13-bit field. In a 13-bit field, 8192 becomes zero, thus triggering the BUG_ON. Therefore, create a field width constant and a max value constant, and use those to create the bitfield and check the inputs to SET_GC_SECTORS_USED. Arguably the BITMASK() template ought to have BUG_ON checks for too-large values, but that's a separate patch. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 22 1月, 2014 3 次提交
-
-
由 Dongmao Zhang 提交于
In the cluster evironment, cluster write has poor performance because userspace_flush() has to contact a userspace program (cmirrord) for clear/mark/flush requests. But both mark and flush requests require cmirrord to communicate the message to all the cluster nodes for each flush call. This behaviour is really slow. To address this we now merge mark and flush requests together to reduce the kernel-userspace-kernel time. We allow a new directive, "integrated_flush" that can be used to instruct the kernel log code to combine flush and mark requests when directed by userspace. If not directed by userspace (due to an older version of the userspace code perhaps), the kernel will function as it did previously - preserving backwards compatibility. Additionally, flush requests are performed lazily when only clear requests exist. Signed-off-by: NDongmao Zhang <dmzhang@suse.com> Signed-off-by: NJonathan Brassow <jbrassow@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com>
-
由 NeilBrown 提交于
As release_stripe and __release_stripe decrement ->count and then manipulate ->lru both under ->device_lock, it is important that get_active_stripe() increments ->count and clears ->lru also under ->device_lock. However we currently list_del_init ->lru under the lock, but increment the ->count outside the lock. This can lead to races and list corruption. So move the atomic_inc(&sh->count) up inside the ->device_lock protected region. Note that we still increment ->count without device lock in the case where get_free_stripe() was called, and in fact don't take ->device_lock at all in that path. This is safe because if the stripe_head can be found by get_free_stripe, then the hash lock assures us the no-one else could possibly be calling release_stripe() at the same time. Fixes: 566c09c5 Cc: stable@vger.kernel.org (3.13) Reported-and-tested-by: NIan Kumlien <ian.kumlien@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Joe Thornber 提交于
This bug was introduced in commit 7e664b3d ("dm space map metadata: fix extending the space map"). When extending a dm-thin metadata volume we: - Switch the space map into a simple bootstrap mode, which allocates all space linearly from the newly added space. - Add new bitmap entries for the new space - Increment the reference counts for those newly allocated bitmap entries - Commit changes to disk - Switch back out of bootstrap mode. But, the disk commit may allocate space itself, if so this fact will be lost when switching out of bootstrap mode. The bug exhibited itself as an error when the bitmap_root, with an erroneous ref count of 0, was subsequently decremented as part of a later disk commit. This would cause the disk commit to fail, and thinp to enter read_only mode. The metadata was not damaged (thin_check passed). The fix is to put the increments + commit into a loop, running until the commit has not allocated extra space. In practise this loop only runs twice. With this fix the following device mapper testsuite test passes: dmtest run --suite thin-provisioning -n thin_remove_works_after_resize Signed-off-by: NJoe Thornber <ejt@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # depends on commit 7e664b3d
-
- 17 1月, 2014 1 次提交
-
-
由 Mike Snitzer 提交于
The cache's policy may have been established using the "default" alias, which is currently the "mq" policy but the default policy may change in the future. It is useful to know exactly which policy is being used. Add a 'real' member to the dm_cache_policy_type structure and have the "default" dm_cache_policy_type point to the real "mq" dm_cache_policy_type. Update dm_cache_policy_get_name() to check if real is set, if so report the name of the real policy (not the alias). Requested-by: NJonathan Brassow <jbrassow@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com>
-
- 16 1月, 2014 3 次提交
-
-
由 Mike Snitzer 提交于
Commit 787a996c ("dm thin: add error_if_no_space feature") mistakenly forgot to increase the number of feature args supported. Signed-off-by: NMike Snitzer <snitzer@redhat.com>
-
由 NeilBrown 提交于
Before a write starts we set a bit in the write-intent bitmap. When the write completes we clear that bit if the write was successful to all devices. However if the write wasn't fully successful we should not clear the bit. If the faulty drive is subsequently re-added, the fact that the bit is still set ensure that we will re-write the data that is missing. This logic is mediated by the STRIPE_DEGRADED flag - we only clear the bitmap bit when this flag is not set. Currently we correctly set the flag if a write starts when some devices are failed or missing. But we do *not* set the flag if some device failed during the write attempt. This is wrong and can result in clearing the bit inappropriately. So: set the flag when a write fails. This bug has been present since bitmaps were introduces, so the fix is suitable for any -stable kernel. Reported-by: NEthan Wilson <ethan.wilson@shiftmail.org> Cc: stable@vger.kernel.org Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Nicolas Schichan 提交于
Verify that the cmd parameter passed to md_ioctl() is valid before doing anything. This fixes mddev->hold_active being set to 0 when an invalid ioctl command is passed to md_ioctl() before the array has been configured. Clearing mddev->hold_active in that case can lead to a livelock situation when an invalid ioctl number is given to md_ioctl() by a process when the mddev is currently being opened by another process: Process 1 Process 2 --------- --------- md_alloc() mddev_find() -> returns a new mddev with hold_active == UNTIL_IOCTL add_disk() -> sends KOBJ_ADD uevent (sees KOBJ_ADD uevent for device) md_open() md_ioctl(INVALID_IOCTL) -> returns ENODEV and clears mddev->hold_active md_release() md_put() -> deletes the mddev as hold_active is 0 md_open() mddev_find() -> returns a newly allocated mddev with mddev->gendisk == NULL -> returns with ERESTARTSYS (kernel restarts the open syscall) Signed-off-by: NNicolas Schichan <nschichan@freebox.fr> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 15 1月, 2014 5 次提交
-
-
由 Mikulas Patocka 提交于
This reverts commit be35f486 ("dm: wait until embedded kobject is released before destroying a device") and provides an improved fix. The kobject release code that calls the completion must be placed in a non-module file, otherwise there is a module unload race (if the process calling dm_kobject_release is preempted and the DM module unloaded after the completion is triggered, but before dm_kobject_release returns). To fix this race, this patch moves the completion code to dm-builtin.c which is always compiled directly into the kernel if BLK_DEV_DM is selected. The patch introduces a new dm_kobject_holder structure, its purpose is to keep the completion and kobject in one place, so that it can be accessed from non-module code without the need to export the layout of struct mapped_device to that code. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
-
由 Mikulas Patocka 提交于
This patch modifies dm-snapshot so that it prefetches the buffers when loading the exceptions. The number of buffers read ahead is specified in the DM_PREFETCH_CHUNKS macro. The current value for DM_PREFETCH_CHUNKS (12) was found to provide the best performance on a single 15k SCSI spindle. In the future we may modify this default or make it configurable. Also, introduce the function dm_bufio_set_minimum_buffers to setup bufio's number of internal buffers before freeing happens. dm-bufio may hold more buffers if enough memory is available. There is no guarantee that the specified number of buffers will be available - if you need a guarantee, use the argument reserved_buffers for dm_bufio_client_create. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com>
-
由 Mikulas Patocka 提交于
Use dm-bufio for initial loading of the exceptions. Introduce a new function dm_bufio_forget that frees the given buffer. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com>
-
由 Mikulas Patocka 提交于
Change the functions get_exception, read_exception and insert_exceptions so that ps->area is passed as an argument. This patch doesn't change any functionality, but it refactors the code to allow for a cleaner switch over to using dm-bufio. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com>
-
由 Mikulas Patocka 提交于
The list of initial exceptions is loaded in the target constructor. We are allowed to allocate memory with GFP_KERNEL at this point. So, change alloc_completed_exception to use GFP_KERNEL when being called from the constructor. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com>
-
- 14 1月, 2014 1 次提交
-
-
由 NeilBrown 提交于
level_store() currently does not make sure the metadata is updates to reflect the new raid level. It simply sets MD_CHANGE_DEVS. Any level with a ->thread will quickly notice this and update the metadata. However RAID0 and Linear do not have a thread so no metadata update happens until the array is stopped. At that point the metadata is written. This is later that we would like. While the delay doesn't risk any data it can cause confusion. So if there is no md thread, immediately update the metadata after a level change. Reported-by: NRichard Michael <rmichael@edgeofthenet.org> Signed-off-by: NNeilBrown <neilb@suse.de>
-