- 11 November 2013, 25 commits
-
-
Submitted by Kent Overstreet

There was some looping in submit_partial_cache_hit() and submit_partial_cache_miss() that isn't needed anymore - originally, we wouldn't necessarily process the full hit or miss all at once because when splitting the bio, we took into account the restrictions of the device we were sending it to. But device bio size restrictions are now handled elsewhere, with a wrapper around generic_make_request() - so that looping has been unnecessary for a while now and we can do quite a bit of cleanup. And if we trim the key we're reading from to match the subset we're actually reading, we don't have to explicitly calculate bi_sector anymore. Neat.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

This is a fairly straightforward conversion, mostly reshuffling - op->lookup_done goes away, replaced by MAP_DONE/MAP_CONTINUE. And the code for handling cache hits and misses wasn't really btree code, so it gets moved to request.c.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

With the new btree_map() functions, we don't need to export the stuff needed for traversing the btree anymore.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

Lots of stuff has been open coding its own btree traversal - which is generally pretty simple code, but there are a few subtleties. This adds two new functions, bch_btree_map_nodes() and bch_btree_map_keys(), which do the traversal for you. Everything that's open coding btree traversal now (with the exception of garbage collection) is slowly going to be converted to these two functions; being able to write other code at a higher level of abstraction is a big improvement w.r.t. overall code quality.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
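To make the shape of these helpers concrete, here is a minimal, self-contained sketch of the map-over-the-tree-with-a-callback pattern. The types and names are toys invented for illustration, not bcache's actual bch_btree_map_keys() signature: the walker owns the traversal, and the callback returns MAP_DONE to stop early or MAP_CONTINUE to keep going.

```c
enum map_ret { MAP_CONTINUE, MAP_DONE };

struct node {				/* toy binary-tree node, not a bcache btree node */
	int key;
	struct node *left, *right;
};

typedef enum map_ret (*map_key_fn)(void *ctx, int key);

/* In-order walk; stops as soon as the callback says it is done. */
static enum map_ret map_keys(struct node *n, map_key_fn fn, void *ctx)
{
	if (!n)
		return MAP_CONTINUE;
	if (map_keys(n->left, fn, ctx) == MAP_DONE)
		return MAP_DONE;
	if (fn(ctx, n->key) == MAP_DONE)
		return MAP_DONE;
	return map_keys(n->right, fn, ctx);
}
```

Callers can then express lookups, writeback scans, and similar passes as small callbacks instead of re-implementing the traversal and its locking subtleties each time.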
-
Submitted by Kent Overstreet

This simplifies the writeback flow control quite a bit - previously, it was conceptually two coroutines, refill_dirty() and read_dirty(). This makes the code quite a bit more straightforward.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

We needed a dedicated rescuer workqueue for gc anyway... and gc was conceptually a dedicated thread, just one that wasn't running all the time. Switch it to a dedicated thread to make the code a bit more straightforward.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
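As an illustration of the "dedicated thread" shape this commit moves gc to, here is a hedged sketch; the context struct, function names, and thread name are invented, not bcache's. The thread sleeps on a waitqueue until it is kicked, runs one pass, and loops until asked to stop.

```c
#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/err.h>
#include <linux/types.h>

struct gc_ctx {				/* hypothetical context, for illustration only */
	struct task_struct	*thread;
	wait_queue_head_t	wait;
	bool			wake;
};

static int gc_thread_fn(void *data)
{
	struct gc_ctx *gc = data;

	while (!kthread_should_stop()) {
		wait_event_interruptible(gc->wait,
					 gc->wake || kthread_should_stop());
		gc->wake = false;
		/* ... run one garbage collection pass here ... */
	}
	return 0;
}

static int gc_start(struct gc_ctx *gc)
{
	init_waitqueue_head(&gc->wait);
	gc->wake = false;
	gc->thread = kthread_run(gc_thread_fn, gc, "gc_example");
	return IS_ERR(gc->thread) ? PTR_ERR(gc->thread) : 0;
}
```

Whoever decides gc is needed then just sets gc->wake and calls wake_up(&gc->wait).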
-
Submitted by Kent Overstreet

At one point we did do fancy asynchronous waiting stuff with bucket_wait, but that's all gone (and bucket_wait is used a lot less than it used to be). So use the standard primitives.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

We never waited on c->try_wait asynchronously, so just use the standard primitives.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

Slowly working on pruning struct btree_op - the aim is for it to only contain things that are actually necessary for traversing the btree.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

Making things less asynchronous that don't need to be - bch_journal() only has to block when the journal or journal entry is full, which is emphatically not a fast path. So make it a normal function that just returns when it finishes, to make the code and control flow easier to follow.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

More refactoring, and renaming.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

Try to improve some of the naming a bit to be more consistent, and also improve the flow of control in request_write() a bit.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

More random refactoring.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

Some refactoring - better to explicitly pass stuff around instead of having it all in the "big bag of state", struct btree_op. Going to prune struct btree_op quite a bit over time.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

This was the main point of all this refactoring - now, btree_insert_check_key() won't fail just because the leaf node happened to be full.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

We'll often end up with a list of adjacent keys to insert - because bch_data_insert() may have to fragment the data it writes.

Originally, to simplify things and avoid having to deal with corner cases, bch_btree_insert() would pass keys from this list one at a time to btree_insert_recurse() - mainly because the list of keys might span leaf nodes, so it was easier this way.

With the btree_insert_node() refactoring, it's now a lot easier to just pass down the whole list and have btree_insert_recurse() iterate over leaf nodes until it's done.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

The flow of control in the old btree insertion code was rather backwards; we'd recurse down the btree (in btree_insert_recurse()), and then if we needed to split, the keys to be inserted into the parent node would effectively be returned up to btree_insert_recurse(), which would notice there was more work to do and finish the insertion.

The main problem with this was that the full logic for btree insertion could only be used by calling btree_insert_recurse(); if you'd gotten to a btree leaf some other way and had a key to insert, and it turned out that node needed to be split, you were SOL.

This inverts the flow of control so btree_insert_node() does _full_ btree insertion, including splitting - and takes a (leaf) btree node to insert into as a parameter.

This means we can now _correctly_ handle cache misses - for cache misses, we need to insert a fake "check" key into the btree when we discover we have a cache miss, while we still have the btree locked. Previously, if the btree node was full, inserting a cache miss would just fail.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

This is prep work for the reworked btree insertion code.

The way we set b->parent is ugly and hacky... the problem is, when btree_split() or garbage collection splits or rewrites a btree node, the parent changes for all its (potentially already cached) children. I may change this later and add some code to look through the btree node cache, find all our cached child nodes, and change the parent pointer then...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

Checking i->seq was redundant, because since ages ago we always initialize the new bset when advancing b->written.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

Originally I got this right... except that the divides didn't use do_div(), which broke 32 bit kernels. When I went to fix that, I forgot that the raid stripe size usually isn't a power of two... doh.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
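A minimal sketch of the pattern involved (the helper name is made up, and this is not the actual bcache code): dividing a 64-bit offset by a non-power-of-two stripe size on a 32-bit kernel has to go through do_div(), since a plain u64 division would pull in libgcc helpers, and shifting/masking only works when the stripe size is a power of two.

```c
#include <linux/types.h>
#include <asm/div64.h>

/*
 * Split a 64-bit device offset into (stripe index, offset within
 * stripe).  do_div() stores the quotient back into its first
 * argument and returns the remainder - exactly the pair we need -
 * and it works on 32-bit kernels where "offset / stripe_size" on a
 * u64 would not.
 */
static inline u64 offset_to_stripe_example(u64 offset,
					    unsigned int stripe_size,
					    unsigned int *offset_in_stripe)
{
	*offset_in_stripe = do_div(offset, stripe_size);
	return offset;		/* quotient == stripe index */
}
```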
-
Submitted by Kent Overstreet

Works kind of like the ext4 setting, to panic or remount read only on errors.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

The old asynchronous discard code was really a relic from when all the allocation code was asynchronous - now that allocation runs out of a dedicated thread there's no point in keeping around all that complicated machinery.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Submitted by Kent Overstreet

bch_keybuf_del() takes a spinlock that can't be taken in interrupt context - whoops. Fortunately, this code isn't enabled by default (you have to toggle a sysfs thing).

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
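For context, a hedged sketch of the bug class (the lock and function names here are invented, and this is not necessarily the exact fix the commit made): a lock that can ever be taken from interrupt or completion context must be taken with the irqsave variants everywhere, otherwise an interrupt arriving on the same CPU while the lock is held deadlocks.

```c
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(keybuf_lock_example);	/* hypothetical lock */

/* Safe to call from IRQ context as well as process context. */
static void keybuf_del_example(void)
{
	unsigned long flags;

	spin_lock_irqsave(&keybuf_lock_example, flags);
	/* ... unlink the key from the buffer ... */
	spin_unlock_irqrestore(&keybuf_lock_example, flags);
}
```

The other common fix is to avoid calling into such code from IRQ context at all and defer the work to process context instead.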
-
Submitted by Kent Overstreet
-
Submitted by Kent Overstreet

Dirty data accounting wasn't quite right - firstly, we were adding the key we're inserting after it could have merged with another dirty key already in the btree, and secondly we could sometimes pass the wrong offset to bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is important when tracking dirty data by stripe.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
-
- 09 November 2013, 1 commit
-
-
Submitted by Kent Overstreet

Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on, we should know better than this.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 24 October 2013, 4 commits
-
-
Submitted by Shaohua Li

SCSI discard will damage the discard stripe's bio settings, e.g. some fields are changed. If the stripe is reused very soon, we have wrong bio settings. We remove the discard stripe from the hash list, so next time the stripe will be fully initialized.

Suitable for backport to 3.7+.

Cc: <stable@vger.kernel.org> (3.7+)
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
-
Submitted by Shaohua Li

The SCSI layer will add a new payload for a discard request. If two bios are merged into one, the second bio has bi_vcnt 1, which is set in raid5. This will confuse SCSI and cause an oops.

Suitable for backport to 3.7+.

Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
-
Submitted by Bian Yu

When operating on a hard disk and hitting errors, md_set_badblocks is called after scsi_restart_operations, which has already disabled interrupts. But md_set_badblocks calls write_sequnlock_irq and re-enables interrupts, so a softirq can preempt the current thread and that may cause a deadlock. This situation should use write_sequnlock_irqsave/irqrestore instead.

I met the situation and the call trace is below:

[ 638.919974] BUG: spinlock recursion on CPU#0, scsi_eh_13/1010
[ 638.921923] lock: 0xffff8800d4d51fc8, .magic: dead4ead, .owner: scsi_eh_13/1010, .owner_cpu: 0
[ 638.923890] CPU: 0 PID: 1010 Comm: scsi_eh_13 Not tainted 3.12.0-rc5+ #37
[ 638.925844] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS 4.6.5 03/05/2013
[ 638.927816] ffff880037ad4640 ffff880118c03d50 ffffffff8172ff85 0000000000000007
[ 638.929829] ffff8800d4d51fc8 ffff880118c03d70 ffffffff81730030 ffff8800d4d51fc8
[ 638.931848] ffffffff81a72eb0 ffff880118c03d90 ffffffff81730056 ffff8800d4d51fc8
[ 638.933884] Call Trace:
[ 638.935867] <IRQ> [<ffffffff8172ff85>] dump_stack+0x55/0x76
[ 638.937878] [<ffffffff81730030>] spin_dump+0x8a/0x8f
[ 638.939861] [<ffffffff81730056>] spin_bug+0x21/0x26
[ 638.941836] [<ffffffff81336de4>] do_raw_spin_lock+0xa4/0xc0
[ 638.943801] [<ffffffff8173f036>] _raw_spin_lock+0x66/0x80
[ 638.945747] [<ffffffff814a73ed>] ? scsi_device_unbusy+0x9d/0xd0
[ 638.947672] [<ffffffff8173fb1b>] ? _raw_spin_unlock+0x2b/0x50
[ 638.949595] [<ffffffff814a73ed>] scsi_device_unbusy+0x9d/0xd0
[ 638.951504] [<ffffffff8149ec47>] scsi_finish_command+0x37/0xe0
[ 638.953388] [<ffffffff814a75e8>] scsi_softirq_done+0xa8/0x140
[ 638.955248] [<ffffffff8130e32b>] blk_done_softirq+0x7b/0x90
[ 638.957116] [<ffffffff8104fddd>] __do_softirq+0xfd/0x330
[ 638.958987] [<ffffffff810b964f>] ? __lock_release+0x6f/0x100
[ 638.960861] [<ffffffff8174a5cc>] call_softirq+0x1c/0x30
[ 638.962724] [<ffffffff81004c7d>] do_softirq+0x8d/0xc0
[ 638.964565] [<ffffffff8105024e>] irq_exit+0x10e/0x150
[ 638.966390] [<ffffffff8174ad4a>] smp_apic_timer_interrupt+0x4a/0x60
[ 638.968223] [<ffffffff817499af>] apic_timer_interrupt+0x6f/0x80
[ 638.970079] <EOI> [<ffffffff810b964f>] ? __lock_release+0x6f/0x100
[ 638.971899] [<ffffffff8173fa6a>] ? _raw_spin_unlock_irq+0x3a/0x50
[ 638.973691] [<ffffffff8173fa60>] ? _raw_spin_unlock_irq+0x30/0x50
[ 638.975475] [<ffffffff81562393>] md_set_badblocks+0x1f3/0x4a0
[ 638.977243] [<ffffffff81566e07>] rdev_set_badblocks+0x27/0x80
[ 638.978988] [<ffffffffa00d97bb>] raid5_end_read_request+0x36b/0x4e0 [raid456]
[ 638.980723] [<ffffffff811b5a1d>] bio_endio+0x1d/0x40
[ 638.982463] [<ffffffff81304ff3>] req_bio_endio.isra.65+0x83/0xa0
[ 638.984214] [<ffffffff81306b9f>] blk_update_request+0x7f/0x350
[ 638.985967] [<ffffffff81306ea1>] blk_update_bidi_request+0x31/0x90
[ 638.987710] [<ffffffff813085e0>] __blk_end_bidi_request+0x20/0x50
[ 638.989439] [<ffffffff8130862f>] __blk_end_request_all+0x1f/0x30
[ 638.991149] [<ffffffff81308746>] blk_peek_request+0x106/0x250
[ 638.992861] [<ffffffff814a62a9>] ? scsi_kill_request.isra.32+0xe9/0x130
[ 638.994561] [<ffffffff814a633a>] scsi_request_fn+0x4a/0x3d0
[ 638.996251] [<ffffffff813040a7>] __blk_run_queue+0x37/0x50
[ 638.997900] [<ffffffff813045af>] blk_run_queue+0x2f/0x50
[ 638.999553] [<ffffffff814a5750>] scsi_run_queue+0xe0/0x1c0
[ 639.001185] [<ffffffff814a7721>] scsi_run_host_queues+0x21/0x40
[ 639.002798] [<ffffffff814a2e87>] scsi_restart_operations+0x177/0x200
[ 639.004391] [<ffffffff814a4fe9>] scsi_error_handler+0xc9/0xe0
[ 639.005996] [<ffffffff814a4f20>] ? scsi_unjam_host+0xd0/0xd0
[ 639.007600] [<ffffffff81072f6b>] kthread+0xdb/0xe0
[ 639.009205] [<ffffffff81072e90>] ? flush_kthread_worker+0x170/0x170
[ 639.010821] [<ffffffff81748cac>] ret_from_fork+0x7c/0xb0
[ 639.012437] [<ffffffff81072e90>] ? flush_kthread_worker+0x170/0x170

This bug was introduced in commit 2e8ac303 (the first time rdev_set_badblocks was called from interrupt context), so this patch is appropriate for 3.5 and subsequent kernels.

Cc: <stable@vger.kernel.org> (3.5+)
Signed-off-by: Bian Yu <bianyu@kedacom.com>
Reviewed-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
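A minimal sketch of the locking pattern the fix moves to (the lock and function names here are illustrative, not md's): the _irqsave/_irqrestore pair preserves whatever interrupt state the caller had, instead of unconditionally re-enabling interrupts the way write_sequnlock_irq() does, which is what bites when the caller (the SCSI error-handler path above) already runs with interrupts disabled.

```c
#include <linux/seqlock.h>

static DEFINE_SEQLOCK(bb_lock_example);		/* hypothetical badblocks seqlock */

static void set_badblock_example(void)
{
	unsigned long flags;

	/* Works whether or not the caller already disabled interrupts. */
	write_seqlock_irqsave(&bb_lock_example, flags);
	/* ... insert the bad block range ... */
	write_sequnlock_irqrestore(&bb_lock_example, flags);
	/* Interrupt state is now exactly as it was on entry. */
}
```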
-
Submitted by Lukasz Dorau

Since commit 7ceb17e8 ("md: Allow devices to be re-added to a read-only array."), spares are activated on a read-only array. For the raid1 and raid10 personalities this causes not-in-sync devices to be marked in-sync without checking whether recovery has finished.

If a read-only array is degraded and one of its devices is not in-sync (because the array has been only partially recovered), recovery will be skipped.

This patch adds a check that recovery has finished before marking a device in-sync for the raid1 and raid10 personalities. For the raid5 personality such a check is already present (at raid5.c:6029).

The bug was introduced in 3.10 and causes data corruption.

Cc: stable@vger.kernel.org
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
-
- 23 October 2013, 1 commit
-
-
Submitted by Kent Overstreet

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 16 October 2013, 1 commit
-
-
Submitted by Mikulas Patocka

This patch fixes a particular type of data corruption that has been encountered when loading a snapshot's metadata from disk.

When we allocate a new chunk in persistent_prepare, we increment ps->next_free and we make sure that it doesn't point to a metadata area by further incrementing it if necessary.

When we load metadata from disk on device activation, ps->next_free is positioned after the last used data chunk. However, if this last used data chunk is followed by a metadata area, ps->next_free is positioned erroneously to the metadata area. A newly-allocated chunk is placed at the same location as the metadata area, resulting in data or metadata corruption.

This patch changes the code so that ps->next_free skips the metadata area when metadata are loaded in function read_exceptions.

The patch also moves a piece of code from persistent_prepare_exception to a separate function skip_metadata to avoid code duplication.

CVE-2013-4299

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
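To make the fix easier to picture, here is a hedged sketch of the skip logic under an assumed on-disk layout of a header chunk followed by repeating groups of one metadata chunk and exceptions_per_area data chunks. The names, the HDR_CHUNKS constant, and the helper itself are illustrative, not the exact dm-snap-persistent code.

```c
#include <linux/types.h>
#include <asm/div64.h>

#define HDR_CHUNKS 1	/* assumption: the snapshot header occupies one chunk */

/*
 * Assumed layout: [header][metadata][data x N][metadata][data x N]...
 * so chunk index c is a metadata chunk iff c % (N + 1) == HDR_CHUNKS.
 * If next_free landed on a metadata chunk, step past it.
 */
static void skip_metadata_example(u64 *next_free, u32 exceptions_per_area)
{
	u32 stride = exceptions_per_area + 1;
	u64 c = *next_free;

	if (do_div(c, stride) == HDR_CHUNKS)
		(*next_free)++;
}
```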
-
- 11 October 2013, 1 commit
-
-
Submitted by Kent Overstreet

Commit c0f04d88 ("bcache: Fix flushes in writeback mode") was fixing a reported data corruption bug, but it seems some last minute refactoring or rebasing introduced a null pointer deref.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 25 September 2013, 7 commits
-
-
Submitted by Kent Overstreet

In writeback mode, when we get a cache flush we need to make sure we issue a flush to the backing device. The code for sending down an extra flush was wrong - by cloning the bio we were probably getting flags that didn't make sense for a bare flush, and also the old code was firing for FUA bios, for which we don't need to send a flush to the backing device.

This was causing data corruption somehow - the mechanism was never determined, but this patch fixes it for the users that were seeing it.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Kent Overstreet

btree_sort_fixup() was overly clever, because it was trying to avoid pulling a key off the btree iterator in more than one place. This led to a really obscure bug where we'd break early from the loop in btree_sort_fixup() if the current key overlapped with keys in more than one older set, and the next key it overlapped with was zero size.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Kent Overstreet

GFP_NOIO means we could be getting called recursively - mca_alloc() -> mca_data_alloc() - so we definitely can't use mutex_lock(bucket_lock) then. Whoops.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
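One common defence against this kind of reclaim recursion, shown here as a hedged sketch (the lock name is invented, and this is not necessarily the exact change the commit made): the path that can be entered from memory reclaim only proceeds if it can take the lock without blocking.

```c
#include <linux/mutex.h>

static DEFINE_MUTEX(bucket_lock_example);	/* hypothetical lock */

/*
 * May be entered from reclaim triggered by an allocation that was
 * itself done under bucket_lock_example; mutex_lock() here could
 * self-deadlock, so back off if the lock is contended.
 */
static int shrink_example(void)
{
	if (!mutex_trylock(&bucket_lock_example))
		return 0;	/* possibly recursing - give up this time */

	/* ... free some cached btree nodes ... */

	mutex_unlock(&bucket_lock_example);
	return 1;
}
```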
-
Submitted by Kent Overstreet

schedule_timeout() != schedule_timeout_uninterruptible()

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
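A minimal illustration of the difference (not the bcache code itself): a bare schedule_timeout() only sleeps if the task state was changed first, so called while still TASK_RUNNING it returns almost immediately, whereas schedule_timeout_uninterruptible() sets the state for you before sleeping.

```c
#include <linux/sched.h>
#include <linux/jiffies.h>

static void delay_example(void)
{
	/* Buggy: task is still TASK_RUNNING, so this doesn't really sleep. */
	schedule_timeout(msecs_to_jiffies(100));

	/* Correct: sets TASK_UNINTERRUPTIBLE, then sleeps for ~100 ms. */
	schedule_timeout_uninterruptible(msecs_to_jiffies(100));
}
```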
-
Submitted by Kent Overstreet

bch_journal_meta() was missing the flush to make the journal write actually go down (instead of waiting up to journal_delay_ms)... Whoops.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Kent Overstreet

Background writeback works by scanning the btree for dirty data and adding those keys into a fixed size buffer, then, for each dirty key in the keybuf, writing it to the backing device.

When read_dirty() finishes and it's time to scan for more dirty data, we need to wait for the outstanding writeback IO to finish - they still take up slots in the keybuf (so that foreground writes can check for them to avoid races). Without that wait, we'll continually rescan when we'll be able to add at most a key or two to the keybuf, and that takes locks that starve foreground IO. Doh.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Geert Uytterhoeven

Fix

drivers/md/bcache/btree.c: In function ‘bch_btree_node_read’:
drivers/md/bcache/btree.c:259: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘size_t’

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
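The general fix for this warning class, as a tiny hedged example (the function and the message are made up, not the actual bcache change): size_t arguments are printed with %zu, which is correct on every architecture, instead of %lu, which only matches where size_t happens to be unsigned long.

```c
#include <linux/printk.h>
#include <linux/types.h>

static void report_read_error_example(size_t bytes)
{
	/* %zu matches size_t everywhere; %lu warns on e.g. 32-bit m68k */
	pr_err("btree read error at %zu bytes\n", bytes);
}
```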
-