- 25 Nov, 2015 1 commit
-
-
Committed by Christoph Hellwig
We only added the request to the request list for the !blk-mq case, so we should only delete it in that case as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 24 Nov, 2015 3 commits
-
-
Committed by Ming Lei
We have seen many reports of this kind of issue, so add a warning in blk-merge so that it can be triggered easily without depending on a warning/bug from drivers. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Ming Lei
Commit bdced438 (block: setup bi_phys_segments after splitting) introduced the computation of bio->bi_phys_segments during bio splitting. Unfortunately, neither bio->bi_seg_front_size nor bio->bi_seg_back_size is computed, so too many physical segments may be obtained for one request, since both are used to check whether one segment can span two bios. This patch fixes the issue by computing the two variables in blk_bio_segment_split(). Fixes: bdced438(block: setup bi_phys_segments after splitting) Reported-by: Michael Ellerman <mpe@ellerman.id.au> Reported-by: Mark Salter <msalter@redhat.com> Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: Mark Salter <msalter@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Ming Lei
Inside blk_bio_segment_split(), the previous bvec pointer (bvprvp) always points to the iterator local variable, which is obviously wrong, so fix it by pointing it to the local variable 'bvprv'. Fixes: 5014c311(block: fix bogus compiler warnings in blk-merge.c) Cc: stable@kernel.org #4.3 Reported-by: Michael Ellerman <mpe@ellerman.id.au> Reported-by: Mark Salter <msalter@redhat.com> Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: Mark Salter <msalter@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
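The aliasing problem described above can be illustrated in plain userspace C. This is only a sketch of the pattern, not the kernel code: `struct sketch_vec` and both function names are hypothetical, and summing the "previous" lengths is merely a way to make the difference observable.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct bio_vec. */
struct sketch_vec {
    unsigned int len;
};

/* Buggy pattern: the "previous" pointer aliases the iterator variable. */
unsigned int sum_prev_len_buggy(const struct sketch_vec *v, size_t n)
{
    struct sketch_vec bv;
    const struct sketch_vec *bvprvp = NULL;
    unsigned int sum = 0;

    for (size_t i = 0; i < n; i++) {
        bv = v[i];                  /* iterator variable is overwritten here */
        if (bvprvp)
            sum += bvprvp->len;     /* so this reads the current element */
        bvprvp = &bv;
    }
    return sum;
}

/* Fixed pattern: keep a separate copy of the previous element. */
unsigned int sum_prev_len_fixed(const struct sketch_vec *v, size_t n)
{
    struct sketch_vec bv, bvprv;
    const struct sketch_vec *bvprvp = NULL;
    unsigned int sum = 0;

    for (size_t i = 0; i < n; i++) {
        bv = v[i];
        if (bvprvp)
            sum += bvprvp->len;     /* reads the element from the last pass */
        bvprv = bv;                 /* snapshot before the next overwrite */
        bvprvp = &bvprv;
    }
    return sum;
}
```

For the lengths 1, 2, 3 the fixed variant sums the true predecessors (1 + 2), while the buggy variant sums the current elements (2 + 3).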
-
- 21 Nov, 2015 1 commit
-
-
Committed by Jens Axboe
Liu reported that running certain parts of xfstests threw the following error: BUG: sleeping function called from invalid context at mm/page_alloc.c:3190 in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0 3 locks held by kworker/u16:0/6: #0: ("writeback"){++++.+}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730 #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730 #2: (&type->s_umount_key#44){+++++.}, at: [<ffffffff811e6805>] trylock_super+0x25/0x60 CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G OE 4.3.0+ #3 Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011 Workqueue: writeback wb_workfn (flush-btrfs-108) ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab 0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8 ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76 Call Trace: [<ffffffff8130191b>] dump_stack+0x4f/0x74 [<ffffffff8108ed95>] ___might_sleep+0x185/0x240 [<ffffffff8108eea2>] __might_sleep+0x52/0x90 [<ffffffff811817e8>] __alloc_pages_nodemask+0x268/0x410 [<ffffffff8109a43c>] ? sched_clock_local+0x1c/0x90 [<ffffffff8109a6d1>] ? local_clock+0x21/0x40 [<ffffffff810b9eb0>] ? __lock_release+0x420/0x510 [<ffffffff810b534c>] ? __lock_acquired+0x16c/0x3c0 [<ffffffff811ca265>] alloc_pages_current+0xc5/0x210 [<ffffffffa0577105>] ? rbio_is_full+0x55/0x70 [btrfs] [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0 [<ffffffff81666d50>] ? _raw_spin_unlock_irqrestore+0x40/0x60 [<ffffffffa0578c0a>] full_stripe_write+0x5a/0xc0 [btrfs] [<ffffffffa0578ca9>] __raid56_parity_write+0x39/0x60 [btrfs] [<ffffffffa0578deb>] run_plug+0x11b/0x140 [btrfs] [<ffffffffa0578e33>] btrfs_raid_unplug+0x23/0x70 [btrfs] [<ffffffff812d36c2>] blk_flush_plug_list+0x82/0x1f0 [<ffffffff812e0349>] blk_sq_make_request+0x1f9/0x740 [<ffffffff812ceba2>] ? generic_make_request_checks+0x222/0x7c0 [<ffffffff812cf264>] ? blk_queue_enter+0x124/0x310 [<ffffffff812cf1d2>] ? 
blk_queue_enter+0x92/0x310 [<ffffffff812d0ae2>] generic_make_request+0x172/0x2c0 [<ffffffff812d0ad4>] ? generic_make_request+0x164/0x2c0 [<ffffffff812d0ca0>] submit_bio+0x70/0x140 [<ffffffffa0577b29>] ? rbio_add_io_page+0x99/0x150 [btrfs] [<ffffffffa0578a89>] finish_rmw+0x4d9/0x600 [btrfs] [<ffffffffa0578c4c>] full_stripe_write+0x9c/0xc0 [btrfs] [<ffffffffa057ab7f>] raid56_parity_write+0xef/0x160 [btrfs] [<ffffffffa052bd83>] btrfs_map_bio+0xe3/0x2d0 [btrfs] [<ffffffffa04fbd6d>] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs] [<ffffffffa05173c4>] submit_one_bio+0x74/0xb0 [btrfs] [<ffffffffa0517f55>] submit_extent_page+0xe5/0x1c0 [btrfs] [<ffffffffa0519b18>] __extent_writepage_io+0x408/0x4c0 [btrfs] [<ffffffffa05179c0>] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs] [<ffffffffa051dc88>] __extent_writepage+0x218/0x3a0 [btrfs] [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0 [<ffffffffa051e2c9>] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs] [<ffffffffa051e422>] extent_writepages+0x52/0x70 [btrfs] [<ffffffffa05001f0>] ? btrfs_set_inode_index+0x70/0x70 [btrfs] [<ffffffffa04fcc17>] btrfs_writepages+0x27/0x30 [btrfs] [<ffffffff81184df3>] do_writepages+0x23/0x40 [<ffffffff81212229>] __writeback_single_inode+0x89/0x4d0 [<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480 [<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480 [<ffffffff8121295f>] ? writeback_sb_inodes+0x15f/0x480 [<ffffffff81212ad2>] writeback_sb_inodes+0x2d2/0x480 [<ffffffff810b1397>] ? down_read_trylock+0x57/0x60 [<ffffffff811e6805>] ? trylock_super+0x25/0x60 [<ffffffff810d629f>] ? rcu_read_lock_sched_held+0x4f/0x90 [<ffffffff81212d0c>] __writeback_inodes_wb+0x8c/0xc0 [<ffffffff812130b5>] wb_writeback+0x2b5/0x500 [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0 [<ffffffff810660a8>] ? __local_bh_enable_ip+0x68/0xc0 [<ffffffff81213362>] ? wb_do_writeback+0x62/0x310 [<ffffffff812133c1>] wb_do_writeback+0xc1/0x310 [<ffffffff8107c3d9>] ? 
set_worker_desc+0x79/0x90 [<ffffffff81213842>] wb_workfn+0x92/0x330 [<ffffffff8107f133>] process_one_work+0x223/0x730 [<ffffffff8107f083>] ? process_one_work+0x173/0x730 [<ffffffff8108035f>] ? worker_thread+0x18f/0x430 [<ffffffff810802ed>] worker_thread+0x11d/0x430 [<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0 [<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0 [<ffffffff810858df>] kthread+0xef/0x110 [<ffffffff8108f74e>] ? schedule_tail+0x1e/0xd0 [<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70 [<ffffffff816673bf>] ret_from_fork+0x3f/0x70 [<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70 The issue is that we've got the software context pinned while calling blk_flush_plug_list(), which flushes callbacks that are allowed to sleep. btrfs and raid have such callbacks. Flip the checks around a bit, so we can enable preempt a bit earlier and flush plugs without having preempt disabled. This only affects blk-mq driven devices, and only those that register a single queue. Reported-by: Liu Bo <bo.li.liu@oracle.com> Tested-by: Liu Bo <bo.li.liu@oracle.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 20 Nov, 2015 2 commits
-
-
Committed by Kees Cook
If md->signature == MAC_DRIVER_MAGIC and md->block_size == 1023, a single 512 byte sector would be read (secsize / 512). However the partition structure would be located past the end of the buffer (secsize % 512). Signed-off-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
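The arithmetic behind this report can be worked through with a small sketch. The helper names and the 64-byte structure size used in the usage note are assumptions for illustration only; the real check lives in the Mac partition parser and may look different.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_SECTOR_SIZE 512u

/* Index of the 512-byte sector that is read for a given block size. */
unsigned int part_sector(unsigned int secsize)
{
    return secsize / SKETCH_SECTOR_SIZE;
}

/* Offset of the partition structure inside that 512-byte buffer. */
unsigned int part_offset(unsigned int secsize)
{
    return secsize % SKETCH_SECTOR_SIZE;
}

/* True if a structure of part_size bytes at that offset stays in bounds. */
bool part_fits(unsigned int secsize, size_t part_size)
{
    return part_offset(secsize) + part_size <= SKETCH_SECTOR_SIZE;
}
```

With a block size of 1023 the offset is 511, so any structure larger than one byte (for example a hypothetical 64-byte one) would extend past the end of the 512-byte buffer, while a block size of 512 gives offset 0 and fits.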
-
Committed by Dan Williams
Fix use after free crashes like the following: general protection fault: 0000 [#1] SMP Call Trace: [<ffffffffa0050216>] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem] [<ffffffffa0050ba2>] pmem_rw_page+0x42/0x80 [nd_pmem] [<ffffffff8128fd90>] bdev_read_page+0x50/0x60 [<ffffffff812972f0>] do_mpage_readpage+0x510/0x770 [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20 [<ffffffff811d86dc>] ? lru_cache_add+0x1c/0x50 [<ffffffff81297657>] mpage_readpages+0x107/0x170 [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20 [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20 [<ffffffff8129058d>] blkdev_readpages+0x1d/0x20 [<ffffffff811d615f>] __do_page_cache_readahead+0x28f/0x310 [<ffffffff811d6039>] ? __do_page_cache_readahead+0x169/0x310 [<ffffffff811c5abd>] ? pagecache_get_page+0x2d/0x1d0 [<ffffffff811c76f6>] filemap_fault+0x396/0x530 [<ffffffff811f816e>] __do_fault+0x4e/0xf0 [<ffffffff811fce7d>] handle_mm_fault+0x11bd/0x1b50 Cc: <stable@vger.kernel.org> Cc: Jens Axboe <axboe@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Matthew Wilcox <willy@linux.intel.com> [willy: symmetry fixups] Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-
- 17 Nov, 2015 2 commits
-
-
Committed by Jan Kara
Currently blk_insert_flush() just adds the flush request to q->queue_head when a flush is not required. That completely bypasses the IO scheduler, so e.g. CFQ can be idling waiting for a new request to arrive and will idle through the whole window unnecessarily. Luckily this only happens in rare cases, as the checks in generic_make_request_checks() usually clear the FLUSH and FUA flags early if they are not needed. When no flushing is actually required, we can easily fix the problem by properly queueing the request through the IO scheduler. Ideally the IO scheduler should also be made aware of requests queued via blk_flush_queue_rq(). However, inserting a flush request through the IO scheduler can have unwanted side effects: due to flush batching, delaying the flush request in the IO scheduler will delay all flush requests, possibly coming from other processes. So we keep adding the request directly to q->queue_head. Signed-off-by: Jan Kara <jack@suse.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Geliang Tang
To make the intention clearer, use list_{first,prev,next}_entry instead of list_entry. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Jens Axboe <axboe@fb.com>
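The readability gain can be seen in a userspace sketch. The macros below are simplified re-implementations written for this example (the kernel's own definitions in its list header carry more type checking), and `struct item`, `list_init`, `first_id_old` and `first_id_new` are hypothetical names.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Simplified userspace versions of the kernel helpers, for illustration. */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) \
    sketch_container_of(ptr, type, member)
#define list_first_entry(head, type, member) \
    list_entry((head)->next, type, member)

struct item {
    int id;
    struct list_head node;
};

void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Old style: the reader must notice that head->next means "first". */
int first_id_old(struct list_head *head)
{
    return list_entry(head->next, struct item, node)->id;
}

/* New style: the intent is spelled out in the macro name. */
int first_id_new(struct list_head *head)
{
    return list_first_entry(head, struct item, node)->id;
}
```

Both functions return the same element on a non-empty list; only the clarity of the call site differs.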
-
- 12 Nov, 2015 2 commits
-
-
Committed by Randy Dunlap
Fix kernel-doc warning in blk-core.c: Warning(..//block/blk-core.c:1549): No description found for parameter 'same_queue_rq' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Jens Axboe
It's no longer used outside of blk-mq core. Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 08 Nov, 2015 3 commits
-
-
Committed by Jens Axboe
Add basic support for polling for specific IO to complete. This uses the cookie that blk-mq passes back, which enables the block layer to pass this cookie to the driver to spin for a specific request. This will be combined with request latency tracking, so we can make qualified decisions about when to poll and when not to. For now, for benchmark purposes, we add a sysfs file that controls whether polling is enabled or not. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com>
-
Committed by Jens Axboe
Return a cookie, blk_qc_t, from the blk-mq make request functions, that allows a later caller to uniquely identify a specific IO. The cookie doesn't mean anything to the caller, but the caller can use it to later pass back to the block layer. The block layer can then identify the hardware queue and request from that cookie. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com>
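One way such an opaque cookie could carry both a hardware queue number and a request tag is simple bit packing. The sketch below assumes a 32-bit cookie with a 16-bit split; the type `sketch_qc_t`, the shift value and all function names are illustrative assumptions, not the definitions added by the patch.

```c
#include <stdint.h>

/* Hypothetical cookie layout: upper bits = queue number, lower 16 = tag. */
typedef uint32_t sketch_qc_t;

#define SKETCH_QC_SHIFT    16u
#define SKETCH_QC_TAG_MASK ((1u << SKETCH_QC_SHIFT) - 1u)

/* Pack a hardware queue number and a tag into one opaque value. */
sketch_qc_t cookie_make(unsigned int queue_num, unsigned int tag)
{
    return ((sketch_qc_t)queue_num << SKETCH_QC_SHIFT) |
           (tag & SKETCH_QC_TAG_MASK);
}

/* Recover the hardware queue number from the cookie. */
unsigned int cookie_queue(sketch_qc_t cookie)
{
    return cookie >> SKETCH_QC_SHIFT;
}

/* Recover the request tag from the cookie. */
unsigned int cookie_tag(sketch_qc_t cookie)
{
    return cookie & SKETCH_QC_TAG_MASK;
}
```

The submitter would treat the returned value as opaque and hand it back later; only the layer that created it needs to know the layout.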
-
Committed by Jens Axboe
No functional changes in this patch, but it prepares us for returning a more useful cookie related to the IO that was queued up. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com>
-
- 07 Nov, 2015 3 commits
-
-
Committed by Ben Segall
setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of the current pid namespace. This is in contrast to both the other modes of setpriority and the example of kill(-1). Fix this. getpriority and ioprio have the same failure mode, fix them too. Eric said: : After some more thinking about it this patch sounds justifiable. : : My goal with namespaces is not to build perfect isolation mechanisms : as that can get into ill defined territory, but to build well defined : mechanisms. And to handle the corner cases so you can use only : a single namespace with well defined results. : : In this case you have found the two interfaces I am aware of that : identify processes by uid instead of by pid. Which quite frankly is : weird. Unfortunately the weird unexpected cases are hard to handle : in the usual way. : : I was hoping for a little more information. Changes like this one we : have to be careful of because someone might be depending on the current : behavior. I don't think they are and I do think this make sense as part : of the pid namespace. Signed-off-by: Ben Segall <bsegall@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ambrose Feinstein <ambrose@google.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Mel Gorman
__GFP_WAIT was used to signal that the caller was in atomic context and could not sleep. Now it is possible to distinguish between true atomic context and callers that are not willing to sleep. The latter should clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT behaves differently, there is a risk that people will clear the wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate what it does -- setting it allows all reclaim activity, clearing them prevents it. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Mel Gorman
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access to one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimistic allocation with a fallback option can access atomic reserves.
This patch uses __GFP_ATOMIC to identify callers that are truly atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim.
This patch then converts a number of sites:
  o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag.
  o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress.
  o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations.
  o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM.
The first key hazard to watch out for is callers that removed __GFP_WAIT and were depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
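The flag split described in this commit can be modelled with a few bit tests. The bit values and the `SK_`/`sk_` names below are made up for the sketch; only the relationships between the flags follow the commit text (blocking is allowed exactly when direct reclaim is permitted, and kswapd wakeup is a separate bit).

```c
#include <stdbool.h>

/* Hypothetical bit values; the kernel's real values differ. */
#define SK_GFP_ATOMIC         0x1u
#define SK_GFP_DIRECT_RECLAIM 0x2u
#define SK_GFP_KSWAPD_RECLAIM 0x4u
/* Models the redefined __GFP_WAIT / renamed __GFP_RECLAIM: both bits. */
#define SK_GFP_RECLAIM (SK_GFP_DIRECT_RECLAIM | SK_GFP_KSWAPD_RECLAIM)

/* Models gfpflags_allow_blocking(): may the caller sleep in reclaim? */
bool sk_allow_blocking(unsigned int gfp)
{
    return (gfp & SK_GFP_DIRECT_RECLAIM) != 0;
}

/* Will this allocation still wake kswapd for background reclaim? */
bool sk_wakes_kswapd(unsigned int gfp)
{
    return (gfp & SK_GFP_KSWAPD_RECLAIM) != 0;
}

/* A caller unwilling to sleep clears only the direct-reclaim bit. */
unsigned int sk_no_sleep(unsigned int gfp)
{
    return gfp & ~SK_GFP_DIRECT_RECLAIM;
}
```

A mempool-backed caller in this model would pass `sk_no_sleep(SK_GFP_RECLAIM)`: it does not block, it still wakes kswapd, and it does not set the atomic bit.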
-
- 03 Nov, 2015 1 commit
-
-
Committed by Jeff Moyer
Hi, Zhangqing Luo reported long boot times on a system with thousands of LUNs when scsi-mq was enabled. He narrowed the problem down to blk_mq_add_queue_tag_set, where every queue is frozen in order to set the BLK_MQ_F_TAG_SHARED flag. Each added device will freeze all queues added before it in sequence, which involves waiting for an RCU grace period for each one. We don't need to do this. After the second queue is added, only new queues need to be initialized with the shared tag. We can do that by percolating the flag up to the blk_mq_tag_set, and updating the newly added queue's hctxs if the flag is set. This problem was introduced by commit 0d2602ca (blk-mq: improve support for shared tags maps). Reported-and-tested-by: Jason Luo <zhangqing.luo@oracle.com> Reviewed-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 28 Oct, 2015 1 commit
-
-
Committed by Ming Lin
In commit b49a0871("block: remove split code in blkdev_issue_{discard,write_same}"), discard_granularity and alignment checks were removed. Ideally, with bio late splitting, the upper layers shouldn't need to depend on the device's limits. Christoph reported a discard regression on the HGST Ultrastar SN100 NVMe device when running mkfs.xfs. We have not found the root cause yet. This patch re-adds discard_granularity and alignment checks by reverting the related changes in commit b49a0871. The good thing is now we can remove the 2G discard size cap and just use UINT_MAX to avoid bi_size overflow. Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 22 Oct, 2015 19 commits
-
-
Committed by Tejun Heo
The stat files on the root cgroup show stats for the whole system and usually don't contain any information which isn't available through the usual system monitoring mechanisms. Some controllers skip collecting these duplicate stats to optimize cases where cgroup isn't used and later try to emulate the result on demand. This leads to complexities and subtle differences in the information shown through different channels. This is entirely unnecessary and cgroup v2 is dropping stat files which are duplicate from all controllers. This patch removes "io.stat" from the root hierarchy. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Vivek Goyal <vgoyal@redhat.com>
-
Committed by Ming Lei
Most of the time, flush plug should be the hottest I/O path, so mark ctx as pending after all requests in the list are inserted. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Ming Lei
The trace point is for tracing the plug event of each request queue instead of each task, so we should check the request count in the plug list from the current queue instead of the current task. Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Ming Lei
Now that bio splitting has been introduced, a bio can be split and marked as NOMERGE because it is too fat to be merged, so check bio_mergeable() earlier to avoid trying to merge it unnecessarily. Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Ming Lei
It isn't necessary to try to merge a bio which is marked as NOMERGE. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Ming Lei
The split bio is already too fat to merge, so mark it as NOMERGE. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Ming Lei
The value of bio->bi_phys_segments is always obtained during bio splitting, so it is natural to set it up right after bio splitting; then we can avoid computing nr_segment again during merge. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Jeff Moyer
Request queues with merging disabled will not flush the plug list after BLK_MAX_REQUEST_COUNT requests have been queued, since the code relies on blk_attempt_plug_merge to compute the request_count. Fix this by computing the number of queued requests even for nomerge queues. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
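The effect of the missing count can be reproduced with a toy model. Everything here is hypothetical: the threshold of 16, the `struct sketch_plug` layout and the `count_always` switch (which stands for the fixed behaviour) are assumptions chosen to show the control flow, not the block layer's data structures.

```c
#include <stdbool.h>

#define SKETCH_MAX_REQUEST_COUNT 16u  /* assumed threshold for the sketch */

struct sketch_plug {
    unsigned int queued;   /* requests currently held in the plug list */
    unsigned int flushes;  /* how many times the list was flushed */
};

/*
 * Queue one request. In the buggy model the request count is only
 * known when a merge is attempted, so nomerge queues always see 0.
 */
static void sketch_plug_add(struct sketch_plug *p, bool nomerges,
                            bool count_always)
{
    unsigned int request_count = 0;

    if (!nomerges || count_always)
        request_count = p->queued;  /* stand-in for walking the list */

    if (request_count >= SKETCH_MAX_REQUEST_COUNT) {
        p->flushes++;
        p->queued = 0;
    }
    p->queued++;
}

/* Queue n requests and report how many flushes were triggered. */
unsigned int simulate_flushes(unsigned int n, bool nomerges,
                              bool count_always)
{
    struct sketch_plug p = { 0, 0 };

    for (unsigned int i = 0; i < n; i++)
        sketch_plug_add(&p, nomerges, count_always);
    return p.flushes;
}
```

In this model a nomerge queue never flushes unless the count is computed unconditionally, which mirrors the behaviour the commit describes.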
-
Committed by Christoph Hellwig
This commit adds a driver API and ioctls for controlling Persistent Reservations generically at the block layer. Persistent Reservations are supported by SCSI and NVMe and allow controlling who gets access to a device in a shared storage setup. Note that we add a pr_ops structure to struct block_device_operations instead of adding the members directly to avoid bloating all instances of devices that will never support Persistent Reservations. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Christoph Hellwig
Split out helpers for all non-trivial ioctls to make this function simpler, and also start passing around a pointer version of the argument, as that's what most ioctl handlers actually need. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Dan Williams
The libnvdimm-btt and nvme drivers use blk_integrity to reserve space for per-sector metadata, but sometimes without protection checksums. This property is generically useful, so teach the block core to internally specify a nop profile if one is not provided at registration time. Cc: Keith Busch <keith.busch@intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Suggested-by: Christoph Hellwig <hch@lst.de> [hch: kill the local nvme nop profile as well] Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Dan Williams
Since they lack requests to pin the request_queue active, synchronous bio-based drivers may have in-flight integrity work from bio_integrity_endio() that is not flushed by blk_freeze_queue(). Flush that work to prevent races to free the queue and the final usage of the blk_integrity profile. This is temporary unless/until bio-based drivers start to generically take a q_usage_counter reference while a bio is in-flight. Cc: Martin K. Petersen <martin.petersen@oracle.com> [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case] Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Dan Williams
A trace like the following precedes a crash in bio_integrity_process() when it goes to use an already freed blk_integrity profile. BUG: unable to handle kernel paging request at ffff8800d31b10d8 IP: [<ffff8800d31b10d8>] 0xffff8800d31b10d8 PGD 2f65067 PUD 21fffd067 PMD 80000000d30001e3 Oops: 0011 [#1] SMP Dumping ftrace buffer: --------------------------------- ndctl-2222 2.... 44526245us : disk_release: pmem1s systemd--2223 4.... 44573945us : bio_integrity_endio: pmem1s <...>-409 4.... 44574005us : bio_integrity_process: pmem1s --------------------------------- [..] Call Trace: [<ffffffff8144e0f9>] ? bio_integrity_process+0x159/0x2d0 [<ffffffff8144e4f6>] bio_integrity_verify_fn+0x36/0x60 [<ffffffff810bd2dc>] process_one_work+0x1cc/0x4e0 Given that a request_queue is pinned while i/o is in flight and that a gendisk is allowed to have a shorter lifetime, move blk_integrity to request_queue to satisfy requests arriving after the gendisk has been torn down. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case] Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Dan Williams
Allow pmem, and other synchronous/bio-based block drivers, to fall back on a per-cpu reference count managed by the core for tracking queue live/dead state. The existing per-cpu reference count for the blk_mq case is promoted to be used in all block i/o scenarios. This involves initializing it by default, waiting for it to drop to zero at exit, and holding a live reference over the invocation of q->make_request_fn() in generic_make_request(). The blk_mq code continues to take its own reference per blk_mq request and retains the ability to freeze the queue, but the check that the queue is frozen is moved to generic_make_request(). This fixes crash signatures like the following: BUG: unable to handle kernel paging request at ffff880140000000 [..] Call Trace: [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70 [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem] [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem] [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0 [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110 [<ffffffff8141f966>] submit_bio+0x76/0x170 [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160 [<ffffffff81286e62>] submit_bh+0x12/0x20 [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170 [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90 [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270 [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30 [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250 [<ffffffff81303494>] ext4_put_super+0x64/0x360 [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0 Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <keith.busch@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Martin K. Petersen
Up until now the integrity profile has been dynamically allocated and attached to struct gendisk after the disk has been made active. This causes problems because NVMe devices need to register the profile prior to the partition table being read due to a mandatory metadata buffer requirement. In addition, DM goes through hoops to deal with preallocating, but not initializing integrity profiles. Since the integrity profile is small (4 bytes + a pointer), Christoph suggested moving it to struct gendisk proper. This requires several changes: - Moving the blk_integrity definition to genhd.h. - Inlining blk_integrity in struct gendisk. - Removing the dynamic allocation code. - Adding helper functions which allow gendisk to set up and tear down the integrity sysfs dir when a disk is added/deleted. - Adding a blk_integrity_revalidate() callback for updating the stable pages bdi setting. - The calls that depend on whether a device has an integrity profile or not now key off of the bi->profile pointer. - Simplifying the integrity support routines in DM (Mike Snitzer). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Martin K. Petersen
The size of the data interval was not exported in the sysfs integrity directory. Export it so that userland apps can tell whether the interval is different from the device's logical block size. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Martin K. Petersen
The per-device properties in the blk_integrity structure were previously unsigned short. However, most of the values fit inside a char. The only exception is the data interval size and we can work around that by storing it as a power of two. This cuts the size of the dynamic portion of blk_integrity in half. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
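The power-of-two trick mentioned above can be sketched as follows. The function names are hypothetical, and the sketch assumes the interval is itself a power of two (as 512 or 4096 would be); the kernel has its own log2 helper for this purpose.

```c
/*
 * Store a power-of-two interval size as its exponent, which fits in
 * one byte. Assumes interval is a non-zero power of two.
 */
unsigned char interval_to_exp(unsigned int interval)
{
    unsigned char exp = 0;

    while (interval > 1) {
        interval >>= 1;
        exp++;
    }
    return exp;
}

/* Recover the interval size in bytes from the stored exponent. */
unsigned int exp_to_interval(unsigned char exp)
{
    return 1u << exp;
}
```

A 4096-byte interval, which does not fit in an unsigned char directly, is stored as the exponent 12 and expanded again with a shift.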
-
Committed by Martin K. Petersen
We previously made a complete copy of a device's data integrity profile even though several of the fields inside the blk_integrity struct are pointers to fixed template entries in t10-pi.c. Split the static and per-device portions so that we can reference the template directly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Martin K. Petersen
The integrity kobject purely exists to support the integrity subdirectory in sysfs and doesn't really have anything to do with the blk_integrity data structure. Move the kobject to struct gendisk where it belongs. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 15 Oct, 2015 2 commits
-
-
Committed by Tejun Heo
bdi's are initialized in two steps, bdi_init() and bdi_register(), but destroyed in a single step by bdi_destroy() which, for a bdi embedded in a request_queue, is called during blk_cleanup_queue() which makes the queue invisible and starts the draining of remaining usages. A request_queue's user can access the congestion state of the embedded bdi as long as it holds a reference to the queue. As such, it may access the congested state of a queue which finished blk_cleanup_queue() but hasn't reached blk_release_queue() yet. Because the congested state was embedded in backing_dev_info which in turn is embedded in request_queue, accessing the congested state after bdi_destroy() was called was fine. The bdi was destroyed but the memory region for the congested state remained accessible till the queue got released. a13f35e8 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") changed the situation. Now, the root congested state which is expected to be pinned while request_queue remains accessible is separately reference counted and the base ref is put during bdi_destroy(). This means that the root congested state may go away prematurely while the queue is between bdi_destroy() and blk_cleanup_queue(), which was detected by Andrey's KASAN tests. The root cause of this problem is that bdi doesn't distinguish the two steps of destruction, unregistration and release, and now the root congested state actually requires a separate release step. To fix the issue, this patch separates out bdi_unregister() and bdi_exit() from bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue() and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a simple wrapper calling the two steps back-to-back. While at it, the prototype of bdi_destroy() is moved right below bdi_setup_and_register() so that the counterpart operations are located together.
Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: a13f35e8 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") Cc: stable@vger.kernel.org # v4.2+ Reported-and-tested-by: Andrey Konovalov <andreyknvl@google.com> Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com Reviewed-by: Jan Kara <jack@suse.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
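The two-step teardown described in this commit can be outlined with a toy model. The `sketch_` names and the two boolean fields are assumptions made for the example; they stand in for registration state and for the separately reference-counted congested state, and do not reflect the real backing_dev_info layout.

```c
#include <stdbool.h>

/* Toy model of a bdi: visibility and congested-state lifetime only. */
struct sketch_bdi {
    bool registered;       /* visible to the rest of the system */
    bool congested_valid;  /* congested state still safe to access */
};

void sketch_bdi_init(struct sketch_bdi *bdi)
{
    bdi->registered = false;
    bdi->congested_valid = true;
}

void sketch_bdi_register(struct sketch_bdi *bdi)
{
    bdi->registered = true;
}

/* Step 1 (queue cleanup time): hide the bdi, keep congested state. */
void sketch_bdi_unregister(struct sketch_bdi *bdi)
{
    bdi->registered = false;
}

/* Step 2 (queue release time): drop the congested state. */
void sketch_bdi_exit(struct sketch_bdi *bdi)
{
    bdi->congested_valid = false;
}

/* Wrapper for users that do not need the two steps separated. */
void sketch_bdi_destroy(struct sketch_bdi *bdi)
{
    sketch_bdi_unregister(bdi);
    sketch_bdi_exit(bdi);
}
```

Between the two steps a holder of a queue reference can still read the congested state, which is the window the commit message is concerned with.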
-
Committed by Junichi Nomura
tags is freed in blk_mq_free_rq_map() and should not be used after that. The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because free_cpumask_var() is a nop. tags->cpumask is allocated in blk_mq_init_tags() so it's natural to free cpumask in its counterpart, blk_mq_free_tags(). Fixes: f26cdc85 ("blk-mq: Shared tag enhancements") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Keith Busch <keith.busch@intel.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-