- 23 12月, 2015 1 次提交
-
-
由 Junichi Nomura 提交于
blk_queue_bio() does split then bounce, which makes the segment counting based on pages before bouncing and could go wrong. Move the split to after bouncing, like we do for blk-mq, and the we fix the issue of having the bio count for segments be wrong. Fixes: 54efd50b ("block: make generic_make_request handle arbitrarily sized bios") Cc: stable@vger.kernel.org Tested-by: NArtem S. Tashkinov <t.artem@lycos.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 04 12月, 2015 1 次提交
-
-
由 Ken Xue 提交于
The routines in scsi_pm.c assume that if a runtime-PM callback is invoked for a SCSI device, it can only mean that the device's driver has asked the block layer to handle the runtime power management (by calling blk_pm_runtime_init(), which among other things sets q->dev). However, this assumption turns out to be wrong for things like the ses driver. Normally ses devices are not allowed to do runtime PM, but userspace can override this setting. If this happens, the kernel gets a NULL pointer dereference when blk_post_runtime_resume() tries to use the uninitialized q->dev pointer. This patch fixes the problem by checking q->dev in block layer before handle runtime PM. Since ses doesn't define any PM callbacks and call blk_pm_runtime_init(), the crash won't occur. This fixes Bugzilla #101371. https://bugzilla.kernel.org/show_bug.cgi?id=101371 More discussion can be found from below link. http://marc.info/?l=linux-scsi&m=144163730531875&w=2Signed-off-by: NKen Xue <Ken.Xue@amd.com> Acked-by: NAlan Stern <stern@rowland.harvard.edu> Cc: Xiangliang Yu <Xiangliang.Yu@amd.com> Cc: James E.J. Bottomley <JBottomley@odin.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Michael Terry <Michael.terry@canonical.com> Cc: stable@vger.kernel.org Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 03 12月, 2015 1 次提交
-
-
由 Tejun Heo 提交于
Consider the following v2 hierarchy. P0 (+memory) --- P1 (-memory) --- A \- B P0 has memory enabled in its subtree_control while P1 doesn't. If both A and B contain processes, they would belong to the memory css of P1. Now if memory is enabled on P1's subtree_control, memory csses should be created on both A and B and A's processes should be moved to the former and B's processes the latter. IOW, enabling controllers can cause atomic migrations into different csses. The core cgroup migration logic has been updated accordingly but the controller migration methods haven't and still assume that all tasks migrate to a single target css; furthermore, the methods were fed the css in which subtree_control was updated which is the parent of the target csses. pids controller depends on the migration methods to move charges and this made the controller attribute charges to the wrong csses often triggering the following warning by driving a counter negative. WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40() Modules linked in: CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29 ... ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000 ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00 ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8 Call Trace: [<ffffffff81551ffc>] dump_stack+0x4e/0x82 [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0 [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40 [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0 [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330 [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190 [<ffffffff81189016>] cgroup_attach_task+0x176/0x200 [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460 [<ffffffff81189684>] cgroup_procs_write+0x14/0x20 [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0 [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190 [<ffffffff81265f88>] __vfs_write+0x28/0xe0 [<ffffffff812666fc>] vfs_write+0xac/0x1a0 [<ffffffff81267019>] SyS_write+0x49/0xb0 [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76 This patch fixes the bug by removing @css parameter from the three migration methods, ->can_attach, ->cancel_attach() and ->attach() and updating cgroup_taskset iteration helpers also return the destination css in addition to the task being migrated. All controllers are updated accordingly. * Controllers which don't care whether there are one or multiple target csses can be converted trivially. cpu, io, freezer, perf, netclassid and netprio fall in this category. * cpuset's current implementation assumes that there's single source and destination and thus doesn't support v2 hierarchy already. The only change made by this patchset is how that single destination css is obtained. * memory migration path already doesn't do anything on v2. How the single destination css is obtained is updated and the prep stage of mem_cgroup_can_attach() is reordered to accomodate the change. * pids is the only controller which was affected by this bug. It now correctly handles multi-destination migrations and no longer causes counter underflow from incorrect accounting. Signed-off-by: NTejun Heo <tj@kernel.org> Reported-and-tested-by: NDaniel Wagner <daniel.wagner@bmw-carit.de> Cc: Aleksa Sarai <cyphar@cyphar.com>
-
- 01 12月, 2015 1 次提交
-
-
由 Ming Lei 提交于
When bio has only one physical segment, we should set bio's bi_seg_front_size as the real(final) size of the single segment. Fixes: 02e70742(blk-merge: fix blk_bio_segment_split) Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de> Tested-by: NMarkus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: NMing Lei <ming.lei@canonical.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 30 11月, 2015 1 次提交
-
-
由 Hannes Reinecke 提交于
When a cloned request is retried on other queues it always needs to be checked against the queue limits of that queue. Otherwise the calculations for nr_phys_segments might be wrong, leading to a crash in scsi_init_sgtable(). To clarify this the patch renames blk_rq_check_limits() to blk_cloned_rq_check_limits() and removes the symbol export, as the new function should only be used for cloned requests and never exported. Cc: Mike Snitzer <snitzer@redhat.com> Cc: Ewan Milne <emilne@redhat.com> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: NHannes Reinecke <hare@suse.de> Fixes: e2a60da7 ("block: Clean up special command handling logic") Cc: stable@vger.kernel.org # 3.7+ Acked-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 26 11月, 2015 3 次提交
-
-
由 Eric Sandeen 提交于
Today, blockdev --rereadpt /dev/sda will fail with EBUSY if any partition of sda is mounted (and will fail with EINVAL if pointed at a partition). But it will pass if the entire block device is formatted with a filesystem and mounted. I don't think this makes sense; partitioning should surely not ever change out from under a mounted device. So check for bdev->bd_super, and fail that with -EBUSY as well. Signed-off-by: NEric Sandeen <sandeen@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Martin K. Petersen 提交于
Commit 4f258a46 ("sd: Fix maximum I/O size for BLOCK_PC requests") had the unfortunate side-effect of removing an implicit clamp to BLK_DEF_MAX_SECTORS for REQ_TYPE_FS requests in the block layer code. This caused problems for some SMR drives. Debugging this issue revealed a few problems with the existing infrastructure since the block layer didn't know how to deal with device-imposed limits, only limits set by the I/O controller. - Introduce a new queue limit, max_dev_sectors, which is used by the ULD to signal the maximum sectors for a REQ_TYPE_FS request. - Ensure that max_dev_sectors is correctly stacked and taken into account when overriding max_sectors through sysfs. - Rework sd_read_block_limits() so it saves the max_xfer and opt_xfer values for later processing. - In sd_revalidate() set the queue's max_dev_sectors based on the MAXIMUM TRANSFER LENGTH value in the Block Limits VPD. If this value is not reported, fall back to a cap based on the CDB TRANSFER LENGTH field size. - In sd_revalidate(), use OPTIMAL TRANSFER LENGTH from the Block Limits VPD--if reported and sane--to signal the preferred device transfer size for FS requests. Otherwise use BLK_DEF_MAX_SECTORS. - blk_limits_max_hw_sectors() is no longer used and can be removed. Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=93581Reviewed-by: NChristoph Hellwig <hch@lst.de> Tested-by: sweeneygj@gmx.com Tested-by: NArzeets <anatol.pomozov@gmail.com> Tested-by: NDavid Eisner <david.eisner@oriel.oxon.org> Tested-by: NMario Kicherer <dev@kicherer.org> Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
-
由 Jens Axboe 提交于
This reverts commit 1b2ff19e. Jan writes: -- Thanks for report! After some investigation I found out we allocate elevator specific data in __get_request() only for non-flush requests. And this is actually required since the flush machinery uses the space in struct request for something else. Doh. So my patch is just wrong and not easy to fix since at the time __get_request() is called we are not sure whether the flush machinery will be used in the end. Jens, please revert 1b2ff19e. Thanks! I'm somewhat surprised that you can reliably hit the race where flushing gets disabled for the device just while the request is in flight. But I guess during boot it makes some sense. -- So let's just revert it, we can fix the queue run manually after the fact. This race is rare enough that it didn't trigger in testing, it requires the specific disable-while-in-flight scenario to trigger.
-
- 25 11月, 2015 1 次提交
-
-
由 Christoph Hellwig 提交于
We only added the request to the request list for the !blk-mq case, so we should only delete it in that case as well. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 24 11月, 2015 3 次提交
-
-
由 Ming Lei 提交于
We had seen lots of reports of this kind issue, so add one warnning in blk-merge, then it can be triggered easily and avoid to depend on warning/bug from drivers. Signed-off-by: NMing Lei <ming.lei@canonical.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
Commit bdced438(block: setup bi_phys_segments after splitting) introduces function of computing bio->bi_phys_segments during bio splitting. Unfortunately both bio->bi_seg_front_size and bio->bi_seg_back_size arn't computed, so too many physical segments may be obtained for one request since both the two are used to check if one segment across two bios can be possible. This patch fixes the issue by computing the two variables in blk_bio_segment_split(). Fixes: bdced438(block: setup bi_phys_segments after splitting) Reported-by: NMichael Ellerman <mpe@ellerman.id.au> Reported-by: NMark Salter <msalter@redhat.com> Tested-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: NMark Salter <msalter@redhat.com> Signed-off-by: NMing Lei <ming.lei@canonical.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
Inside blk_bio_segment_split(), previous bvec pointer(bvprvp) always points to the iterator local variable, which is obviously wrong, so fix it by pointing to the local variable of 'bvprv'. Fixes: 5014c311(block: fix bogus compiler warnings in blk-merge.c) Cc: stable@kernel.org #4.3 Reported-by: NMichael Ellerman <mpe@ellerman.id.au> Reported-by: NMark Salter <msalter@redhat.com> Tested-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: NMark Salter <msalter@redhat.com> Signed-off-by: NMing Lei <ming.lei@canonical.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 21 11月, 2015 1 次提交
-
-
由 Jens Axboe 提交于
Liu reported that running certain parts of xfstests threw the following error: BUG: sleeping function called from invalid context at mm/page_alloc.c:3190 in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0 3 locks held by kworker/u16:0/6: #0: ("writeback"){++++.+}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730 #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730 #2: (&type->s_umount_key#44){+++++.}, at: [<ffffffff811e6805>] trylock_super+0x25/0x60 CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G OE 4.3.0+ #3 Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011 Workqueue: writeback wb_workfn (flush-btrfs-108) ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab 0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8 ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76 Call Trace: [<ffffffff8130191b>] dump_stack+0x4f/0x74 [<ffffffff8108ed95>] ___might_sleep+0x185/0x240 [<ffffffff8108eea2>] __might_sleep+0x52/0x90 [<ffffffff811817e8>] __alloc_pages_nodemask+0x268/0x410 [<ffffffff8109a43c>] ? sched_clock_local+0x1c/0x90 [<ffffffff8109a6d1>] ? local_clock+0x21/0x40 [<ffffffff810b9eb0>] ? __lock_release+0x420/0x510 [<ffffffff810b534c>] ? __lock_acquired+0x16c/0x3c0 [<ffffffff811ca265>] alloc_pages_current+0xc5/0x210 [<ffffffffa0577105>] ? rbio_is_full+0x55/0x70 [btrfs] [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0 [<ffffffff81666d50>] ? _raw_spin_unlock_irqrestore+0x40/0x60 [<ffffffffa0578c0a>] full_stripe_write+0x5a/0xc0 [btrfs] [<ffffffffa0578ca9>] __raid56_parity_write+0x39/0x60 [btrfs] [<ffffffffa0578deb>] run_plug+0x11b/0x140 [btrfs] [<ffffffffa0578e33>] btrfs_raid_unplug+0x23/0x70 [btrfs] [<ffffffff812d36c2>] blk_flush_plug_list+0x82/0x1f0 [<ffffffff812e0349>] blk_sq_make_request+0x1f9/0x740 [<ffffffff812ceba2>] ? generic_make_request_checks+0x222/0x7c0 [<ffffffff812cf264>] ? blk_queue_enter+0x124/0x310 [<ffffffff812cf1d2>] ? blk_queue_enter+0x92/0x310 [<ffffffff812d0ae2>] generic_make_request+0x172/0x2c0 [<ffffffff812d0ad4>] ? generic_make_request+0x164/0x2c0 [<ffffffff812d0ca0>] submit_bio+0x70/0x140 [<ffffffffa0577b29>] ? rbio_add_io_page+0x99/0x150 [btrfs] [<ffffffffa0578a89>] finish_rmw+0x4d9/0x600 [btrfs] [<ffffffffa0578c4c>] full_stripe_write+0x9c/0xc0 [btrfs] [<ffffffffa057ab7f>] raid56_parity_write+0xef/0x160 [btrfs] [<ffffffffa052bd83>] btrfs_map_bio+0xe3/0x2d0 [btrfs] [<ffffffffa04fbd6d>] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs] [<ffffffffa05173c4>] submit_one_bio+0x74/0xb0 [btrfs] [<ffffffffa0517f55>] submit_extent_page+0xe5/0x1c0 [btrfs] [<ffffffffa0519b18>] __extent_writepage_io+0x408/0x4c0 [btrfs] [<ffffffffa05179c0>] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs] [<ffffffffa051dc88>] __extent_writepage+0x218/0x3a0 [btrfs] [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0 [<ffffffffa051e2c9>] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs] [<ffffffffa051e422>] extent_writepages+0x52/0x70 [btrfs] [<ffffffffa05001f0>] ? btrfs_set_inode_index+0x70/0x70 [btrfs] [<ffffffffa04fcc17>] btrfs_writepages+0x27/0x30 [btrfs] [<ffffffff81184df3>] do_writepages+0x23/0x40 [<ffffffff81212229>] __writeback_single_inode+0x89/0x4d0 [<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480 [<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480 [<ffffffff8121295f>] ? writeback_sb_inodes+0x15f/0x480 [<ffffffff81212ad2>] writeback_sb_inodes+0x2d2/0x480 [<ffffffff810b1397>] ? down_read_trylock+0x57/0x60 [<ffffffff811e6805>] ? trylock_super+0x25/0x60 [<ffffffff810d629f>] ? rcu_read_lock_sched_held+0x4f/0x90 [<ffffffff81212d0c>] __writeback_inodes_wb+0x8c/0xc0 [<ffffffff812130b5>] wb_writeback+0x2b5/0x500 [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0 [<ffffffff810660a8>] ? __local_bh_enable_ip+0x68/0xc0 [<ffffffff81213362>] ? wb_do_writeback+0x62/0x310 [<ffffffff812133c1>] wb_do_writeback+0xc1/0x310 [<ffffffff8107c3d9>] ? set_worker_desc+0x79/0x90 [<ffffffff81213842>] wb_workfn+0x92/0x330 [<ffffffff8107f133>] process_one_work+0x223/0x730 [<ffffffff8107f083>] ? process_one_work+0x173/0x730 [<ffffffff8108035f>] ? worker_thread+0x18f/0x430 [<ffffffff810802ed>] worker_thread+0x11d/0x430 [<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0 [<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0 [<ffffffff810858df>] kthread+0xef/0x110 [<ffffffff8108f74e>] ? schedule_tail+0x1e/0xd0 [<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70 [<ffffffff816673bf>] ret_from_fork+0x3f/0x70 [<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70 The issue is that we've got the software context pinned while calling blk_flush_plug_list(), which flushes callbacks that are allowed to sleep. btrfs and raid has such callbacks. Flip the checks around a bit, so we can enable preempt a bit earlier and flush plugs without having preempt disabled. This only affects blk-mq driven devices, and only those that register a single queue. Reported-by: NLiu Bo <bo.li.liu@oracle.com> Tested-by: NLiu Bo <bo.li.liu@oracle.com> Cc: stable@kernel.org Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 20 11月, 2015 2 次提交
-
-
由 Kees Cook 提交于
If md->signature == MAC_DRIVER_MAGIC and md->block_size == 1023, a single 512 byte sector would be read (secsize / 512). However the partition structure would be located past the end of the buffer (secsize % 512). Signed-off-by: NKees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Dan Williams 提交于
Fix use after free crashes like the following: general protection fault: 0000 [#1] SMP Call Trace: [<ffffffffa0050216>] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem] [<ffffffffa0050ba2>] pmem_rw_page+0x42/0x80 [nd_pmem] [<ffffffff8128fd90>] bdev_read_page+0x50/0x60 [<ffffffff812972f0>] do_mpage_readpage+0x510/0x770 [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20 [<ffffffff811d86dc>] ? lru_cache_add+0x1c/0x50 [<ffffffff81297657>] mpage_readpages+0x107/0x170 [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20 [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20 [<ffffffff8129058d>] blkdev_readpages+0x1d/0x20 [<ffffffff811d615f>] __do_page_cache_readahead+0x28f/0x310 [<ffffffff811d6039>] ? __do_page_cache_readahead+0x169/0x310 [<ffffffff811c5abd>] ? pagecache_get_page+0x2d/0x1d0 [<ffffffff811c76f6>] filemap_fault+0x396/0x530 [<ffffffff811f816e>] __do_fault+0x4e/0xf0 [<ffffffff811fce7d>] handle_mm_fault+0x11bd/0x1b50 Cc: <stable@vger.kernel.org> Cc: Jens Axboe <axboe@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Reported-by: Nkbuild test robot <lkp@intel.com> Acked-by: NMatthew Wilcox <willy@linux.intel.com> [willy: symmetry fixups] Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 17 11月, 2015 2 次提交
-
-
由 Jan Kara 提交于
Currently blk_insert_flush() just adds flush request to q->queue_head when flush is not required. That completely bypasses IO scheduler so e.g. CFQ can be idling waiting for new request to arrive and will idle through the whole window unnecessarily. Luckily this only happens in rare cases as usually checks in generic_make_request_checks() clear FLUSH and FUA flags early if they are not needed. When no flushing is actually required, we can easily fix the problem by properly queueing the request through the IO scheduler. Ideally IO scheduler should be also made aware of requests queued via blk_flush_queue_rq(). However inserting flush request through IO scheduler can have unwanted side-effects since due to flush batching delaying the flush request in IO scheduler will delay all flush requests possibly coming from other processes. So we keep adding the request directly to q->queue_head. Signed-off-by: NJan Kara <jack@suse.com> Reviewed-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Geliang Tang 提交于
To make the intention clearer, use list_{first,prev,next}_entry instead of list_entry. Signed-off-by: NGeliang Tang <geliangtang@163.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 12 11月, 2015 2 次提交
-
-
由 Randy Dunlap 提交于
Fix kernel-doc warning in blk-core.c: Warning(..//block/blk-core.c:1549): No description found for parameter 'same_queue_rq' Signed-off-by: NRandy Dunlap <rdunlap@infradead.org> Reviewed-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
It's no longer used outside of blk-mq core. Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 08 11月, 2015 3 次提交
-
-
由 Jens Axboe 提交于
Add basic support for polling for specific IO to complete. This uses the cookie that blk-mq passes back, which enables the block layer to pass this cookie to the driver to spin for a specific request. This will be combined with request latency tracking, so we can make qualified decisions about when to poll and when not to. For now, for benchmark purposes, we add a sysfs file that controls whether polling is enabled or not. Signed-off-by: NJens Axboe <axboe@fb.com> Acked-by: NChristoph Hellwig <hch@lst.de> Acked-by: NKeith Busch <keith.busch@intel.com>
-
由 Jens Axboe 提交于
Return a cookie, blk_qc_t, from the blk-mq make request functions, that allows a later caller to uniquely identify a specific IO. The cookie doesn't mean anything to the caller, but the caller can use it to later pass back to the block layer. The block layer can then identify the hardware queue and request from that cookie. Signed-off-by: NJens Axboe <axboe@fb.com> Acked-by: NChristoph Hellwig <hch@lst.de> Acked-by: NKeith Busch <keith.busch@intel.com>
-
由 Jens Axboe 提交于
No functional changes in this patch, but it prepares us for returning a more useful cookie related to the IO that was queued up. Signed-off-by: NJens Axboe <axboe@fb.com> Acked-by: NChristoph Hellwig <hch@lst.de> Acked-by: NKeith Busch <keith.busch@intel.com>
-
- 07 11月, 2015 3 次提交
-
-
由 Ben Segall 提交于
setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of the current pid namespace. This is in contrast to both the other modes of setpriority and the example of kill(-1). Fix this. getpriority and ioprio have the same failure mode, fix them too. Eric said: : After some more thinking about it this patch sounds justifiable. : : My goal with namespaces is not to build perfect isolation mechanisms : as that can get into ill defined territory, but to build well defined : mechanisms. And to handle the corner cases so you can use only : a single namespace with well defined results. : : In this case you have found the two interfaces I am aware of that : identify processes by uid instead of by pid. Which quite frankly is : weird. Unfortunately the weird unexpected cases are hard to handle : in the usual way. : : I was hoping for a little more information. Changes like this one we : have to be careful of because someone might be depending on the current : behavior. I don't think they are and I do think this make sense as part : of the pid namespace. Signed-off-by: NBen Segall <bsegall@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ambrose Feinstein <ambrose@google.com> Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
__GFP_WAIT was used to signal that the caller was in atomic context and could not sleep. Now it is possible to distinguish between true atomic context and callers that are not willing to sleep. The latter should clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT behaves differently, there is a risk that people will clear the wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate what it does -- setting it allows all reclaim activity, clearing them prevents it. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 11月, 2015 1 次提交
-
-
由 Jeff Moyer 提交于
Hi, Zhangqing Luo reported long boot times on a system with thousands of LUNs when scsi-mq was enabled. He narrowed the problem down to blk_mq_add_queue_tag_set, where every queue is frozen in order to set the BLK_MQ_F_TAG_SHARED flag. Each added device will freeze all queues added before it in sequence, which involves waiting for an RCU grace period for each one. We don't need to do this. After the second queue is added, only new queues need to be initialized with the shared tag. We can do that by percolating the flag up to the blk_mq_tag_set, and updating the newly added queue's hctxs if the flag is set. This problem was introduced by commit 0d2602ca (blk-mq: improve support for shared tags maps). Reported-and-tested-by: NJason Luo <zhangqing.luo@oracle.com> Reviewed-by: NMing Lei <ming.lei@canonical.com> Signed-off-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 28 10月, 2015 1 次提交
-
-
由 Ming Lin 提交于
In commit b49a0871("block: remove split code in blkdev_issue_{discard,write_same}"), discard_granularity and alignment checks were removed. Ideally, with bio late splitting, the upper layers shouldn't need to depend on device's limits. Christoph reported a discard regression on the HGST Ultrastar SN100 NVMe device when mkfs.xfs. We have not found the root cause yet. This patch re-adds discard_granularity and alignment checks by reverting the related changes in commit b49a0871. The good thing is now we can remove the 2G discard size cap and just use UINT_MAX to avoid bi_size overflow. Reviewed-by: NChristoph Hellwig <hch@lst.de> Tested-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NMing Lin <ming.l@ssi.samsung.com> Reviewed-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 22 10月, 2015 13 次提交
-
-
由 Tejun Heo 提交于
The stat files on the root cgroup shows stats for the whole system and usually don't contain any information which isn't available through the usual system monitoring mechanisms. Some controllers skip collecting these duplicate stats to optimize cases where cgroup isn't used and later try to emulate the result on demand. This leads to complexities and subtle differences in the information shown through different channels. This is entirely unnecessary and cgroup v2 is dropping stat files which are duplicate from all controllers. This patch removes "io.stat" from the root hierarchy. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NJens Axboe <axboe@kernel.dk> Cc: Vivek Goyal <vgoyal@redhat.com>
-
由 Ming Lei 提交于
Most of times, flush plug should be the hottest I/O path, so mark ctx as pending after all requests in the list are inserted. Reviewed-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NMing Lei <ming.lei@canonical.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
The trace point is for tracing plug event of each request queue instead of each task, so we should check the request count in the plug list from current queue instead of current task. Signed-off-by: NMing Lei <ming.lei@canonical.com> Reviewed-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
After bio splitting is introduced, one bio can be splitted and it is marked as NOMERGE because it is too fat to be merged, so check bio_mergeable() earlier to avoid to try to merge it unnecessarily. Signed-off-by: NMing Lei <ming.lei@canonical.com> Reviewed-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
It isn't necessary to try to merge the bio which is marked as NOMERGE. Reviewed-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NMing Lei <ming.lei@canonical.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
The splitted bio has been already too fat to merge, so mark it as NOMERGE. Reviewed-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NMing Lei <ming.lei@canonical.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
The number of bio->bi_phys_segments is always obtained during bio splitting, so it is natural to setup it just after bio splitting, then we can avoid to compute nr_segment again during merge. Reviewed-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NMing Lei <ming.lei@canonical.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jeff Moyer 提交于
Request queues with merging disabled will not flush the plug list after BLK_MAX_REQUEST_COUNT requests have been queued, since the code relies on blk_attempt_plug_merge to compute the request_count. Fix this by computing the number of queued requests even for nomerge queues. Signed-off-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
This commits adds a driver API and ioctls for controlling Persistent Reservations s/genericly/generically/ at the block layer. Persistent Reservations are supported by SCSI and NVMe and allow controlling who gets access to a device in a shared storage setup. Note that we add a pr_ops structure to struct block_device_operations instead of adding the members directly to avoid bloating all instances of devices that will never support Persistent Reservations. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
Split out helpers for all non-trivial ioctls to make this function simpler, and also start passing around a pointer version of the argument, as that's what most ioctl handlers actually need. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Dan Williams 提交于
The libnvidmm-btt and nvme drivers use blk_integrity to reserve space for per-sector metadata, but sometimes without protection checksums. This property is generically useful, so teach the block core to internally specify a nop profile if one is not provided at registration time. Cc: Keith Busch <keith.busch@intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Suggested-by: NChristoph Hellwig <hch@lst.de> [hch: kill the local nvme nop profile as well] Acked-by: NMartin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Dan Williams 提交于
Since they lack requests to pin the request_queue active, synchronous bio-based drivers may have in-flight integrity work from bio_integrity_endio() that is not flushed by blk_freeze_queue(). Flush that work to prevent races to free the queue and the final usage of the blk_integrity profile. This is temporary unless/until bio-based drivers start to generically take a q_usage_counter reference while a bio is in-flight. Cc: Martin K. Petersen <martin.petersen@oracle.com> [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case] Tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Dan Williams 提交于
A trace like the following proceeds a crash in bio_integrity_process() when it goes to use an already freed blk_integrity profile. BUG: unable to handle kernel paging request at ffff8800d31b10d8 IP: [<ffff8800d31b10d8>] 0xffff8800d31b10d8 PGD 2f65067 PUD 21fffd067 PMD 80000000d30001e3 Oops: 0011 [#1] SMP Dumping ftrace buffer: --------------------------------- ndctl-2222 2.... 44526245us : disk_release: pmem1s systemd--2223 4.... 44573945us : bio_integrity_endio: pmem1s <...>-409 4.... 44574005us : bio_integrity_process: pmem1s --------------------------------- [..] Call Trace: [<ffffffff8144e0f9>] ? bio_integrity_process+0x159/0x2d0 [<ffffffff8144e4f6>] bio_integrity_verify_fn+0x36/0x60 [<ffffffff810bd2dc>] process_one_work+0x1cc/0x4e0 Given that a request_queue is pinned while i/o is in flight and that a gendisk is allowed to have a shorter lifetime, move blk_integrity to request_queue to satisfy requests arriving after the gendisk has been torn down. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case] Tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-