1. 11 Jun 2015, 1 commit
    • block: fix ext_devt_lock lockdep report · 4d66e5e9
      Dan Williams authored
       =================================
       [ INFO: inconsistent lock state ]
       4.1.0-rc7+ #217 Tainted: G           O
       ---------------------------------
       inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
       swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
        (ext_devt_lock){+.?...}, at: [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70
       {SOFTIRQ-ON-W} state was registered at:
         [<ffffffff810bf6b1>] __lock_acquire+0x461/0x1e70
         [<ffffffff810c1947>] lock_acquire+0xb7/0x290
         [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
         [<ffffffff8143a07d>] blk_alloc_devt+0x6d/0xd0  <-- take the lock in process context
      [..]
        [<ffffffff810bf64e>] __lock_acquire+0x3fe/0x1e70
        [<ffffffff810c00ad>] ? __lock_acquire+0xe5d/0x1e70
        [<ffffffff810c1947>] lock_acquire+0xb7/0x290
        [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
        [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
        [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
        [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70    <-- take the lock in softirq
        [<ffffffff8143bfec>] part_release+0x1c/0x50
        [<ffffffff8158edf6>] device_release+0x36/0xb0
        [<ffffffff8145ac2b>] kobject_cleanup+0x7b/0x1a0
        [<ffffffff8145aad0>] kobject_put+0x30/0x70
        [<ffffffff8158f147>] put_device+0x17/0x20
        [<ffffffff8143c29c>] delete_partition_rcu_cb+0x16c/0x180
        [<ffffffff8143c130>] ? read_dev_sector+0xa0/0xa0
        [<ffffffff810e0e0f>] rcu_process_callbacks+0x2ff/0xa90
        [<ffffffff810e0dcf>] ? rcu_process_callbacks+0x2bf/0xa90
        [<ffffffff81067e2e>] __do_softirq+0xde/0x600
      
      Neil sees this in his tests and it also triggers on pmem driver unbind
      for the libnvdimm tests.  This fix is on top of an initial fix by Keith
      for incorrect usage of mutex_lock() in this path: 2da78092 "block:
      Fix dev_t minor allocation lifetime".  Both this and 2da78092 are
      candidates for -stable.
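
      As a sketch of the pattern the report is flagging (illustrative functions, not the
      actual genhd.c code): ext_devt_lock is taken with plain spin_lock() in process
      context by blk_alloc_devt(), and also from softirq context via the RCU callback
      that releases a partition. The conventional remedy for that combination is the
      _bh variant on the process-context side, so a softirq cannot interrupt the holder:

      	static DEFINE_SPINLOCK(example_devt_lock);	/* stands in for ext_devt_lock */

      	/* Process context, e.g. the allocation path.  Plain spin_lock() leaves
      	 * softirqs enabled, so the RCU callback below can run on top of us while
      	 * we hold the lock -- the inconsistent-lock-state report above.
      	 * Disabling bottom halves closes that window.
      	 */
      	static void example_alloc(void)
      	{
      		spin_lock_bh(&example_devt_lock);
      		/* ... hand out an extended dev_t ... */
      		spin_unlock_bh(&example_devt_lock);
      	}

      	/* Softirq context, e.g. an RCU callback such as delete_partition_rcu_cb().
      	 * Plain spin_lock() is fine here; only the process-context side needs the
      	 * _bh variant.
      	 */
      	static void example_free(void)
      	{
      		spin_lock(&example_devt_lock);
      		/* ... return the extended dev_t ... */
      		spin_unlock(&example_devt_lock);
      	}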
      
      Fixes: 2da78092 ("block: Fix dev_t minor allocation lifetime")
      Cc: <stable@vger.kernel.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Reported-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  2. 10 Jun 2015, 1 commit
  3. 29 May 2015, 1 commit
  4. 13 May 2015, 1 commit
  5. 05 May 2015, 1 commit
    • blk-mq: don't lose requests if a stopped queue restarts · 9ba52e58
      Shaohua Li authored
      Normally, if the driver is too busy to dispatch a request, the logic is as follows:
      block layer:					driver:
      	__blk_mq_run_hw_queue
      a.						blk_mq_stop_hw_queue
      b.	rq add to ctx->dispatch
      
      later:
      1.						blk_mq_start_hw_queue
      2.	__blk_mq_run_hw_queue
      
      But it's possible that steps 1-2 run between a and b. Since the rq isn't in
      ctx->dispatch yet, step 2 will not run it, and the rq can get lost if no
      subsequent requests come in.
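
      Sketched in code (the dispatch-list handling and the re-kick show the general
      shape of the remedy, not the literal patch):

      	/* driver, on a busy device: */
      	blk_mq_stop_hw_queue(hctx);				/* step a */

      	/*
      	 * If steps 1-2 (blk_mq_start_hw_queue() followed by
      	 * __blk_mq_run_hw_queue()) run right here, they find the
      	 * dispatch list empty and do nothing.
      	 */

      	/* block layer, parking the request for a later run: */
      	spin_lock(&hctx->lock);
      	list_add(&rq->queuelist, &hctx->dispatch);		/* step b */
      	spin_unlock(&hctx->lock);

      	/*
      	 * Remedy, roughly: after parking rq, notice that the queue may
      	 * already have been restarted and kick it again, so rq is not
      	 * stranded until some unrelated request arrives.
      	 */
      	if (!test_bit(BLK_MQ_S_STOPPED, &hctx->state))
      		blk_mq_run_hw_queue(hctx, true);
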
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  6. 28 Apr 2015, 1 commit
    • block: destroy bdi before blockdev is unregistered. · 6cd18e71
      NeilBrown authored
      Because of the peculiar way that md devices are created (automatically
      when the device node is opened), a new device can be created and
      registered immediately after the
      	blk_unregister_region(disk_devt(disk), disk->minors);
      call in del_gendisk().
      
      Therefore it is important that all visible artifacts of the previous
      device are removed before this call.  In particular, the 'bdi'.
      
      Since:
      commit c4db59d3
      Author: Christoph Hellwig <hch@lst.de>
          fs: don't reassign dirty inodes to default_backing_dev_info
      
      moved the
         device_unregister(bdi->dev);
      call from bdi_unregister() to bdi_destroy() it has been quite easy to
      lose a race and have a new (e.g.) "md127" be created after the
      blk_unregister_region() call and before bdi_destroy() is ultimately
      called by the final 'put_disk', which must come after del_gendisk().
      
      The new device finds that the bdi name is already registered in sysfs
      and complains
      
      > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70()
      > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127'
      
      We can fix this by moving the bdi_destroy() call out of
      blk_release_queue() (which can happen very late when a refcount
      reaches zero) and into blk_cleanup_queue() - which happens exactly when the md
      device driver calls it.
      
      Then it is only necessary for md to call blk_cleanup_queue() before
      del_gendisk().  As loop.c devices are also created on demand by
      opening the device node, we make the same change there.
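
      The resulting teardown ordering on the driver side looks roughly like this
      (sketch, with the rest of the md/loop teardown omitted):

      	blk_cleanup_queue(q);	/* now also destroys the bdi, so its sysfs
      				 * name ("9:127") is gone before the minor
      				 * is released */
      	del_gendisk(disk);	/* blk_unregister_region() runs here; a new
      				 * md127 created right after this no longer
      				 * collides with the old bdi entry */
      	put_disk(disk);		/* final reference; blk_release_queue() no
      				 * longer has to destroy the bdi this late */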
      
      Fixes: c4db59d3 ("fs: don't reassign dirty inodes to default_backing_dev_info")
      Reported-by: Azat Khuzhin <a3at.mail@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org (v4.0)
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  7. 27 Apr 2015, 1 commit
    • block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE · 393a3397
      Wang YanQing authored
      Commit d2c5e30c
      ("[PATCH] zoned vm counters: conversion of nr_bounce to per zone counter")
      converted the nr_bounce statistic to a per-zone counter plus one global value in
      vm_stat, but it calls inc_|dec_zone_page_state on different pages, and therefore
      different zones, which causes us to see unexpected values of NR_BOUNCE.
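
      The invariant being restored is simply that the increment and the decrement hit
      the same page, and therefore the same zone counter. Hand-wavy sketch
      (bounce_page is a placeholder, not the actual block/bounce.c diff):

      	inc_zone_page_state(bounce_page, NR_BOUNCE);	/* zone of bounce_page goes up */
      	/* ... bounce I/O runs ... */
      	dec_zone_page_state(bounce_page, NR_BOUNCE);	/* same page, so the same zone
      							 * comes back down */

      	/*
      	 * Charging one page and uncharging another (e.g. the original highmem
      	 * page) adjusts two different zones, which is how HighMem below ends
      	 * up with bounce:75771152kB while lowmem and the global count stay 0.
      	 */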
      
      Below is the result on my machine:
      Mar  2 09:26:08 udknight kernel: [144766.778265] Mem-Info:
      Mar  2 09:26:08 udknight kernel: [144766.778266] DMA per-cpu:
      Mar  2 09:26:08 udknight kernel: [144766.778268] CPU    0: hi:    0, btch:   1 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778269] CPU    1: hi:    0, btch:   1 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778270] Normal per-cpu:
      Mar  2 09:26:08 udknight kernel: [144766.778271] CPU    0: hi:  186, btch:  31 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778273] CPU    1: hi:  186, btch:  31 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778274] HighMem per-cpu:
      Mar  2 09:26:08 udknight kernel: [144766.778275] CPU    0: hi:  186, btch:  31 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778276] CPU    1: hi:  186, btch:  31 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778279] active_anon:46926 inactive_anon:287406 isolated_anon:0
      Mar  2 09:26:08 udknight kernel: [144766.778279]  active_file:105085 inactive_file:139432 isolated_file:0
      Mar  2 09:26:08 udknight kernel: [144766.778279]  unevictable:653 dirty:0 writeback:0 unstable:0
      Mar  2 09:26:08 udknight kernel: [144766.778279]  free:178957 slab_reclaimable:6419 slab_unreclaimable:9966
      Mar  2 09:26:08 udknight kernel: [144766.778279]  mapped:4426 shmem:305277 pagetables:784 bounce:0
      Mar  2 09:26:08 udknight kernel: [144766.778279]  free_cma:0
      Mar  2 09:26:08 udknight kernel: [144766.778286] DMA free:3324kB min:68kB low:84kB high:100kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15976kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      Mar  2 09:26:08 udknight kernel: [144766.778287] lowmem_reserve[]: 0 822 3754 3754
      Mar  2 09:26:08 udknight kernel: [144766.778293] Normal free:26828kB min:3632kB low:4540kB high:5448kB active_anon:4872kB inactive_anon:68kB active_file:1796kB inactive_file:1796kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:892920kB managed:842560kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:4144kB slab_reclaimable:25676kB slab_unreclaimable:39864kB kernel_stack:1944kB pagetables:3136kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2412612 all_unreclaimable? yes
      Mar  2 09:26:08 udknight kernel: [144766.778294] lowmem_reserve[]: 0 0 23451 23451
      Mar  2 09:26:08 udknight kernel: [144766.778299] HighMem free:685676kB min:512kB low:3748kB high:6984kB active_anon:182832kB inactive_anon:1149556kB active_file:418544kB inactive_file:555932kB unevictable:2612kB isolated(anon):0kB isolated(file):0kB present:3001732kB managed:3001732kB mlocked:0kB dirty:0kB writeback:0kB mapped:17704kB shmem:1216964kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:75771152kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      Mar  2 09:26:08 udknight kernel: [144766.778300] lowmem_reserve[]: 0 0 0 0
      
      You can see bounce:75771152kB for HighMem, but bounce:0 for lowmem and global.
      
      This patch fixes it.
      Signed-off-by: Wang YanQing <udknight@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  8. 24 Apr 2015, 3 commits
  9. 17 Apr 2015, 1 commit
  10. 16 Apr 2015, 1 commit
    • blk-mq: reduce unnecessary software queue looping · 889fa31f
      Chong Yuan authored
      In flush_busy_ctxs() and blk_mq_hctx_has_pending(), regardless of how many
      ctxs are assigned to one hctx, they will loop hctx->ctx_map.map_size times,
      where hctx->ctx_map.map_size is the constant ALIGN(nr_cpu_ids, 8) / 8.
      flush_busy_ctxs() in particular is in a hot code path, and the extra iterations
      are unnecessary. Change ->map_size to reflect the number of software queues
      actually mapped, so we only loop for as many iterations as we have to.
      
      Also remove the cpumask setting and nr_ctx count in blk_mq_init_cpu_queues(),
      since they are all redone in blk_mq_map_swqueue().
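
      The loop shape, sketched (handle_word() is a placeholder; the real functions
      walk hctx->ctx_map.map and pull requests off the busy ctxs):

      	static void visit_mapped_ctx_words(struct blk_mq_hw_ctx *hctx)
      	{
      		unsigned int i;

      		/*
      		 * Previously the bound was the constant ALIGN(nr_cpu_ids, 8) / 8;
      		 * now ->map_size covers only the words that actually have
      		 * software queues mapped to this hctx.
      		 */
      		for (i = 0; i < hctx->ctx_map.map_size; i++)
      			handle_word(&hctx->ctx_map.map[i]);
      	}
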
      Signed-off-by: Chong Yuan <chong.yuan@memblaze.com>
      Reviewed-by: Wenbo Wang <wenbo.wang@memblaze.com>
      
      Updated by me for formatting and commenting.
      Signed-off-by: Jens Axboe <axboe@fb.com>
  11. 12 Apr 2015, 3 commits
  12. 31 Mar 2015, 1 commit
    • block: fix blk_stack_limits() regression due to lcm() change · e9637415
      Mike Snitzer authored
      Linux 3.19 commit 69c953c8 ("lib/lcm.c: lcm(n,0)=lcm(0,n) is 0, not n")
      caused blk_stack_limits() to not properly stack queue_limits for stacked
      devices (e.g. DM).
      
      Fix this regression by establishing lcm_not_zero() and switching
      blk_stack_limits() over to using it.
      
      DM uses blk_set_stacking_limits() to establish the initial top-level
      queue_limits that are then built up based on underlying devices' limits
      using blk_stack_limits().  In the case of optimal_io_size (io_opt)
      blk_set_stacking_limits() establishes a default value of 0.  With commit
      69c953c8, lcm(0, n) is no longer n, which compromises proper stacking of
      the underlying devices' io_opt.
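
      The intended semantics of the helper, sketched (see lib/lcm.c for the real one):
      treat a 0 limit as "no constraint" instead of letting it zero out the stacked result.

      	/* After 69c953c8, lcm(0, n) == lcm(n, 0) == 0. */
      	static unsigned long lcm_not_zero_sketch(unsigned long a, unsigned long b)
      	{
      		unsigned long l = lcm(a, b);

      		if (l)
      			return l;
      		return a ? a : b;	/* one side is 0: keep the other side's limit */
      	}

      	/*
      	 * blk_stack_limits() stacks io_opt with this, so DM's top-level default
      	 * of 0 no longer wipes out the underlying device's 786432.
      	 */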
      
      Test:
      $ modprobe scsi_debug dev_size_mb=10 num_tgts=1 opt_blks=1536
      $ cat /sys/block/sde/queue/optimal_io_size
      786432
      $ dmsetup create node --table "0 100 linear /dev/sde 0"
      
      Before this fix:
      $ cat /sys/block/dm-5/queue/optimal_io_size
      0
      
      After this fix:
      $ cat /sys/block/dm-5/queue/optimal_io_size
      786432
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.19+
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  13. 30 Mar 2015, 2 commits
  14. 25 Mar 2015, 1 commit
  15. 20 Mar 2015, 1 commit
  16. 19 Mar 2015, 1 commit
  17. 13 Mar 2015, 4 commits
  18. 21 Feb 2015, 1 commit
    • blk-throttle: check stats_cpu before reading it from sysfs · 045c47ca
      Thadeu Lima de Souza Cascardo authored
      When reading blkio.throttle.io_serviced in a recently created blkio
      cgroup, it's possible to race against the creation of a throttle policy,
      which delays the allocation of stats_cpu.
      
      Like other functions in the throttle code, just checking for a NULL
      stats_cpu prevents the following oops caused by that race.
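
      The shape of the guard, sketched (same pattern as the other NULL checks in
      blk-throttle, not necessarily the exact hunk):

      	/* in tg_prfill_cpu_rwstat(), before touching per-cpu data: */
      	if (tg->stats_cpu == NULL)	/* policy init has not allocated it yet */
      		return 0;		/* report nothing rather than oops */

      	for_each_possible_cpu(cpu) {
      		struct tg_stats_cpu *sc = per_cpu_ptr(tg->stats_cpu, cpu);
      		/* ... accumulate sc's rwstat counters as before ... */
      	}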
      
      [ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
      [ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
      [ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
      [ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
      [ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
      [ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
      [ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
      [ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
      [ 1137.734202] REGS: c000000f1d213500 TRAP: 0300   Not tainted  (3.19.0)
      [ 1137.734230] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI>  CR: 42008884  XER: 20000000
      [ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
      GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
      GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
      GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
      GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
      GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
      GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
      GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
      [ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
      [ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
      [ 1137.734943] Call Trace:
      [ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
      [ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
      [ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
      [ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
      [ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
      [ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
      [ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
      [ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
      [ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
      [ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
      [ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
      [ 1137.735383] Instruction dump:
      [ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
      [ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02a> e9090008 e9490010 e9290018
      
      And here is some code that makes it easy to reproduce this, although it was
      first found by running docker.
      
      #define _GNU_SOURCE		/* O_DIRECT */
      #include <fcntl.h>		/* open */
      #include <malloc.h>		/* memalign */
      #include <stdio.h>		/* snprintf */
      #include <stdlib.h>		/* free, exit */
      #include <sys/stat.h>		/* mkdir */
      #include <sys/types.h>		/* pid_t */
      #include <sys/wait.h>		/* wait */
      #include <unistd.h>		/* fork, read, write, close, rmdir, getpid */

      /* The original reproducer leaves these undefined; example values only. */
      #define BUFFER_ALIGN	4096
      #define BUFFER_SIZE	4096
      #define CGPATH		"/sys/fs/cgroup/blkio"
      #define NR_TESTS	1000

      void run(pid_t pid)
      {
      	int n;
      	int status;
      	int fd;
      	char *buffer;
      	buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
      	n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
      	fd = open(CGPATH "/test/tasks", O_WRONLY);
      	write(fd, buffer, n);
      	close(fd);
      	if (fork() > 0) {
      		fd = open("/dev/sda", O_RDONLY | O_DIRECT);
      		read(fd, buffer, 512);
      		close(fd);
      		wait(&status);
      	} else {
      		fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
      		n = read(fd, buffer, BUFFER_SIZE);
      		close(fd);
      	}
      	free(buffer);
      	exit(0);
      }
      
      void test(void)
      {
      	int status;
      	mkdir(CGPATH "/test", 0666);
      	if (fork() > 0)
      		wait(&status);
      	else
      		run(getpid());
      	rmdir(CGPATH "/test");
      }
      
      int main(int argc, char **argv)
      {
      	int i;
      	for (i = 0; i < NR_TESTS; i++)
      		test();
      	return 0;
      }
      Reported-by: Ricardo Marin Matinata <rmm@br.ibm.com>
      Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
  19. 12 Feb 2015, 4 commits
  20. 11 Feb 2015, 1 commit
  21. 10 Feb 2015, 1 commit
  22. 06 Feb 2015, 8 commits