1. 11 Jun 2015, 1 commit
    • block: fix ext_devt_lock lockdep report · 4d66e5e9
      Dan Williams authored
       =================================
       [ INFO: inconsistent lock state ]
       4.1.0-rc7+ #217 Tainted: G           O
       ---------------------------------
       inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
       swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
        (ext_devt_lock){+.?...}, at: [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70
       {SOFTIRQ-ON-W} state was registered at:
         [<ffffffff810bf6b1>] __lock_acquire+0x461/0x1e70
         [<ffffffff810c1947>] lock_acquire+0xb7/0x290
         [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
         [<ffffffff8143a07d>] blk_alloc_devt+0x6d/0xd0  <-- take the lock in process context
      [..]
        [<ffffffff810bf64e>] __lock_acquire+0x3fe/0x1e70
        [<ffffffff810c00ad>] ? __lock_acquire+0xe5d/0x1e70
        [<ffffffff810c1947>] lock_acquire+0xb7/0x290
        [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
        [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
        [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
        [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70    <-- take the lock in softirq
        [<ffffffff8143bfec>] part_release+0x1c/0x50
        [<ffffffff8158edf6>] device_release+0x36/0xb0
        [<ffffffff8145ac2b>] kobject_cleanup+0x7b/0x1a0
        [<ffffffff8145aad0>] kobject_put+0x30/0x70
        [<ffffffff8158f147>] put_device+0x17/0x20
        [<ffffffff8143c29c>] delete_partition_rcu_cb+0x16c/0x180
        [<ffffffff8143c130>] ? read_dev_sector+0xa0/0xa0
        [<ffffffff810e0e0f>] rcu_process_callbacks+0x2ff/0xa90
        [<ffffffff810e0dcf>] ? rcu_process_callbacks+0x2bf/0xa90
        [<ffffffff81067e2e>] __do_softirq+0xde/0x600
      
      Neil sees this in his tests and it also triggers on pmem driver unbind
      for the libnvdimm tests.  This fix is on top of an initial fix by Keith
      for incorrect usage of mutex_lock() in this path: 2da78092 "block:
      Fix dev_t minor allocation lifetime".  Both this and 2da78092 are
      candidates for -stable.
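
      As a sketch of the pattern the report is flagging (illustrative functions, not the
      actual genhd.c code): ext_devt_lock is taken with plain spin_lock() in process
      context by blk_alloc_devt(), and also from softirq context via the RCU callback
      that releases a partition. The conventional remedy for that combination is the
      _bh variant on the process-context side, so a softirq cannot interrupt the holder:

      	static DEFINE_SPINLOCK(example_devt_lock);	/* stands in for ext_devt_lock */

      	/* Process context, e.g. the allocation path.  Plain spin_lock() leaves
      	 * softirqs enabled, so the RCU callback below can run on top of us while
      	 * we hold the lock -- the inconsistent-lock-state report above.
      	 * Disabling bottom halves closes that window.
      	 */
      	static void example_alloc(void)
      	{
      		spin_lock_bh(&example_devt_lock);
      		/* ... hand out an extended dev_t ... */
      		spin_unlock_bh(&example_devt_lock);
      	}

      	/* Softirq context, e.g. an RCU callback such as delete_partition_rcu_cb().
      	 * Plain spin_lock() is fine here; only the process-context side needs the
      	 * _bh variant.
      	 */
      	static void example_free(void)
      	{
      		spin_lock(&example_devt_lock);
      		/* ... return the extended dev_t ... */
      		spin_unlock(&example_devt_lock);
      	}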
      
      Fixes: 2da78092 ("block: Fix dev_t minor allocation lifetime")
      Cc: <stable@vger.kernel.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Reported-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  2. 10 Jun 2015, 1 commit
  3. 29 May 2015, 1 commit
  4. 13 May 2015, 1 commit
  5. 05 May 2015, 1 commit
    • blk-mq: don't lose requests if a stopped queue restarts · 9ba52e58
      Shaohua Li authored
      Normally, if the driver is too busy to dispatch a request, the logic is as follows:
      block layer:					driver:
      	__blk_mq_run_hw_queue
      a.						blk_mq_stop_hw_queue
      b.	rq add to ctx->dispatch
      
      later:
      1.						blk_mq_start_hw_queue
      2.	__blk_mq_run_hw_queue
      
      But it's possible that steps 1-2 run between a and b. Since the rq isn't in
      ctx->dispatch yet, step 2 will not run it, and the rq can get lost if no
      subsequent requests come in.
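
      Sketched in code (the dispatch-list handling and the re-kick show the general
      shape of the remedy, not the literal patch):

      	/* driver, on a busy device: */
      	blk_mq_stop_hw_queue(hctx);				/* step a */

      	/*
      	 * If steps 1-2 (blk_mq_start_hw_queue() followed by
      	 * __blk_mq_run_hw_queue()) run right here, they find the
      	 * dispatch list empty and do nothing.
      	 */

      	/* block layer, parking the request for a later run: */
      	spin_lock(&hctx->lock);
      	list_add(&rq->queuelist, &hctx->dispatch);		/* step b */
      	spin_unlock(&hctx->lock);

      	/*
      	 * Remedy, roughly: after parking rq, notice that the queue may
      	 * already have been restarted and kick it again, so rq is not
      	 * stranded until some unrelated request arrives.
      	 */
      	if (!test_bit(BLK_MQ_S_STOPPED, &hctx->state))
      		blk_mq_run_hw_queue(hctx, true);
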
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  6. 28 Apr 2015, 1 commit
    • block: destroy bdi before blockdev is unregistered. · 6cd18e71
      NeilBrown authored
      Because of the peculiar way that md devices are created (automatically
      when the device node is opened), a new device can be created and
      registered immediately after the
      	blk_unregister_region(disk_devt(disk), disk->minors);
      call in del_gendisk().
      
      Therefore it is important that all visible artifacts of the previous
      device are removed before this call.  In particular, the 'bdi'.
      
      Since:
      commit c4db59d3
      Author: Christoph Hellwig <hch@lst.de>
          fs: don't reassign dirty inodes to default_backing_dev_info
      
      moved the
         device_unregister(bdi->dev);
      call from bdi_unregister() to bdi_destroy() it has been quite easy to
      lose a race and have a new (e.g.) "md127" be created after the
      blk_unregister_region() call and before bdi_destroy() is ultimately
      called by the final 'put_disk', which must come after del_gendisk().
      
      The new device finds that the bdi name is already registered in sysfs
      and complains
      
      > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70()
      > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127'
      
      We can fix this by moving the bdi_destroy() call out of
      blk_release_queue() (which can happen very late when a refcount
      reaches zero) and into blk_cleanup_queue() - which happens exactly when the md
      device driver calls it.
      
      Then it is only necessary for md to call blk_cleanup_queue() before
      del_gendisk().  As loop.c devices are also created on demand by
      opening the device node, we make the same change there.
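
      The resulting teardown ordering on the driver side looks roughly like this
      (sketch, with the rest of the md/loop teardown omitted):

      	blk_cleanup_queue(q);	/* now also destroys the bdi, so its sysfs
      				 * name ("9:127") is gone before the minor
      				 * is released */
      	del_gendisk(disk);	/* blk_unregister_region() runs here; a new
      				 * md127 created right after this no longer
      				 * collides with the old bdi entry */
      	put_disk(disk);		/* final reference; blk_release_queue() no
      				 * longer has to destroy the bdi this late */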
      
      Fixes: c4db59d3 ("fs: don't reassign dirty inodes to default_backing_dev_info")
      Reported-by: Azat Khuzhin <a3at.mail@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org (v4.0)
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  7. 27 Apr 2015, 1 commit
    • block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE · 393a3397
      Wang YanQing authored
      Commit d2c5e30c
      ("[PATCH] zoned vm counters: conversion of nr_bounce to per zone counter")
      converted the nr_bounce statistic to a per-zone counter plus one global value in
      vm_stat, but it calls inc_|dec_zone_page_state on different pages, and therefore
      different zones, which causes us to see unexpected values of NR_BOUNCE.
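
      The invariant being restored is simply that the increment and the decrement hit
      the same page, and therefore the same zone counter. Hand-wavy sketch
      (bounce_page is a placeholder, not the actual block/bounce.c diff):

      	inc_zone_page_state(bounce_page, NR_BOUNCE);	/* zone of bounce_page goes up */
      	/* ... bounce I/O runs ... */
      	dec_zone_page_state(bounce_page, NR_BOUNCE);	/* same page, so the same zone
      							 * comes back down */

      	/*
      	 * Charging one page and uncharging another (e.g. the original highmem
      	 * page) adjusts two different zones, which is how HighMem below ends
      	 * up with bounce:75771152kB while lowmem and the global count stay 0.
      	 */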
      
      Below is the result on my machine:
      Mar  2 09:26:08 udknight kernel: [144766.778265] Mem-Info:
      Mar  2 09:26:08 udknight kernel: [144766.778266] DMA per-cpu:
      Mar  2 09:26:08 udknight kernel: [144766.778268] CPU    0: hi:    0, btch:   1 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778269] CPU    1: hi:    0, btch:   1 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778270] Normal per-cpu:
      Mar  2 09:26:08 udknight kernel: [144766.778271] CPU    0: hi:  186, btch:  31 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778273] CPU    1: hi:  186, btch:  31 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778274] HighMem per-cpu:
      Mar  2 09:26:08 udknight kernel: [144766.778275] CPU    0: hi:  186, btch:  31 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778276] CPU    1: hi:  186, btch:  31 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778279] active_anon:46926 inactive_anon:287406 isolated_anon:0
      Mar  2 09:26:08 udknight kernel: [144766.778279]  active_file:105085 inactive_file:139432 isolated_file:0
      Mar  2 09:26:08 udknight kernel: [144766.778279]  unevictable:653 dirty:0 writeback:0 unstable:0
      Mar  2 09:26:08 udknight kernel: [144766.778279]  free:178957 slab_reclaimable:6419 slab_unreclaimable:9966
      Mar  2 09:26:08 udknight kernel: [144766.778279]  mapped:4426 shmem:305277 pagetables:784 bounce:0
      Mar  2 09:26:08 udknight kernel: [144766.778279]  free_cma:0
      Mar  2 09:26:08 udknight kernel: [144766.778286] DMA free:3324kB min:68kB low:84kB high:100kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15976kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      Mar  2 09:26:08 udknight kernel: [144766.778287] lowmem_reserve[]: 0 822 3754 3754
      Mar  2 09:26:08 udknight kernel: [144766.778293] Normal free:26828kB min:3632kB low:4540kB high:5448kB active_anon:4872kB inactive_anon:68kB active_file:1796kB inactive_file:1796kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:892920kB managed:842560kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:4144kB slab_reclaimable:25676kB slab_unreclaimable:39864kB kernel_stack:1944kB pagetables:3136kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2412612 all_unreclaimable? yes
      Mar  2 09:26:08 udknight kernel: [144766.778294] lowmem_reserve[]: 0 0 23451 23451
      Mar  2 09:26:08 udknight kernel: [144766.778299] HighMem free:685676kB min:512kB low:3748kB high:6984kB active_anon:182832kB inactive_anon:1149556kB active_file:418544kB inactive_file:555932kB unevictable:2612kB isolated(anon):0kB isolated(file):0kB present:3001732kB managed:3001732kB mlocked:0kB dirty:0kB writeback:0kB mapped:17704kB shmem:1216964kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:75771152kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      Mar  2 09:26:08 udknight kernel: [144766.778300] lowmem_reserve[]: 0 0 0 0
      
      You can see bounce:75771152kB for HighMem, but bounce:0 for lowmem and global.
      
      This patch fixes it.
      Signed-off-by: Wang YanQing <udknight@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  8. 24 Apr 2015, 3 commits
  9. 17 Apr 2015, 1 commit
  10. 16 Apr 2015, 1 commit
    • blk-mq: reduce unnecessary software queue looping · 889fa31f
      Chong Yuan authored
      In flush_busy_ctxs() and blk_mq_hctx_has_pending(), regardless of how many
      ctxs are assigned to one hctx, they will loop hctx->ctx_map.map_size times,
      where hctx->ctx_map.map_size is the constant ALIGN(nr_cpu_ids, 8) / 8.
      flush_busy_ctxs() in particular is in a hot code path, and the extra iterations
      are unnecessary. Change ->map_size to reflect the number of software queues
      actually mapped, so we only loop for as many iterations as we have to.
      
      Also remove the cpumask setting and nr_ctx count in blk_mq_init_cpu_queues(),
      since they are all redone in blk_mq_map_swqueue().
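
      The loop shape, sketched (handle_word() is a placeholder; the real functions
      walk hctx->ctx_map.map and pull requests off the busy ctxs):

      	static void visit_mapped_ctx_words(struct blk_mq_hw_ctx *hctx)
      	{
      		unsigned int i;

      		/*
      		 * Previously the bound was the constant ALIGN(nr_cpu_ids, 8) / 8;
      		 * now ->map_size covers only the words that actually have
      		 * software queues mapped to this hctx.
      		 */
      		for (i = 0; i < hctx->ctx_map.map_size; i++)
      			handle_word(&hctx->ctx_map.map[i]);
      	}
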
      Signed-off-by: Chong Yuan <chong.yuan@memblaze.com>
      Reviewed-by: Wenbo Wang <wenbo.wang@memblaze.com>
      
      Updated by me for formatting and commenting.
      Signed-off-by: Jens Axboe <axboe@fb.com>
  11. 12 Apr 2015, 3 commits
  12. 31 Mar 2015, 1 commit
    • block: fix blk_stack_limits() regression due to lcm() change · e9637415
      Mike Snitzer authored
      Linux 3.19 commit 69c953c8 ("lib/lcm.c: lcm(n,0)=lcm(0,n) is 0, not n")
      caused blk_stack_limits() to not properly stack queue_limits for stacked
      devices (e.g. DM).
      
      Fix this regression by establishing lcm_not_zero() and switching
      blk_stack_limits() over to using it.
      
      DM uses blk_set_stacking_limits() to establish the initial top-level
      queue_limits that are then built up based on underlying devices' limits
      using blk_stack_limits().  In the case of optimal_io_size (io_opt)
      blk_set_stacking_limits() establishes a default value of 0.  With commit
      69c953c8, lcm(0, n) is no longer n, which compromises proper stacking of
      the underlying devices' io_opt.
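
      The intended semantics of the helper, sketched (see lib/lcm.c for the real one):
      treat a 0 limit as "no constraint" instead of letting it zero out the stacked result.

      	/* After 69c953c8, lcm(0, n) == lcm(n, 0) == 0. */
      	static unsigned long lcm_not_zero_sketch(unsigned long a, unsigned long b)
      	{
      		unsigned long l = lcm(a, b);

      		if (l)
      			return l;
      		return a ? a : b;	/* one side is 0: keep the other side's limit */
      	}

      	/*
      	 * blk_stack_limits() stacks io_opt with this, so DM's top-level default
      	 * of 0 no longer wipes out the underlying device's 786432.
      	 */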
      
      Test:
      $ modprobe scsi_debug dev_size_mb=10 num_tgts=1 opt_blks=1536
      $ cat /sys/block/sde/queue/optimal_io_size
      786432
      $ dmsetup create node --table "0 100 linear /dev/sde 0"
      
      Before this fix:
      $ cat /sys/block/dm-5/queue/optimal_io_size
      0
      
      After this fix:
      $ cat /sys/block/dm-5/queue/optimal_io_size
      786432
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.19+
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  13. 30 Mar 2015, 2 commits
  14. 25 Mar 2015, 1 commit
  15. 20 Mar 2015, 1 commit
  16. 19 Mar 2015, 1 commit
  17. 13 Mar 2015, 4 commits
  18. 21 Feb 2015, 1 commit
    • blk-throttle: check stats_cpu before reading it from sysfs · 045c47ca
      Thadeu Lima de Souza Cascardo authored
      When reading blkio.throttle.io_serviced in a recently created blkio
      cgroup, it's possible to race against the creation of a throttle policy,
      which delays the allocation of stats_cpu.
      
      Like other functions in the throttle code, just checking for a NULL
      stats_cpu prevents the following oops caused by that race.
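
      The shape of the guard, sketched (same pattern as the other NULL checks in
      blk-throttle, not necessarily the exact hunk):

      	/* in tg_prfill_cpu_rwstat(), before touching per-cpu data: */
      	if (tg->stats_cpu == NULL)	/* policy init has not allocated it yet */
      		return 0;		/* report nothing rather than oops */

      	for_each_possible_cpu(cpu) {
      		struct tg_stats_cpu *sc = per_cpu_ptr(tg->stats_cpu, cpu);
      		/* ... accumulate sc's rwstat counters as before ... */
      	}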
      
      [ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
      [ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
      [ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
      [ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
      [ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
      [ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
      [ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
      [ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
      [ 1137.734202] REGS: c000000f1d213500 TRAP: 0300   Not tainted  (3.19.0)
      [ 1137.734230] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI>  CR: 42008884  XER: 20000000
      [ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
      GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
      GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
      GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
      GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
      GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
      GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
      GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
      [ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
      [ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
      [ 1137.734943] Call Trace:
      [ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
      [ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
      [ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
      [ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
      [ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
      [ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
      [ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
      [ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
      [ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
      [ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
      [ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
      [ 1137.735383] Instruction dump:
      [ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
      [ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02a> e9090008 e9490010 e9290018
      
      And here is some code that makes it easy to reproduce this, although it was
      first found by running docker.
      
      #define _GNU_SOURCE		/* O_DIRECT */
      #include <fcntl.h>		/* open */
      #include <malloc.h>		/* memalign */
      #include <stdio.h>		/* snprintf */
      #include <stdlib.h>		/* free, exit */
      #include <sys/stat.h>		/* mkdir */
      #include <sys/types.h>		/* pid_t */
      #include <sys/wait.h>		/* wait */
      #include <unistd.h>		/* fork, read, write, close, rmdir, getpid */

      /* The original reproducer leaves these undefined; example values only. */
      #define BUFFER_ALIGN	4096
      #define BUFFER_SIZE	4096
      #define CGPATH		"/sys/fs/cgroup/blkio"
      #define NR_TESTS	1000

      void run(pid_t pid)
      {
      	int n;
      	int status;
      	int fd;
      	char *buffer;
      	buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
      	n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
      	fd = open(CGPATH "/test/tasks", O_WRONLY);
      	write(fd, buffer, n);
      	close(fd);
      	if (fork() > 0) {
      		fd = open("/dev/sda", O_RDONLY | O_DIRECT);
      		read(fd, buffer, 512);
      		close(fd);
      		wait(&status);
      	} else {
      		fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
      		n = read(fd, buffer, BUFFER_SIZE);
      		close(fd);
      	}
      	free(buffer);
      	exit(0);
      }
      
      void test(void)
      {
      	int status;
      	mkdir(CGPATH "/test", 0666);
      	if (fork() > 0)
      		wait(&status);
      	else
      		run(getpid());
      	rmdir(CGPATH "/test");
      }
      
      int main(int argc, char **argv)
      {
      	int i;
      	for (i = 0; i < NR_TESTS; i++)
      		test();
      	return 0;
      }
      Reported-by: Ricardo Marin Matinata <rmm@br.ibm.com>
      Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
  19. 12 Feb 2015, 4 commits
  20. 11 Feb 2015, 1 commit
  21. 10 Feb 2015, 1 commit
  22. 06 Feb 2015, 8 commits