1. 30 March 2018 (2 commits)
    • dm: fix dropped return code from dm_get_bdev_for_ioctl · da5dadb4
      Mike Snitzer authored
      dm_get_bdev_for_ioctl()'s return of 0 or 1 must be the result from
      prepare_ioctl (1 means the ioctl was issued to a partition, 0 means it
      wasn't).  Unfortunately commit 519049af ("dm: use blkdev_get rather
      than bdgrab when issuing pass-through ioctl") reused the variable 'r'
      to store the return from blkdev_get() that follows prepare_ioctl(),
      thereby dropping prepare_ioctl()'s result on the floor.
      
      This can lead to an ioctl or persistent reservation issued to a
      partition going unnoticed, which implies the extra permission check
      for CAP_SYS_RAWIO is skipped.
      
      Fix this by using a different variable to store blkdev_get()'s return.
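
      A minimal sketch of the fixed pattern (names simplified; the 'holder'
      argument and exact helper names here are assumptions, not the verbatim
      code from drivers/md/dm.c):

          int r, ret;

          r = prepare_ioctl(md, &bdev, &mode);  /* 0, or 1 if issued to a partition */
          if (r < 0)
                  return r;

          ret = blkdev_get(bdev, mode, holder); /* separate variable: keep r intact */
          if (ret)
                  return ret;

          return r;                             /* prepare_ioctl()'s result survives */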
      
      Fixes: 519049af ("dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl")
      Reported-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm mpath: fix support for loading scsi_dh modules during table load · e457edf0
      Mike Snitzer authored
      The ability to have multipath dynamically attach a scsi_dh that the
      user specified in the multipath table was broken by commit e8f74a0f
      ("dm mpath: eliminate need to use scsi_device_from_queue").
      
      Restore the ability to load, and attach, a particular scsi_dh module if
      one is specified (as noticed by checking m->hw_handler_name).
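
      A rough sketch of the restored table-load behavior (placement and
      surrounding checks are assumptions; scsi_dh_attach() is the SCSI layer
      helper, which loads the handler module if needed):

          if (m->hw_handler_name) {
                  r = scsi_dh_attach(q, m->hw_handler_name);
                  if (r < 0)
                          return r;   /* handler could not be loaded/attached */
          }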
      
      Fixes: e8f74a0f ("dm mpath: eliminate need to use scsi_device_from_queue")
      Reported-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  2. 15 March 2018 (1 commit)
  3. 14 March 2018 (2 commits)
  4. 07 March 2018 (6 commits)
    • dm table: allow upgrade from bio-based to specialized bio-based variant · c934edad
      Mike Snitzer authored
      In practice this is really only meaningful in the context of the DM
      multipath target (which uses dm_table_set_type() to set the type of
      device DM should create via its "queue_mode" option).
      
      So this change allows a DM multipath device with "queue_mode bio" to be
      upgraded from DM_TYPE_BIO_BASED to DM_TYPE_NVME_BIO_BASED -- iff the
      underlying device(s) are NVMe.
      
      DM_TYPE_NVME_BIO_BASED is just a DM core implementation detail that
      allows for NVMe-specific optimizations (e.g. using direct_make_request
      instead of generic_make_request).  If, in the future, there is no
      benefit or need to distinguish NVMe from everything else, it will be
      removed.
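
      A rough sketch of the upgrade path in dm_table_determine_type() (the
      condition helper is hypothetical; the real checks are more involved):

          if (t->type == DM_TYPE_BIO_BASED)
                  goto verify_bio_based;  /* possibly upgrade to a variant */
          ...
      verify_bio_based:
          t->type = DM_TYPE_BIO_BASED;
          if (dm_table_all_devices_nvme(t))       /* hypothetical helper */
                  t->type = DM_TYPE_NVME_BIO_BASED;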
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks · 8d47e659
      Mike Snitzer authored
      This eliminates the "queue_mode" configuration's "nvme" mode.  There
      wasn't anything NVMe-specific about that mode.  It was named "nvme"
      because it was a short name for the mode.  But the entire point of the
      mode was to optimize the multipath target for underlying devices that
      are _not_ SCSI-based.  Devices that aren't SCSI have no need for the
      various SCSI device handler (scsi_dh) specific code in DM multipath.
      
      But rather than narrowly defining this scsi_dh-or-not branching in
      terms of "nvme", invert the logic so that we're just checking whether
      a multipath device is layered on SCSI devices with scsi_dh attached.
      
      This allows any future storage technology to avoid scsi_dh specific code
      in the multipath target too.
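
      In sketch form, the inverted check (the condition placement is an
      assumption, not the verbatim dm-mpath code):

          /* before: skip scsi_dh work if (m->queue_mode == DM_TYPE_NVME_BIO_BASED) */
          /* after:  do scsi_dh work only when a handler is actually in play */
          if (m->hw_handler_name)
                  pg_init_all_paths(m);   /* SCSI-specific path initialization */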
      
      Fixes: 848b8aef ("dm mpath: optimize NVMe bio-based support")
      Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm table: fix "nvme" test · 99243b92
      Mikulas Patocka authored
      The strncmp function should compare 4 bytes.
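
      In sketch form (the surrounding code is assumed, not quoted):

          char b[BDEVNAME_SIZE];

          /* "nvme" is four characters, so all four bytes must be compared */
          if (!strncmp(bdevname(dev->bdev, b), "nvme", 4))
                  return true;    /* device name starts with "nvme" */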
      
      Fixes: 22c11858 ("dm: introduce DM_TYPE_NVME_BIO_BASED")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm raid: fix incorrect sync_ratio when degraded · da1e1488
      Jonathan Brassow authored
      Upstream commit 4102d9de ("dm raid: fix rs_get_progress()
      synchronization state/ratio") in combination with commit 7c29744e
      ("dm raid: simplify rs_get_progress()") introduced a regression by
      incorrectly reporting a sync_ratio of 0 for degraded raid sets.  This
      caused lvm2 to fail to repair raid legs automatically.
      
      Fix by identifying the degraded state by checking the MD_RECOVERY_INTR
      flag and returning mddev->recovery_cp in case it is set.
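
      A sketch of the check (the context in rs_get_progress() is simplified):

          if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
                  r = mddev->recovery_cp; /* sync thread interrupted: report the
                                             resync checkpoint, not ratio 0 */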
      
      MD sets recovery = [ MD_RECOVERY_RECOVER MD_RECOVERY_INTR
      MD_RECOVERY_NEEDED ] when a RAID member fails.  It then shuts down any
      sync thread that is running and leaves us with all MD_RECOVERY_* flags
      cleared.  The bug occurs if a status is requested in the short time it
      takes to shut down any sync thread and clear the flags, because we were
      keying on MD_RECOVERY_NEEDED, understanding it to be the initial
      phase of a "recover" sync thread.  However, this is an incorrect
      interpretation if MD_RECOVERY_INTR is also set.
      
      This also explains why the bug only happened when automatic repair
      was enabled, and not with a normal 'manual' method: it is impossible
      to react quickly enough to hit the problematic window without
      automation.
      
      Fix passes automatic repair tests.
      
      Fixes: 7c29744e ("dm raid: simplify rs_get_progress()")
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl · 519049af
      Mike Snitzer authored
      Otherwise an underlying device's teardown (e.g. SCSI) may race with the
      DM ioctl or persistent reservation and result in dereferencing driver
      memory that gets freed when the underlying device's final blkdev_put()
      occurs.
      
      bdgrab() only increases the refcount for the block_device's inode to
      ensure the block_device struct itself will not be freed, but does not
      guarantee the block_device will remain associated with the gendisk or
      its storage.
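
      The difference, in sketch form ('holder' is a placeholder here):

          bdgrab(bdev);                       /* pins only the block_device inode */

          r = blkdev_get(bdev, mode, holder); /* opens the device: holds off the
                                                 underlying driver's teardown */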
      
      Cc: stable@vger.kernel.org # 4.8+
      Reported-by: David Jeffery <djeffery@redhat.com>
      Suggested-by: David Jeffery <djeffery@redhat.com>
      Reviewed-by: Ben Marzinski <bmarzins@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm bufio: avoid false-positive Wmaybe-uninitialized warning · 590347e4
      Arnd Bergmann authored
      gcc-6.3 and earlier show a new warning after a seemingly unrelated
      change to the arm64 PAGE_KERNEL definition:
      
      In file included from drivers/md/dm-bufio.c:14:0:
      drivers/md/dm-bufio.c: In function 'alloc_buffer':
      include/linux/sched/mm.h:182:56: warning: 'noio_flag' may be used uninitialized in this function [-Wmaybe-uninitialized]
        current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
                                                              ^
      
      The same warning happened earlier on linux-3.18 for MIPS and I did a
      workaround for that, but now it's come back.
      
      gcc-7 and newer are apparently smart enough to figure this out, and
      other architectures don't show it, so the best I could come up with is
      to rework the caller slightly in a way that makes it obvious enough to
      all arm64 compilers what is happening here.
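
      The reworked caller looks roughly like this (close to, but not
      verbatim, the dm-bufio change):

          if (gfp_mask & __GFP_NORETRY) {
                  unsigned noio_flag = memalloc_noio_save();
                  void *ptr = __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL);

                  memalloc_noio_restore(noio_flag);
                  return ptr;     /* noio_flag is provably initialized here */
          }

          return __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL);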
      
      Fixes: 41acec62 ("arm64: kpti: Make use of nG dependent on arm64_kernel_unmapped_at_el0()")
      Link: https://patchwork.kernel.org/patch/9692829/
      Cc: stable@vger.kernel.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      [snitzer: moved declarations inside conditional, altered vmalloc return]
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  5. 06 March 2018 (2 commits)
    • bcache: don't attach backing with duplicate UUID · 86755b7a
      Michael Lyle authored
      This can happen e.g. during disk cloning.
      
      This is an incomplete fix: it does not catch duplicate UUIDs earlier
      when things are still unattached.  It does not unregister the device.
      Further changes to cope better with this are planned but conflict with
      Coly's ongoing improvements to handling device errors.  In the meantime,
      one can manually stop the device after this has happened.
      
      Attempts to attach a duplicate device result in:
      
      [  136.372404] loop: module loaded
      [  136.424461] bcache: register_bdev() registered backing device loop0
      [  136.424464] bcache: bch_cached_dev_attach() Tried to attach loop0 but duplicate UUID already attached
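
      The detection is, in sketch form (variable names assumed), a scan of
      the cache set's already-attached backing devices:

          list_for_each_entry_safe(exist_dc, t, &c->cached_devs, list) {
                  if (!memcmp(dc->sb.uuid, exist_dc->sb.uuid, 16)) {
                          pr_err("Tried to attach %s but duplicate UUID already attached",
                                 buf);
                          return -EINVAL;
                  }
          }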
      
      My test procedure is:
      
        dd if=/dev/sdb1 of=imgfile bs=1024 count=262144
        losetup -f imgfile
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix crashes in duplicate cache device register · cc40daf9
      Tang Junhui authored
      The kernel crashed when registering a duplicate cache device; the
      call trace is below:
      [  417.643790] CPU: 1 PID: 16886 Comm: bcache-register Tainted: G     W  OE    4.15.5-amd64-preempt-sysrq-20171018 #2
      [  417.643861] Hardware name: LENOVO 20ERCTO1WW/20ERCTO1WW, BIOS N1DET41W (1.15 ) 12/31/2015
      [  417.643870] RIP: 0010:bdevname+0x13/0x1e
      [  417.643876] RSP: 0018:ffffa3aa9138fd38 EFLAGS: 00010282
      [  417.643884] RAX: 0000000000000000 RBX: ffff8c8f2f2f8000 RCX: ffffd6701f8c7edf
      [  417.643890] RDX: ffffa3aa9138fd88 RSI: ffffa3aa9138fd88 RDI: 0000000000000000
      [  417.643895] RBP: ffffa3aa9138fde0 R08: ffffa3aa9138fae8 R09: 000000000001850e
      [  417.643901] R10: ffff8c8eed34b271 R11: ffff8c8eed34b250 R12: 0000000000000000
      [  417.643906] R13: ffffd6701f78f940 R14: ffff8c8f38f80000 R15: ffff8c8ea7d90000
      [  417.643913] FS:  00007fde7e66f500(0000) GS:ffff8c8f61440000(0000) knlGS:0000000000000000
      [  417.643919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  417.643925] CR2: 0000000000000314 CR3: 00000007e6fa0001 CR4: 00000000003606e0
      [  417.643931] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  417.643938] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  417.643946] Call Trace:
      [  417.643978]  register_bcache+0x1117/0x1270 [bcache]
      [  417.643994]  ? slab_pre_alloc_hook+0x15/0x3c
      [  417.644001]  ? slab_post_alloc_hook.isra.44+0xa/0x1a
      [  417.644013]  ? kernfs_fop_write+0xf6/0x138
      [  417.644020]  kernfs_fop_write+0xf6/0x138
      [  417.644031]  __vfs_write+0x31/0xcc
      [  417.644043]  ? current_kernel_time64+0x10/0x36
      [  417.644115]  ? __audit_syscall_entry+0xbf/0xe3
      [  417.644124]  vfs_write+0xa5/0xe2
      [  417.644133]  SyS_write+0x5c/0x9f
      [  417.644144]  do_syscall_64+0x72/0x81
      [  417.644161]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      [  417.644169] RIP: 0033:0x7fde7e1c1974
      [  417.644175] RSP: 002b:00007fff13009a38 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  417.644183] RAX: ffffffffffffffda RBX: 0000000001658280 RCX: 00007fde7e1c1974
      [  417.644188] RDX: 000000000000000a RSI: 0000000001658280 RDI: 0000000000000001
      [  417.644193] RBP: 000000000000000a R08: 0000000000000003 R09: 0000000000000077
      [  417.644198] R10: 000000000000089e R11: 0000000000000246 R12: 0000000000000001
      [  417.644203] R13: 000000000000000a R14: 7fffffffffffffff R15: 0000000000000000
      [  417.644213] Code: c7 c2 83 6f ee 98 be 20 00 00 00 48 89 df e8 6c 27 3b 00 48 89 d8 5b c3 0f 1f 44 00 00 48 8b 47 70 48 89 f2 48 8b bf 80 00 00 00 <8b> b0 14 03 00 00 e9 73 ff ff ff 0f 1f 44 00 00 48 8b 47 40 39
      [  417.644302] RIP: bdevname+0x13/0x1e RSP: ffffa3aa9138fd38
      [  417.644306] CR2: 0000000000000314
      
      When registering a duplicate cache device in register_cache(), after
      register_cache_set() fails, bch_cache_release() is called and the
      bdev is freed, so the subsequent bdevname(bdev, name) call crashed
      the kernel.

      Since bch_cache_release() frees the bdev, this patch makes sure the
      bdev is freed inside register_cache() when it fails, and no longer
      frees the bdev again in register_bcache() after register_cache() has
      failed.
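
      In sketch form, the ownership rule after the fix (names simplified):

          ret = register_cache(sb, sb_page, bdev, ca);
          if (ret)
                  goto out;       /* do NOT blkdev_put(bdev) here: on failure
                                     register_cache() has already released it */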
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reported-by: Marc MERLIN <marc@merlins.org>
      Tested-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 28 February 2018 (2 commits)
    • bcache: fix kcrashes with fio in RAID5 backend dev · 60eb34ec
      Tang Junhui authored
      The kernel crashed when running fio on a bcache device with a RAID5
      backend; the call trace is below:
      [  440.012034] kernel BUG at block/blk-ioc.c:146!
      [  440.012696] invalid opcode: 0000 [#1] SMP NOPTI
      [  440.026537] CPU: 2 PID: 2205 Comm: md127_raid5 Not tainted 4.15.0 #8
      [  440.027441] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16/2015
      [  440.028615] RIP: 0010:put_io_context+0x8b/0x90
      [  440.029246] RSP: 0018:ffffa8c882b43af8 EFLAGS: 00010246
      [  440.029990] RAX: 0000000000000000 RBX: ffffa8c88294fca0 RCX: 00000000000f4240
      [  440.031006] RDX: 0000000000000004 RSI: 0000000000000286 RDI: ffffa8c88294fca0
      [  440.032030] RBP: ffffa8c882b43b10 R08: 0000000000000003 R09: ffff949cb80c1700
      [  440.033206] R10: 0000000000000104 R11: 000000000000b71c R12: 0000000000001000
      [  440.034222] R13: 0000000000000000 R14: ffff949cad84db70 R15: ffff949cb11bd1e0
      [  440.035239] FS:  0000000000000000(0000) GS:ffff949cba280000(0000) knlGS:0000000000000000
      [  440.060190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  440.084967] CR2: 00007ff0493ef000 CR3: 00000002f1e0a002 CR4: 00000000001606e0
      [  440.110498] Call Trace:
      [  440.135443]  bio_disassociate_task+0x1b/0x60
      [  440.160355]  bio_free+0x1b/0x60
      [  440.184666]  bio_put+0x23/0x30
      [  440.208272]  search_free+0x23/0x40 [bcache]
      [  440.231448]  cached_dev_write_complete+0x31/0x70 [bcache]
      [  440.254468]  closure_put+0xb6/0xd0 [bcache]
      [  440.277087]  request_endio+0x30/0x40 [bcache]
      [  440.298703]  bio_endio+0xa1/0x120
      [  440.319644]  handle_stripe+0x418/0x2270 [raid456]
      [  440.340614]  ? load_balance+0x17b/0x9c0
      [  440.360506]  handle_active_stripes.isra.58+0x387/0x5a0 [raid456]
      [  440.380675]  ? __release_stripe+0x15/0x20 [raid456]
      [  440.400132]  raid5d+0x3ed/0x5d0 [raid456]
      [  440.419193]  ? schedule+0x36/0x80
      [  440.437932]  ? schedule_timeout+0x1d2/0x2f0
      [  440.456136]  md_thread+0x122/0x150
      [  440.473687]  ? wait_woken+0x80/0x80
      [  440.491411]  kthread+0x102/0x140
      [  440.508636]  ? find_pers+0x70/0x70
      [  440.524927]  ? kthread_associate_blkcg+0xa0/0xa0
      [  440.541791]  ret_from_fork+0x35/0x40
      [  440.558020] Code: c2 48 00 5b 41 5c 41 5d 5d c3 48 89 c6 4c 89 e7 e8 bb c2 48 00 48 8b 3d bc 36 4b 01 48 89 de e8 7c f7 e0 ff 5b 41 5c 41 5d 5d c3 <0f> 0b 0f 1f 00 0f 1f 44 00 00 55 48 8d 47 b8 48 89 e5 41 57 41
      [  440.610020] RIP: put_io_context+0x8b/0x90 RSP: ffffa8c882b43af8
      [  440.628575] ---[ end trace a1fd79d85643a73e ]---
      
      The crash happens when a bypass I/O arrives: in that scenario,
      s->iop.bio points to s->orig_bio.  search_free() finishes s->orig_bio
      by calling bio_complete(), after which s->iop.bio becomes invalid, so
      the kernel crashes on the subsequent bio_put().  Arguably the upper
      layer is at fault, since a bio should not be freed before bio_put()
      is called on it, but it is safer to call bio_put() first and then let
      bio_complete() notify the upper layer that the bio has ended.

      This patch moves bio_complete() below bio_put() to avoid the kernel
      crash.
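
      The reordered search_free() looks roughly like this (simplified):

          static void search_free(struct closure *cl)
          {
                  struct search *s = container_of(cl, struct search, cl);

                  if (s->iop.bio)
                          bio_put(s->iop.bio);  /* drop our ref while the bio lives */

                  bio_complete(s);              /* only now finish s->orig_bio */
                  closure_debug_destroy(cl);
                  mempool_free(s, s->d->c->search);
          }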
      
      [mlyle: fixed commit subject for character limits]
      Reported-by: Matthias Ferdinand <bcache@mfedv.net>
      Tested-by: Matthias Ferdinand <bcache@mfedv.net>
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: correct flash only vols (check all uuids) · 02aa8a8b
      Coly Li authored
      Commit 2831231d ("bcache: reduce cache_set devices iteration by
      devices_max_used") adds c->devices_max_used to reduce iteration over
      c->uuids elements; this value is updated in bcache_device_attach().
      
      But for a flash-only volume, when flash_devs_run() is called,
      bcache_device_attach() has not been called yet and c->devices_max_used
      has not been updated.  The unexpected result is that the flash-only
      volume won't be run by flash_devs_run().
      
      This patch fixes the issue by iterating over all c->uuids elements in
      flash_devs_run().  c->devices_max_used will still be updated properly
      when bcache_device_attach() gets called.
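
      In sketch form, the loop in flash_devs_run() now covers the whole
      uuids array (helper names as in bcache's super.c):

          for (i = 0; i < c->nr_uuids; i++) {
                  u = c->uuids + i;
                  if (UUID_FLASH_ONLY(u))
                          ret = flash_dev_run(c, u);
          }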
      
      [mlyle: commit subject edited for character limit]
      
      Fixes: 2831231d ("bcache: reduce cache_set devices iteration by devices_max_used")
      Reported-by: Tang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 26 February 2018 (3 commits)
    • md/raid1: fix NULL pointer dereference · 3de59bb9
      Yufen Yu authored
      In handle_write_finished(), if r1_bio->bios[m] != NULL, it assumes
      the corresponding conf->mirrors[m].rdev is also not NULL.  But that
      is not always true.

      Even if some I/O holds the replacement rdev (i.e.
      rdev->nr_pending.count > 0), raid1_remove_disk() can still set the
      rdev to NULL.  That means bios[m] != NULL while mirrors[m].rdev is
      NULL, resulting in a NULL pointer dereference in
      handle_write_finished and sync_request_write.
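
      The guard, in sketch form:

          struct md_rdev *rdev = conf->mirrors[m].rdev;

          if (r1_bio->bios[m] && rdev) {
                  /* only now is it safe to dereference rdev */
          }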
      
      This patch can fix BUGs as follows:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000140
       IP: [<ffffffff815bbbbd>] raid1d+0x2bd/0xfc0
       PGD 12ab52067 PUD 12f587067 PMD 0
       Oops: 0000 [#1] SMP
       CPU: 1 PID: 2008 Comm: md3_raid1 Not tainted 4.1.44+ #130
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
       Call Trace:
        ? schedule+0x37/0x90
        ? prepare_to_wait_event+0x83/0xf0
        md_thread+0x144/0x150
        ? wake_atomic_t_function+0x70/0x70
        ? md_start_sync+0xf0/0xf0
        kthread+0xd8/0xf0
        ? kthread_worker_fn+0x160/0x160
        ret_from_fork+0x42/0x70
        ? kthread_worker_fn+0x160/0x160
      
       BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
       IP: sync_request_write+0x9e/0x980
       PGD 800000007c518067 P4D 800000007c518067 PUD 8002b067 PMD 0
       Oops: 0000 [#1] SMP PTI
       CPU: 24 PID: 2549 Comm: md3_raid1 Not tainted 4.15.0+ #118
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
       Call Trace:
        ? sched_clock+0x5/0x10
        ? sched_clock_cpu+0xc/0xb0
        ? flush_pending_writes+0x3a/0xd0
        ? pick_next_task_fair+0x4d5/0x5f0
        ? __switch_to+0xa2/0x430
        raid1d+0x65a/0x870
        ? find_pers+0x70/0x70
        ? find_pers+0x70/0x70
        ? md_thread+0x11c/0x160
        md_thread+0x11c/0x160
        ? finish_wait+0x80/0x80
        kthread+0x111/0x130
        ? kthread_create_worker_on_cpu+0x70/0x70
        ? do_syscall_64+0x6f/0x190
        ? SyS_exit_group+0x10/0x10
        ret_from_fork+0x35/0x40
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
    • md: fix a potential deadlock of raid5/raid10 reshape · 8876391e
      BingJing Chang authored
      There is a potential deadlock if mount/umount happens while
      raid5_finish_reshape() tries to grow the size of the emulated disk.
      
      How does the deadlock happen?
      1) The raid5 resync thread finishes reshape (expanding the array).
      2) The mount or umount thread holds the VFS sb->s_umount lock and
         tries to write critical data through to the raid5 emulated block
         device, so it waits for the raid5 kernel thread to handle stripes
         in order to finish its I/Os.
      3) In the raid5 kernel thread's routine, md_check_recovery() is
         called first in order to reap the raid5 resync thread; that is,
         raid5_finish_reshape() gets called.  In this function, it tries to
         update conf and calls VFS revalidate_disk() to grow the raid5
         emulated block device, which means acquiring the VFS sb->s_umount
         lock.
      The raid5 kernel thread cannot continue, so no one can handle the
      mount/umount I/Os (stripes).  Once the write-through I/Os cannot be
      finished, mount/umount will not release sb->s_umount.  The deadlock
      happens.
      
      The raid5 kernel thread services an emulated block device.  It is
      responsible for handling I/Os (stripes) from upper layers.  The
      emulated block device should not request any I/Os on itself, i.e. it
      should not call VFS layer functions (if it did, it would have to
      acquire VFS locks to guarantee the I/O ordering).  That is why we
      have the resync thread send the resync I/O requests and wait for the
      results.
      
      For solving this potential deadlock, we can put the size growth of the
      emulated block device as the final step of reshape thread.
      
      2017/12/29:
      Thanks to Guoqing Jiang <gqjiang@suse.com>, we confirmed that the
      same deadlock issue exists in raid10.  It is reproducible and can be
      fixed by this patch as well.  For raid10.c, the similar
      deadlock-prevention code can be removed too, since the size update
      has already happened by that point.
      Reported-by: Alex Wu <alexwu@synology.com>
      Reviewed-by: Alex Wu <alexwu@synology.com>
      Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
      Signed-off-by: BingJing Chang <bingjingc@synology.com>
      Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
    • md-cluster: choose correct label when clustered layout is not supported · 43a52123
      Lidong Zhong authored
      r10conf has already been successfully allocated before the layout is
      checked, so the error path must jump to the label that frees it.
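
      In sketch form (label names are illustrative):

          conf = setup_conf(mddev);       /* r10conf allocated here */
          ...
          if (unsupported)                /* clustered layout check */
                  goto out_free_conf;     /* not 'out': conf must be freed */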
      Signed-off-by: Lidong Zhong <lzhong@suse.com>
      Reviewed-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
  8. 22 February 2018 (3 commits)
    • treewide/trivial: Remove ';;$' typo noise · ed7158ba
      Ingo Molnar authored
      On lkml, suggestions were made to split up such trivial typo fixes
      into per-subsystem patches:
      
        --- a/arch/x86/boot/compressed/eboot.c
        +++ b/arch/x86/boot/compressed/eboot.c
        @@ -439,7 +439,7 @@ setup_uga32(void **uga_handle, unsigned long size, u32 *width, u32 *height)
                struct efi_uga_draw_protocol *uga = NULL, *first_uga;
                efi_guid_t uga_proto = EFI_UGA_PROTOCOL_GUID;
                unsigned long nr_ugas;
        -       u32 *handles = (u32 *)uga_handle;;
        +       u32 *handles = (u32 *)uga_handle;
                efi_status_t status = EFI_INVALID_PARAMETER;
                int i;
      
      This patch is the result of the following script:
      
        $ sed -i 's/;;$/;/g' $(git grep -E ';;$'  | grep "\.[ch]:"  | grep -vwE 'for|ia64' | cut -d: -f1 | sort | uniq)
      
      ... followed by manual review to make sure it's all good.
      
      Splitting this up is just crazy talk, let's get over with this and just do it.
      Reported-by: Pavel Machek <pavel@ucw.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • md: raid5: avoid string overflow warning · 53b8d89d
      Arnd Bergmann authored
      gcc warns about a possible overflow of the kmem_cache string, when adding
      four characters to a string of the same length:
      
      drivers/md/raid5.c: In function 'setup_conf':
      drivers/md/raid5.c:2207:34: error: '-alt' directive writing 4 bytes into a region of size between 1 and 32 [-Werror=format-overflow=]
        sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
                                        ^~~~
      drivers/md/raid5.c:2207:2: note: 'sprintf' output between 5 and 36 bytes into a destination of size 32
        sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      If I'm counting correctly, we need 11 characters for the fixed part
      of the string and 18 characters for a 64-bit pointer (when no gendisk
      is used), so that leaves three characters for conf->level, which should
      always be sufficient.
      
      This makes the code use snprintf() with the correct length, to
      make the code more robust against changes, and to get the compiler
      to shut up.
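
      The bounded version, in sketch form:

          /* the write is limited to the destination's size, "-alt" included */
          snprintf(conf->cache_name[1], sizeof(conf->cache_name[1]),
                   "%s-alt", conf->cache_name[0]);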
      
      In commit f4be6b43 ("md/raid5: ensure we create a unique name for
      kmem_cache when mddev has no gendisk") from 2010, Neil said that
      the pointer could be removed "shortly" once devices without gendisk
      are disallowed. I have no idea if that happened, but if it did, that
      should probably be changed as well.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
    • raid5-ppl: fix handling flush requests · f4bc0c81
      Artur Paszkiewicz authored
      Add missing bio completion. Without this any flush request would hang.
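
      The missing step, in sketch form (the surrounding completion path is
      assumed):

          bio_endio(bio);   /* report the flush bio as done; without this the
                               flush request waits forever */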
      
      Fixes: 1532d9e8 ("raid5-ppl: PPL support for disks with write-back cache enabled")
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
  9. 20 February 2018 (2 commits)
    • md raid10: fix NULL dereference in handle_write_completed() · 01a69cab
      Yufen Yu authored
      In the case of 'recover', an r10bio with R10BIO_WriteError &
      R10BIO_IsRecover is processed by handle_write_completed().
      This function traverses all r10bio->devs[copies]:
      if devs[m].repl_bio != NULL, it assumes conf->mirrors[dev].replacement
      is also not NULL.  However, this is not always true.

      When an rdev of the raid10 array has a replacement, each r10bio
      ->devs[m].repl_bio != NULL in conf->r10buf_pool.  However, in 'recover',
      even if the corresponding replacement is NULL, r10bio->devs[m].repl_bio
      is not cleared, resulting in a NULL dereference of the replacement.
      
      This bug was introduced when replacement support for raid10 was
      added in Linux 3.3.
      
      As NeilBrown suggested:
      	Elsewhere the determination of "is this device part of the
      	resync/recovery" is made by resting bio->bi_end_io.
      	If this is end_sync_write, then we tried to write here.
      	If it is NULL, then we didn't try to write.
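
      In sketch form:

          struct bio *bio = r10_bio->devs[m].repl_bio;

          if (bio && bio->bi_end_io == end_sync_write) {
                  /* we really wrote via the replacement; safe to use
                     conf->mirrors[dev].replacement here */
          }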
      
      Fixes: 9ad1aefc ("md/raid10:  Handle replacement devices during resync.")
      Cc: stable (V3.3+)
      Suggested-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
    • md: only allow remove_and_add_spares when no sync_thread running. · 39772f0a
      NeilBrown authored
      The locking protocols in md assume that a device will
      never be removed from an array during resync/recovery/reshape.
      When that isn't happening, rcu or reconfig_mutex is needed
      to protect an rdev pointer while taking a refcount.  When
      it is happening, that protection isn't needed.
      
      Unfortunately there are cases where remove_and_add_spares() is
      called when recovery might be happening: in state_store(),
      slot_store() and hot_remove_disk().
      In each case, this is just an optimization to try to expedite
      removal from the personality so the device can be removed from
      the array.  If resync etc. is happening, we just have to wait
      for md_check_recovery() to find a suitable time to call
      remove_and_add_spares().
      
      This optimization is not essential, so it doesn't matter if it
      fails.
      So change remove_and_add_spares() to abort early if
      resync/recovery/reshape is happening, unless it is called
      from md_check_recovery() as part of a newly started recovery.
      The parameter "this" is only NULL when called from
      md_check_recovery() so when it is NULL, there is no need to abort.
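
      The early abort, in sketch form:

          /* "this" is only NULL when called from md_check_recovery() */
          if (this && test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
                  return 0;       /* let md_check_recovery() retry later */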
      
      As this can result in a NULL dereference, the fix is suitable
      for -stable.
      
      Cc: yuyufen <yuyufen@huawei.com>
      Cc: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Fixes: 8430e7e0 ("md: disconnect device from personality before trying to remove it.")
      Cc: stable@vger.kernel.org (v4.8+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
  10. 19 February 2018 (2 commits)
  11. 18 February 2018 (5 commits)
  12. 16 February 2018 (1 commit)
    • dm: correctly handle chained bios in dec_pending() · 8dd601fa
      NeilBrown authored
      dec_pending() is given an error status (possibly 0) to be recorded
      against a bio.  It can be called several times on the one 'struct
      dm_io', and it is careful to only assign a non-zero error to
      io->status.  However, when it then assigns io->status to bio->bi_status,
      it is not careful and can overwrite a genuine error status with 0.
      
      This can happen when chained bios are in use.  If a bio is chained
      beneath the bio that this dm_io is handling, the child bio might
      complete and set bio->bi_status before the dm_io completes.
      
      This has been possible since chained bios were introduced in 3.14, and
      has become a lot easier to trigger with commit 18a25da8 ("dm: ensure
      bio submission follows a depth-first tree walk") as that commit caused
      dm to start using chained bios itself.
      
      A particular failure mode is that if a bio spans an 'error' target and a
      working target, the 'error' fragment will complete instantly and set the
      ->bi_status, and the other fragment will normally complete a little
      later, and will clear ->bi_status.
      
      The fix is simply to only assign io_error to bio->bi_status when
      io_error is not zero.
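
      In sketch form:

          if (io_error)                   /* don't clobber an error a chained
                                             child bio already recorded */
                  bio->bi_status = io_error;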
      Reported-and-tested-by: Milan Broz <gmazyland@gmail.com>
      Cc: stable@vger.kernel.org (v3.14+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  13. 12 February 2018 (1 commit)
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds authored
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 08 February 2018 (8 commits)
    • bcache: fix for data collapse after re-attaching an attached device · 73ac105b
      Tang Junhui authored
      The back-end device sdm had already been attached to a cache set with
      ID f67ebe1f-f8bc-4d73-bfe5-9dc88607f119; we then tried to attach it to
      another cache set, and it returned an error:
      [root]# cd /sys/block/sdm/bcache
      [root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach
      -bash: echo: write error: Invalid argument
      
      After that, we executed a command to modify the label of the bcache
      device:
      [root]# echo data_disk1 > label
      
      Then we rebooted the system.  When it powered on, the back-end
      device could not attach to the cache set; a message showed in the log:
      Feb  5 12:05:52 ceph152 kernel: [922385.508498] bcache:
      bch_cached_dev_attach() couldn't find uuid for sdm in set
      
      In sysfs_attach(), dc->sb.set_uuid was assigned the value passed in
      through sysfs, regardless of whether bch_cached_dev_attach()
      succeeded.  For example, if the back-end device had already been
      attached to a cache set, bch_cached_dev_attach() would fail, but
      dc->sb.set_uuid was changed anyway.  Modifying the label of the
      bcache device then calls bch_write_bdev_super(), which writes
      dc->sb.set_uuid to the super block, so a wrong cache set ID is
      recorded in the super block.  After the system reboots, the cache
      set cannot find the uuid of the back-end device, so the bcache
      device can no longer exist or be used.
      
      In this patch, we don't assign the cache set ID to dc->sb.set_uuid
      in sysfs_attach() directly; instead we pass it into
      bch_cached_dev_attach(), and assign dc->sb.set_uuid to the cache set
      ID only after the back-end device has attached successfully.
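
      In sketch form (close to, but not verbatim, the bcache change):

          int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
                                    uint8_t *set_uuid)
          {
                  ...
                  /* every check passed: only now commit the set uuid */
                  memcpy(dc->sb.set_uuid, c->sb.set_uuid, 16);
                  bch_write_bdev_super(dc, &cl);
                  ...
          }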
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: return attach error when no cache set exist · 7f4fc93d
      Tang Junhui authored
      I attached a back-end device to a cache set while the cache set was
      not registered yet; the back-end device did not attach successfully,
      and no error was returned:
      [root]# echo 87859280-fec6-4bcc-20df7ca8f86b > /sys/block/sde/bcache/attach
      [root]#
      
      In sysfs_attach(), the return value "v" is initialized to "size" at
      the beginning, and if no cache set exists in bch_cache_sets, "v"
      never changes and is returned to sysfs; sysfs regards it as success
      since "size" is a positive number.
      
      This patch fixes this issue by assigning "v" with "-ENOENT" in the
      initialization.
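
      In sketch form:

          ssize_t v = -ENOENT;    /* was: v = size, which sysfs reads as success */

          list_for_each_entry(c, &bch_cache_sets, list)
                  v = bch_cached_dev_attach(dc, c, set_uuid);

          return v;               /* -ENOENT if no cache set was registered */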
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: set writeback_rate_update_seconds in range [1, 60] seconds · 7a5e3ecb
      Coly Li authored
      dc->writeback_rate_update_seconds can be set via sysfs and its value
      can be set to [1, ULONG_MAX].  It does not make sense to set such a
      large value; 60 seconds is long enough, considering that the default
      5 seconds has worked well for a long time.
      
      Because dc->writeback_rate_update is a special delayed work, it re-arms
      itself inside the delayed work routine update_writeback_rate(). When
      stopping it by cancel_delayed_work_sync(), there should be a timeout to
      wait and make sure the re-armed delayed work is stopped too. A small max
      value of dc->writeback_rate_update_seconds is also helpful to decide a
      reasonable small timeout.
      
      This patch limits sysfs interface to set dc->writeback_rate_update_seconds
      in range of [1, 60] seconds, and replaces the hand-coded number by macros.
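
      In sketch form, using bcache's sysfs_strtoul_clamp() helper (macro
      names follow the patch description):

          #define WRITEBACK_RATE_UPDATE_SECS_MAX          60
          #define WRITEBACK_RATE_UPDATE_SECS_DEFAULT      5

          sysfs_strtoul_clamp(writeback_rate_update_seconds,
                              dc->writeback_rate_update_seconds,
                              1, WRITEBACK_RATE_UPDATE_SECS_MAX);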
      
      Changelog:
      v2: fix a rebase typo in v4, which is pointed out by Michael Lyle.
      v1: initial version.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix for allocator and register thread race · 682811b3
      Tang Junhui authored
      After a long time of running random small-I/O writes, I rebooted the
      machine, and after it powered on, I found bcache got stuck; the
      stacks are:
      [root@ceph153 ~]# cat /proc/2510/task/*/stack
      [<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache]
      [<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache]
      [<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache]
      [<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache]
      [<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
      [<ffffffff810a631f>] kthread+0xcf/0xe0
      [<ffffffff8164c318>] ret_from_fork+0x58/0x90
      [<ffffffffffffffff>] 0xffffffffffffffff
      [root@ceph153 ~]# cat /proc/2038/task/*/stack
      [<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
      [<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache]
      [<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache]
      [<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache]
      [<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache]
      [<ffffffff812f702f>] kobj_attr_store+0xf/0x20
      [<ffffffff8125b216>] sysfs_write_file+0xc6/0x140
      [<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0
      [<ffffffff811e069f>] SyS_write+0x7f/0xe0
      [<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1
      The stacks show that the register thread and the allocator thread
      got stuck when registering the cache device.

      I rebooted the machine several times; the issue always exists on
      this machine.
      
      I debugged the code and found the call path below:
      register_bcache()
         ==>run_cache_set()
            ==>bch_journal_replay()
               ==>bch_btree_insert()
                  ==>__bch_btree_map_nodes()
                     ==>btree_insert_fn()
                        ==>btree_split() //node need split
                           ==>btree_check_reserve()
      In btree_check_reserve(), it checks whether there are enough buckets
      of RESERVE_BTREE type.  Since the allocator thread has not started
      working yet, no buckets of RESERVE_BTREE type have been allocated, so
      the register thread waits on c->btree_cache_wait and goes to sleep.
      
      Then the allocator thread is initialized; its call path is below:
      bch_allocator_thread()
      ==>bch_prio_write()
         ==>bch_journal_meta()
            ==>bch_journal()
               ==>journal_wait_for_write()
      In journal_wait_for_write(), it checks whether the journal is full
      via journal_full().  But the long run of random small-I/O writes has
      exhausted the journal buckets (journal.blocks_free = 0).  In order to
      release journal buckets, the allocator calls btree_flush_write() to
      flush keys to btree nodes, and waits on c->journal.wait until the
      btree node writes finish or some journal bucket space frees up; then
      the allocator thread goes to sleep.  But in btree_flush_write(),
      since bch_journal_replay() has not finished, no btree node holds a
      journal reference (the condition "if (btree_current_write(b)->journal)"
      is never satisfied), so there is no btree node to flush, no journal
      bucket is released, and the allocator sleeps forever.
      
      From the above analysis, we can see that:
      1) The register thread waits for the allocator thread to allocate
         buckets of RESERVE_BTREE type;
      2) The allocator thread waits for the register thread to replay the
         journal, so it can flush btree nodes and free a journal bucket.
      So they get stuck waiting for each other.
      
      Hua Rui provided a patch that allocates some buckets of RESERVE_BTREE
      type in advance, so the register thread can get buckets when a btree
      node splits and does not need to wait for the allocator thread.  I
      tested it; it helped, and the register thread ran a step further, but
      eventually still got stuck.  The reason is that only 8 buckets of
      RESERVE_BTREE type were allocated, and in bch_journal_replay(), after
      2 btree node splits, only 4 buckets of RESERVE_BTREE type were left;
      btree_check_reserve() was then no longer satisfied, so it went to
      sleep again, and at the same time the allocator thread had not
      flushed enough btree nodes to release a journal bucket, so both got
      stuck again.
      
      So we need to allocate more buckets of RESERVE_BTREE type in advance,
      but how many are enough?  From experience and testing, I think it
      should be as many as the journal buckets.  So I modified the code as
      in this patch, tested it on the machine, and it works.
      
      This patch is based on Hua Rui's patch, and allocates more buckets of
      RESERVE_BTREE type in advance to keep the register thread and the
      allocator thread from waiting for each other.
      
      [patch v2] ca->sb.njournal_buckets would be 0 the first time after
      cache creation, and no journal exists yet, so just 8 btree buckets is OK.
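
      In sketch form, the reserve sizing (placement in cache_alloc() is an
      assumption):

          /* as many btree-reserve buckets as journal buckets; a fresh cache
             has njournal_buckets == 0, so fall back to 8 */
          uint64_t btree_buckets = ca->sb.njournal_buckets ?: 8;

          if (!init_fifo(&ca->free[RESERVE_BTREE], btree_buckets, GFP_KERNEL))
                  return -ENOMEM;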
      Signed-off-by: Hua Rui <huarui.dev@gmail.com>
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: set error_limit correctly · 7ba0d830
      Coly Li authored
      Struct cache uses io_errors for two purposes,
      - Error decay: when cache set error_decay is set, io_errors is used to
        generate a small piece of delay when I/O error happens.
      - I/O errors counter: in order to generate big enough value for error
        decay, I/O errors counter value is stored by left shifting 20 bits (a.k.a
        IO_ERROR_SHIFT).
      
      In bch_count_io_errors(), if the I/O error counter reaches the cache
      set's error limit, bch_cache_set_error() is called to retire the whole
      cache set.  But the current code is problematic when checking the
      error limit; see the following code piece from bch_count_io_errors(),
      
       90     if (error) {
       91             char buf[BDEVNAME_SIZE];
       92             unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
       93                                                 &ca->io_errors);
       94             errors >>= IO_ERROR_SHIFT;
       95
       96             if (errors < ca->set->error_limit)
       97                     pr_err("%s: IO error on %s, recovering",
       98                            bdevname(ca->bdev, buf), m);
       99             else
      100                     bch_cache_set_error(ca->set,
      101                                         "%s: too many IO errors %s",
      102                                         bdevname(ca->bdev, buf), m);
      103     }
      
      At line 94, errors is right-shifted by IO_ERROR_SHIFT bits, so it is
      the real error counter being compared at line 96.  But
      ca->set->error_limit is initialized with an amplified value in
      bch_cache_set_alloc(),
      1545         c->error_limit  = 8 << IO_ERROR_SHIFT;
      
      It means that by default, bch_count_io_errors() won't call
      bch_cache_set_error() to retire the problematic cache device until
      8<<20 errors have happened.  If the average request size is 64KB,
      that means bcache won't handle a failed device until 512GB of data
      has been requested.  This is far too large for an I/O error
      threshold, so I believe the correct error limit should be much
      smaller.
      
      This patch sets the default cache set error limit to 8; then in
      bch_count_io_errors(), when the error counter reaches 8 (if the
      default value is in use), bch_cache_set_error() is called to retire
      the whole cache set.  This patch also removes the bit shifting when
      storing or showing the io_error_limit value via the sysfs interface.
      
      Nowadays most SSDs handle internal flash failures automatically by
      LBA address re-indirect mapping.  If an I/O error can be observed by
      upper-layer code, it is a notable error, because the SSD can no
      longer re-indirect map the problematic LBA address to an available
      flash block.  This situation indicates the whole SSD will fail very
      soon.  Therefore setting 8 as the default I/O error limit value makes
      sense; it is enough for most cache devices.
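
      In sketch form:

          #define DEFAULT_IO_ERROR_LIMIT 8

          c->error_limit = DEFAULT_IO_ERROR_LIMIT;  /* was: 8 << IO_ERROR_SHIFT */

      With this, the already down-shifted 'errors' value at line 94 is
      compared against a plain count at line 96.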
      
      Changelog:
      v2: add reviewed-by from Hannes.
      v1: initial version for review.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: properly set task state in bch_writeback_thread() · 99361bbf
      Coly Li authored
      Kernel thread routine bch_writeback_thread() has the following code block,
      
      447         down_write(&dc->writeback_lock);
      448~450     if (check conditions) {
      451                 up_write(&dc->writeback_lock);
      452                 set_current_state(TASK_INTERRUPTIBLE);
      453
      454                 if (kthread_should_stop())
      455                         return 0;
      456
      457                 schedule();
      458                 continue;
      459         }
      
      If the condition check is true, the task state is set to
      TASK_INTERRUPTIBLE and schedule() is called to wait for others to
      wake it up.
      
      There are 2 issues in the current code:
      1, Task state is set to TASK_INTERRUPTIBLE after the condition checks.
         If another process changes the condition and calls
         wake_up_process(dc->writeback_thread) in between, then at line 452
         the task state is set back to TASK_INTERRUPTIBLE and the writeback
         kernel thread loses a chance to be woken up.
      2, At line 454, if kthread_should_stop() is true, the writeback kernel
         thread returns to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE
         set and calls do_exit().  It is not good to enter do_exit() with task
         state TASK_INTERRUPTIBLE: in the following code path, might_sleep()
         is called and a warning message is reported by __might_sleep():
         "WARNING: do not call blocking ops when !TASK_RUNNING; state=1 set
         at [xxxx]".
      
      For the first issue, the task state should be set before the condition
      checks.  Indeed, because dc->writeback_lock is required when modifying
      all the conditions, calling set_current_state() inside the code block
      where dc->writeback_lock is held is safe.  But this is quite implicit,
      so I still move set_current_state() before all the condition checks.
      
      For the second issue, frankly speaking, it does not hurt when a kernel
      thread exits in TASK_INTERRUPTIBLE state, but this warning message
      scares users, making them feel there might be something risky with
      bcache that could hurt their data.  Setting the task state to
      TASK_RUNNING before returning fixes this problem.
      
      In alloc.c:allocator_wait() there is a similar issue, and it is also
      fixed in this patch.
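
      The reordering, in sketch form (the wait condition is elided):

          set_current_state(TASK_INTERRUPTIBLE);  /* 1) before the checks */
          if (nothing_to_write_back) {            /* hypothetical condition */
                  up_write(&dc->writeback_lock);

                  if (kthread_should_stop()) {
                          set_current_state(TASK_RUNNING); /* 2) never exit in
                                                              INTERRUPTIBLE */
                          return 0;
                  }

                  schedule();
                  continue;
          }
          set_current_state(TASK_RUNNING);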
      
      Changelog:
      v3: merge two similar fixes into one patch
      v2: fix the race issue in v1 patch.
      v1: initial buggy fix.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix high CPU occupancy during journal · c4dc2497
      Tang Junhui authored
      After running small-write I/O for a long time, we found CPU occupancy
      was very high and I/O performance had dropped by about half:
      
      [root@ceph151 internal]# top
      top - 15:51:05 up 1 day,2:43,  4 users,  load average: 16.89, 15.15, 16.53
      Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
      %Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
      KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
      KiB Swap: 65667068 total, 65667068 free,        0 used. 25136812 avail Mem
      
        PID USER PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
       2023 root 20  0       0      0      0 S 55.1  0.0   0:04.42 kworker/11:191
      14126 root 20  0       0      0      0 S 42.9  0.0   0:08.72 kworker/10:3
       9292 root 20  0       0      0      0 S 30.4  0.0   1:10.99 kworker/6:1
       8553 ceph 20  0 4242492 1.805g  18804 S 30.0  2.9 410:07.04 ceph-osd
      12287 root 20  0       0      0      0 S 26.7  0.0   0:28.13 kworker/7:85
      31019 root 20  0       0      0      0 S 26.1  0.0   1:30.79 kworker/22:1
       1787 root 20  0       0      0      0 R 25.7  0.0   5:18.45 kworker/8:7
      32169 root 20  0       0      0      0 S 14.5  0.0   1:01.92 kworker/23:1
      21476 root 20  0       0      0      0 S 13.9  0.0   0:05.09 kworker/1:54
       2204 root 20  0       0      0      0 S 12.5  0.0   1:25.17 kworker/9:10
      16994 root 20  0       0      0      0 S 12.2  0.0   0:06.27 kworker/5:106
      15714 root 20  0       0      0      0 R 10.9  0.0   0:01.85 kworker/19:2
       9661 ceph 20  0 4246876 1.731g  18800 S 10.6  2.8 403:00.80 ceph-osd
      11460 ceph 20  0 4164692 2.206g  18876 S 10.6  3.5 360:27.19 ceph-osd
       9960 root 20  0       0      0      0 S 10.2  0.0   0:02.75 kworker/2:139
      11699 ceph 20  0 4169244 1.920g  18920 S 10.2  3.1 355:23.67 ceph-osd
       6843 ceph 20  0 4197632 1.810g  18900 S  9.6  2.9 380:08.30 ceph-osd
      
      The kernel workers consumed a lot of CPU, and I found they were
      running journal work: the journal was reclaiming resources and
      flushing btree nodes with surprising frequency.
      
      Through further analysis, we found that in btree_flush_write() we try
      to find the btree node with the smallest fifo index to flush by
      traversing all the btree nodes in c->bucket_hash.  After we get it,
      since no lock protects it, this btree node may already have been
      written to the cache device by other work; when that happens, we
      traverse c->bucket_hash again to get another btree node.  When the
      problem occurred, the retry count was very high, and we burned a lot
      of CPU looking for an appropriate btree node.
      
      In this patch, we record up to 128 btree nodes with the smallest fifo
      indexes in a heap, and pop them one by one when we need to flush a
      btree node.  This greatly reduces the time the loop spends finding an
      appropriate btree node, and also reduces CPU occupancy.
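
      The idea, in sketch form (sizes and names follow the description
      above; details differ from the real patch):

          struct btree *to_flush[128];    /* smallest-fifo-index candidates */
          size_t n = 0;

          /* fill 'to_flush' once, ordered by journal fifo index, then: */
          while (n)
                  flush_one(to_flush[--n]);  /* hypothetical flush helper */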
      
      [note by mpl: this triggers a checkpatch error because of adjacent,
      pre-existing style violations]
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: add journal statistic · a728eacb
      Tang Junhui authored
      Sometimes the journal takes up a lot of CPU, and we need statistics to
      know what the journal is doing.  So this patch provides some journal
      statistics:
      1) reclaim: how many times the journal tries to reclaim resources,
         usually because the journal buckets and/or the pins are exhausted.
      2) flush_write: how many times the journal tries to flush a btree node
         to the cache device, usually because the journal buckets are
         exhausted.
      3) retry_flush_write: how many times the journal retries flushing the
         next btree node, usually because the previous btree node has already
         been flushed by another thread.
      We expose these statistics through the sysfs interface.  With them, we
      can fully see the status of the journal module when CPU usage is too
      high.
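
      In sketch form, the counters and one update site:

          /* in struct cache_set */
          atomic_long_t reclaim;            /* journal reclaim attempts  */
          atomic_long_t flush_write;        /* btree-node flush attempts */
          atomic_long_t retry_flush_write;  /* flush retries after races */

          /* e.g. at the top of journal_reclaim() */
          atomic_long_inc(&c->reclaim);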
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>