1. 05 February 2019 (2 commits)
  2. 23 January 2019 (2 commits)
  3. 22 January 2019 (2 commits)
    • dm: fix redundant IO accounting for bios that need splitting · a1e1cb72
      Mike Snitzer authored
      The risk of redundant IO accounting was not taken into consideration
      when commit 18a25da8 ("dm: ensure bio submission follows a
      depth-first tree walk") introduced IO splitting in terms of recursion
      via generic_make_request().
      
      Fix this by subtracting the split bio's payload from the IO stats that
      were already accounted for by start_io_acct() upon dm_make_request()
entry.  This repeated oscillation of the IO accounting, up then down,
      isn't ideal but refactoring DM core's IO splitting to pre-split bios
      _before_ they are accounted turned out to be an excessive amount of
      change that will need a full development cycle to refine and verify.
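
      In sketch form, the fix looks like this (reconstructed from the
      description above; the __dm_part_stat_sub() helper name is an
      assumption and may not match the tree exactly):

        /*
         * In __split_and_process_bio(): the remainder of the bio will be
         * re-accounted by start_io_acct() when it re-enters
         * dm_make_request(), so subtract it from the stats that were
         * already bumped on entry.
         */
        if (current->bio_list && ci.sector_count && !error) {
                struct bio *b = bio_split(bio,
                                          bio_sectors(bio) - ci.sector_count,
                                          GFP_NOIO, &md->queue->bio_split);
                ci.io->orig_bio = b;

                part_stat_lock();
                __dm_part_stat_sub(&dm_disk(md)->part0,
                                   sectors[op_stat_group(bio_op(bio))],
                                   ci.sector_count);
                part_stat_unlock();

                bio_chain(b, bio);
                ret = generic_make_request(bio); /* recurse for the rest */
        }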
      
      Before this fix:
      
        /dev/mapper/stripe_dev is a 4-way stripe using a 32k chunksize, so
        bios are split on 32k boundaries.
      
        # fio --name=16M --filename=/dev/mapper/stripe_dev --rw=write --bs=64k --size=16M \
          	--iodepth=1 --ioengine=libaio --direct=1 --refill_buffers
      
        with debugging added:
        [103898.310264] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=0 len=128
        [103898.318704] device-mapper: core: __split_and_process_bio: recursing for following split bio:
        [103898.329136] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=64 len=64
        ...
      
        16M written yet 136M (278528 * 512b) accounted:
        # cat /sys/block/dm-2/stat | awk '{ print $7 }'
        278528
      
      After this fix:
      
        16M written and 16M (32768 * 512b) accounted:
        # cat /sys/block/dm-2/stat | awk '{ print $7 }'
        32768
      
      Fixes: 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk")
      Cc: stable@vger.kernel.org # 4.16+
      Reported-by: Bryan Gurney <bgurney@redhat.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: fix clone_bio() to trigger blk_recount_segments() · 57c36519
      Mike Snitzer authored
      DM's clone_bio() now benefits from using bio_trim(): clone_bio()
      wasn't clearing BIO_SEG_VALID the way bio_trim() does, and clearing
      that flag is what triggers blk_recount_segments() via
      bio_phys_segments().
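
      A minimal sketch of the reworked clone_bio(), with the integrity
      handling omitted for brevity:

        static int clone_bio(struct dm_target_io *tio, struct bio *bio,
                             sector_t sector, unsigned len)
        {
                struct bio *clone = &tio->clone;

                __bio_clone_fast(clone, bio);

                /*
                 * bio_trim() clears BIO_SEG_VALID, so the segment count
                 * is recalculated by the next bio_phys_segments() call.
                 */
                bio_trim(clone, sector - clone->bi_iter.bi_sector, len);

                return 0;
        }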
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  4. 16 January 2019 (1 commit)
    • dm thin: fix passdown_double_checking_shared_status() · d445bd9c
      Joe Thornber authored
      Commit 00a0ea33 ("dm thin: do not queue freed thin mapping for next
      stage processing") changed process_prepared_discard_passdown_pt1() to
      increment all the blocks being discarded until after the passdown had
      completed to avoid them being prematurely reused.
      
      IO issued to a thin device that breaks sharing with a snapshot, followed
      by a discard issued to snapshot(s) that previously shared the block(s),
      results in passdown_double_checking_shared_status() being called to
      iterate through the blocks, double checking that their reference
      count is zero, and issuing the passdown if so.  A side effect of
      commit 00a0ea33 is that passdown_double_checking_shared_status()
      was broken: since the blocks now hold an extra reference for the
      duration of the passdown, the zero check never succeeds.
      
      Fix this by checking if the block reference count is greater than 1.
      Also, rename dm_pool_block_is_used() to dm_pool_block_is_shared().
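
      A sketch of the renamed helper (simplified from the thin-pool
      metadata code):

        int dm_pool_block_is_shared(struct dm_pool_metadata *pmd,
                                    dm_block_t b, bool *result)
        {
                int r;
                uint32_t ref_count;

                down_read(&pmd->root_lock);
                r = dm_sm_get_count(pmd->data_sm, b, &ref_count);
                if (!r)
                        *result = (ref_count > 1); /* was: ref_count != 0 */
                up_read(&pmd->root_lock);

                return r;
        }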
      
      Fixes: 00a0ea33 ("dm thin: do not queue freed thin mapping for next stage processing")
      Cc: stable@vger.kernel.org # 4.9+
      Reported-by: ryan.p.norwood@gmail.com
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  5. 14 January 2019 (1 commit)
  6. 11 January 2019 (1 commit)
    • dm crypt: fix parsing of extended IV arguments · 1856b9f7
      Milan Broz authored
      The dm-crypt cipher specification in a mapping table is defined as:
        cipher[:keycount]-chainmode-ivmode[:ivopts]
      or (new crypt API format):
        capi:cipher_api_spec-ivmode[:ivopts]
      
      For ESSIV, the parameter includes a hash specification, for example:
      aes-cbc-essiv:sha256
      
      The implementation expected that the additional IV option would
      never include another dash '-' character.

      But with SHA3 there are names like sha3-256, so the mapping table
      parser fails:
      
      dmsetup create test --table "0 8 crypt aes-cbc-essiv:sha3-256 9c1185a5c5e9fc54612808977ee8f5b9e 0 /dev/sdb 0"
        or (new crypt API format)
      dmsetup create test --table "0 8 crypt capi:cbc(aes)-essiv:sha3-256 9c1185a5c5e9fc54612808977ee8f5b9e 0 /dev/sdb 0"
      
        device-mapper: crypt: Ignoring unexpected additional cipher options
        device-mapper: table: 253:0: crypt: Error creating IV
        device-mapper: ioctl: error adding target to table
      
      Fix the dm-crypt constructor to ignore additional dash in IV options and
      also remove a bogus warning (that is ignored anyway).
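
      The parsing idea, in sketch form (simplified; the real constructor
      handles both table formats):

        /*
         * Split ivmode from ivopts on the first ':' only and keep the
         * rest intact, so an ivopts such as "sha3-256" may itself
         * contain dashes.
         */
        char *ivopts = strchr(ivmode, ':');

        if (ivopts)
                *ivopts++ = '\0'; /* ivmode = "essiv", ivopts = "sha3-256" */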
      
      Cc: stable@vger.kernel.org # 4.12+
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  7. 29 December 2018 (1 commit)
  8. 21 December 2018 (4 commits)
    • md: fix raid10 hang issue caused by barrier · e820d55c
      Guoqing Jiang authored
      When regular IO and resync IO happen at the same time, and the
      regular IO also needs to be split, tasks can hang due to the
      barrier.
      
      1. resync thread
      [ 1463.757205] INFO: task md1_resync:5215 blocked for more than 480 seconds.
      [ 1463.757207]       Not tainted 4.19.5-1-default #1
      [ 1463.757209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1463.757212] md1_resync      D    0  5215      2 0x80000000
      [ 1463.757216] Call Trace:
      [ 1463.757223]  ? __schedule+0x29a/0x880
      [ 1463.757231]  ? raise_barrier+0x8d/0x140 [raid10]
      [ 1463.757236]  schedule+0x78/0x110
      [ 1463.757243]  raise_barrier+0x8d/0x140 [raid10]
      [ 1463.757248]  ? wait_woken+0x80/0x80
      [ 1463.757257]  raid10_sync_request+0x1f6/0x1e30 [raid10]
      [ 1463.757265]  ? _raw_spin_unlock_irq+0x22/0x40
      [ 1463.757284]  ? is_mddev_idle+0x125/0x137 [md_mod]
      [ 1463.757302]  md_do_sync.cold.78+0x404/0x969 [md_mod]
      [ 1463.757311]  ? wait_woken+0x80/0x80
      [ 1463.757336]  ? md_rdev_init+0xb0/0xb0 [md_mod]
      [ 1463.757351]  md_thread+0xe9/0x140 [md_mod]
      [ 1463.757358]  ? _raw_spin_unlock_irqrestore+0x2e/0x60
      [ 1463.757364]  ? __kthread_parkme+0x4c/0x70
      [ 1463.757369]  kthread+0x112/0x130
      [ 1463.757374]  ? kthread_create_worker_on_cpu+0x40/0x40
      [ 1463.757380]  ret_from_fork+0x3a/0x50
      
      2. regular IO
      [ 1463.760679] INFO: task kworker/0:8:5367 blocked for more than 480 seconds.
      [ 1463.760683]       Not tainted 4.19.5-1-default #1
      [ 1463.760684] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1463.760687] kworker/0:8     D    0  5367      2 0x80000000
      [ 1463.760718] Workqueue: md submit_flushes [md_mod]
      [ 1463.760721] Call Trace:
      [ 1463.760731]  ? __schedule+0x29a/0x880
      [ 1463.760741]  ? wait_barrier+0xdd/0x170 [raid10]
      [ 1463.760746]  schedule+0x78/0x110
      [ 1463.760753]  wait_barrier+0xdd/0x170 [raid10]
      [ 1463.760761]  ? wait_woken+0x80/0x80
      [ 1463.760768]  raid10_write_request+0xf2/0x900 [raid10]
      [ 1463.760774]  ? wait_woken+0x80/0x80
      [ 1463.760778]  ? mempool_alloc+0x55/0x160
      [ 1463.760795]  ? md_write_start+0xa9/0x270 [md_mod]
      [ 1463.760801]  ? try_to_wake_up+0x44/0x470
      [ 1463.760810]  raid10_make_request+0xc1/0x120 [raid10]
      [ 1463.760816]  ? wait_woken+0x80/0x80
      [ 1463.760831]  md_handle_request+0x121/0x190 [md_mod]
      [ 1463.760851]  md_make_request+0x78/0x190 [md_mod]
      [ 1463.760860]  generic_make_request+0x1c6/0x470
      [ 1463.760870]  raid10_write_request+0x77a/0x900 [raid10]
      [ 1463.760875]  ? wait_woken+0x80/0x80
      [ 1463.760879]  ? mempool_alloc+0x55/0x160
      [ 1463.760895]  ? md_write_start+0xa9/0x270 [md_mod]
      [ 1463.760904]  raid10_make_request+0xc1/0x120 [raid10]
      [ 1463.760910]  ? wait_woken+0x80/0x80
      [ 1463.760926]  md_handle_request+0x121/0x190 [md_mod]
      [ 1463.760931]  ? _raw_spin_unlock_irq+0x22/0x40
      [ 1463.760936]  ? finish_task_switch+0x74/0x260
      [ 1463.760954]  submit_flushes+0x21/0x40 [md_mod]
      
      So the resync IO is waiting for the regular write IO to complete
      and decrease nr_pending (conf->barrier++ is called before waiting).
      The regular write IO splits off another bio after calling
      wait_barrier (which does nr_pending++); the split bio then goes
      through raid10_write_request -> wait_barrier again, so it has to
      wait for barrier to drop to zero, and the deadlock happens as
      follows.
      
      	resync io		regular io
      
      	raise_barrier
      				wait_barrier
      				generic_make_request
      				wait_barrier
      
      To resolve the issue, call allow_barrier to decrease nr_pending
      before generic_make_request, since the regular IO has not yet been
      issued to the underlying devices, and call wait_barrier again to
      ensure no internal IO is in flight.
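
      In sketch form (abridged from raid10_write_request(); the read path
      is handled the same way):

        if (r10_bio->sectors > max_sectors) {
                struct bio *split = bio_split(bio, max_sectors,
                                              GFP_NOIO, &conf->bio_split);
                bio_chain(split, bio);
                allow_barrier(conf);       /* nr_pending-- before recursing */
                generic_make_request(bio); /* remainder re-enters and waits
                                              at the barrier on its own */
                wait_barrier(conf);        /* nr_pending++ for this half */
                bio = split;
                r10_bio->master_bio = bio;
        }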
      
      Fixes: fc9977dd ("md/raid10: simplify the splitting of requests.")
      Reported-and-tested-by: Siniša Bandin <sinisa@4net.rs>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid10: refactor common wait code from regular read/write request · caea3c47
      Guoqing Jiang authored
      Both raid10_read_request and raid10_write_request share the same
      code at their beginning, so introduce regular_request_wait to clean
      up the code and call it from both request functions.
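
      A sketch of the extracted helper (reconstructed from the
      description, so details may differ from the tree):

        static void regular_request_wait(struct mddev *mddev,
                                         struct r10conf *conf,
                                         struct bio *bio, sector_t sectors)
        {
                /* the barrier/reshape wait formerly duplicated in both
                   request paths */
                wait_barrier(conf);
                while (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
                       bio->bi_iter.bi_sector < conf->reshape_progress &&
                       bio->bi_iter.bi_sector + sectors > conf->reshape_progress) {
                        raid10_log(conf->mddev, "wait reshape");
                        allow_barrier(conf);
                        wait_event(conf->wait_barrier,
                                   conf->reshape_progress <= bio->bi_iter.bi_sector ||
                                   conf->reshape_progress >= bio->bi_iter.bi_sector + sectors);
                        wait_barrier(conf);
                }
        }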
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: remove redundant condition check · 37b22c28
      Chengguang Xu authored
      mempool_destroy() can handle NULL pointer correctly,
      so there is no need to check NULL pointer before calling
      mempool_destroy().
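
      In sketch form (the pool name is illustrative, not the exact field
      that was changed):

        /* before: a redundant NULL check */
        if (mddev->some_pool)
                mempool_destroy(mddev->some_pool);

        /* after: mempool_destroy() already returns early on NULL */
        mempool_destroy(mddev->some_pool);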
      Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: remove set but not used variable 'bi_rdev' · f91389c8
      Yue Haibing authored
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      drivers/md/md.c: In function 'md_integrity_add_rdev':
      drivers/md/md.c:2149:24: warning:
       variable 'bi_rdev' set but not used [-Wunused-but-set-variable]
      
      It has not been used since commit
        1501efad ("md/raid: only permit hot-add of compatible integrity profiles")
      Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  9. 20 December 2018 (1 commit)
    • dm: don't reuse bio for flushes · dbe3ece1
      Jens Axboe authored
      DM currently has a statically allocated bio that it uses to issue empty
      flushes. It doesn't submit this bio, it just uses it for maintaining
      state while setting up clones. Multiple users can access this bio at the
      same time. This wasn't previously an issue, even if it was a bit iffy,
      but with the blkg associations it can become one.
      
      We set up the blkg association, then clone bios and submit, then
      remove the blkg association again. But since we can have multiple
      tasks doing
      this at the same time, against multiple blkg's, then we can either lose
      references to a blkg, or put it twice. The latter causes complaints on
      the percpu ref being <= 0 when released, and can cause use-after-free as
      well. Ming reports that xfstest generic/475 triggers this:
      
      ------------[ cut here ]------------
      percpu ref (blkg_release) <= 0 (0) after switching to atomic
      WARNING: CPU: 13 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x2c9/0x4a0
      
      Switch to just using an on-stack bio for this, and get rid of the
      embedded bio.
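
      A sketch of the change (abridged from the DM bio submission path):

        if (bio->bi_opf & REQ_PREFLUSH) {
                struct bio flush_bio;

                /*
                 * An on-stack bio is safe here: it is never submitted
                 * itself and is not referenced after this function
                 * returns; it only serves as the basis for the clones.
                 */
                bio_init(&flush_bio, NULL, 0);
                flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
                ci.bio = &flush_bio;
                ci.sector_count = 0;
                error = __send_empty_flush(&ci);
        }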
      
      Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device")
      Reported-by: Ming Lei <ming.lei@redhat.com>
      Tested-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 19 December 2018 (2 commits)
  11. 18 December 2018 (16 commits)
    • dm rq: cleanup leftover code from recently removed q->mq_ops branching · 34743bfd
      Mike Snitzer authored
      When commit 6a23e05c ("dm: remove legacy request-based IO path")
      removed some q->mq_ops branching from map_request(), it left in
      place a goto that was only needed if that branching (and the
      conditional 'r' assignment) existed.  Now that the branching is
      gone, map_request()'s goto can be removed too.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm verity: log the hash algorithm implementation · bbf6a566
      Eric Biggers authored
      Log the hash algorithm's driver name when a dm-verity target is created.
      This will help people determine whether the expected implementation is
      being used.  It can make an enormous difference; e.g., SHA-256 on ARM
      can be 8x faster with the crypto extensions than without.  It can also
      be useful to know if an implementation using an external crypto
      accelerator is being used instead of a software implementation.
      
      Example message:
      
      [   35.281945] device-mapper: verity: sha256 using implementation "sha256-ce"
      
      We've already found a similar message in fs/crypto/keyinfo.c to be
      very useful.
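
      One way to obtain the driver name, in sketch form (v->tfm being the
      target's hash transform):

        DMINFO("%s using implementation \"%s\"", v->alg_name,
               crypto_tfm_alg_driver_name(crypto_ahash_tfm(v->tfm)));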
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm crypt: log the encryption algorithm implementation · af331eba
      Eric Biggers authored
      Log the encryption algorithm's driver name when a dm-crypt target is
      created.  This will help people determine whether the expected
      implementation is being used.  In some cases we've seen people do
      benchmarks and reject using encryption for performance reasons, when in
      fact they used a much slower implementation than was possible on the
      hardware.  It can make an enormous difference; e.g., AES-XTS on ARM can
      be over 10x faster with the crypto extensions than without.  It can also
      be useful to know if an implementation using an external crypto
      accelerator is being used instead of a software implementation.
      
      Example message:
      
      [   29.307629] device-mapper: crypt: xts(aes) using implementation "xts-aes-ce"
      
      We've already found a similar message in fs/crypto/keyinfo.c to be
      very useful.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm integrity: fix spelling mistake in workqueue name · e8c2566f
      Colin Ian King authored
      Rename the workqueue from dm-intergrity-recalc to dm-integrity-recalc.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm flakey: Properly corrupt multi-page bios. · a00f5276
      Sweet Tea authored
      The flakey target is documented to be able to corrupt the Nth byte in
      a bio, but does not corrupt byte indices after the first biovec in the
      bio. Change the corrupting function to actually corrupt the Nth byte
      no matter in which biovec that index falls.
      
      A test device generating two-page bios, atop a flakey device configured
      to corrupt a byte index on the second page, verified both the failure
      to corrupt before this patch and the expected corruption after this
      change.
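
      A sketch of the corrected walk (function and parameter names are
      illustrative, and direct page addressing is assumed):

        static void corrupt_bio_data(struct bio *bio,
                                     unsigned int corrupt_byte,
                                     unsigned char corrupt_value)
        {
                struct bvec_iter iter;
                struct bio_vec bvec;
                unsigned int remaining = corrupt_byte;

                /* walk every biovec so the Nth byte can land past the
                   first one */
                bio_for_each_segment(bvec, bio, iter) {
                        if (remaining < bvec.bv_len) {
                                unsigned char *seg =
                                        page_address(bvec.bv_page) +
                                        bvec.bv_offset;
                                seg[remaining] = corrupt_value;
                                return;
                        }
                        remaining -= bvec.bv_len;
                }
        }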
      Signed-off-by: John Dorminy <jdorminy@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: Check for device sector overflow if CONFIG_LBDAF is not set · ef87bfc2
      Milan Broz authored
      A reference to a device in a device-mapper table contains an offset
      in sectors.

      If sector_t is a 32-bit integer (CONFIG_LBDAF is not set), several
      device-mapper targets can overflow this offset; the validity check
      is then performed on the wrong offset, and a wrong table is
      activated.
      
      See for example (on 32bit without CONFIG_LBDAF) this overflow:
      
        # dmsetup create test --table "0 2048 linear /dev/sdg 4294967297"
        # dmsetup table test
        0 2048 linear 8:96 1
      
      This patch adds an explicit overflow check for the case where the
      offset is stored in a sector_t.
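
      The added validation, in sketch form ('lc' stands in for a target's
      private data):

        unsigned long long tmp;
        char dummy;

        /*
         * Parse into a 64-bit value first, then refuse anything that
         * would be truncated when stored in a 32-bit sector_t.
         */
        if (sscanf(argv[1], "%llu%c", &tmp, &dummy) != 1 ||
            tmp != (sector_t)tmp) {
                ti->error = "Invalid device sector";
                return -EINVAL;
        }
        lc->start = (sector_t)tmp;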
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm crypt: use u64 instead of sector_t to store iv_offset · 8d683dcd
      AliOS system security authored
      The iv_offset in the mapping table of crypt target is a 64bit number when
      IV algorithm is plain64, plain64be, essiv or benbi. It will be assigned to
      iv_offset of struct crypt_config, cc_sector of struct convert_context and
      iv_sector of struct dm_crypt_request.  These structure members are
      defined as sector_t, but sector_t is a 32-bit type when CONFIG_LBDAF
      is not set on a 32-bit kernel.  In that situation sector_t is not
      big enough to store the 64-bit iv_offset.
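
      The type change, in sketch form (field lists abridged):

        struct crypt_config {
                u64 iv_offset;  /* was: sector_t iv_offset; */
        };

        struct convert_context {
                u64 cc_sector;  /* was: sector_t cc_sector; */
        };

        struct dm_crypt_request {
                u64 iv_sector;  /* was: sector_t iv_sector; */
        };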
      
      Here is a reproducer.
      Prepare test image and device (loop is automatically allocated by cryptsetup):
      
        # dd if=/dev/zero of=tst.img bs=1M count=1
        # echo "tst"|cryptsetup open --type plain -c aes-xts-plain64 \
        --skip 500000000000000000 tst.img test
      
      On a 32-bit system (using an IV offset value that overflows 32 bits;
      CONFIG_LBDAF is off), the table and the device checksum are wrong:
      
        # dmsetup table test --showkeys
        0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 3551657984 7:0 0
      
        # sha256sum /dev/mapper/test
        533e25c09176632b3794f35303488c4a8f3f965dffffa6ec2df347c168cb6c19 /dev/mapper/test
      
      On a 64-bit system (and on a 32-bit system with the patch), the table and checksum are now correct:
      
        # dmsetup table test --showkeys
        0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 500000000000000000 7:0 0
      
        # sha256sum /dev/mapper/test
        5d16160f9d5f8c33d8051e65fdb4f003cc31cd652b5abb08f03aa6fce0df75fc /dev/mapper/test
      Signed-off-by: AliOS system security <alios_sys_security@linux.alibaba.com>
      Tested-and-Reviewed-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm kcopyd: Fix bug causing workqueue stalls · d7e6b8df
      Nikos Tsironis authored
      When using kcopyd to run callbacks through dm_kcopyd_do_callback() or
      submitting copy jobs with a source size of 0, the jobs are pushed
      directly to the complete_jobs list, which could be under processing by
      the kcopyd thread. As a result, the kcopyd thread can continue running
      completed jobs indefinitely, without releasing the CPU, as long as
      someone keeps submitting new completed jobs through the aforementioned
      paths. Processing of work items, queued for execution on the same CPU as
      the currently running kcopyd thread, is thus stalled for excessive
      amounts of time, hurting performance.
      
      Running the following test, from the device mapper test suite [1],
      
        dmtest run --suite snapshot -n parallel_io_to_many_snaps_N
      
      , with 8 active snapshots, we get, in dmesg, messages like the
      following:
      
      [68899.948523] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 95s!
      [68899.949282] Showing busy workqueues and worker pools:
      [68899.949288] workqueue events: flags=0x0
      [68899.949295]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
      [68899.949306]     pending: vmstat_shepherd, cache_reap
      [68899.949331] workqueue mm_percpu_wq: flags=0x8
      [68899.949337]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949345]     pending: vmstat_update
      [68899.949387] workqueue dm_bufio_cache: flags=0x8
      [68899.949392]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
      [68899.949400]     pending: work_fn [dm_bufio]
      [68899.949423] workqueue kcopyd: flags=0x8
      [68899.949429]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949437]     pending: do_work [dm_mod]
      [68899.949452] workqueue kcopyd: flags=0x8
      [68899.949458]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
      [68899.949466]     in-flight: 13:do_work [dm_mod]
      [68899.949474]     pending: do_work [dm_mod]
      [68899.949487] workqueue kcopyd: flags=0x8
      [68899.949493]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949501]     pending: do_work [dm_mod]
      [68899.949515] workqueue kcopyd: flags=0x8
      [68899.949521]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949529]     pending: do_work [dm_mod]
      [68899.949541] workqueue kcopyd: flags=0x8
      [68899.949547]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949555]     pending: do_work [dm_mod]
      [68899.949568] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=95s workers=4 idle: 27130 27223 1084
      
      Fix this by splitting the complete_jobs list into two parts: A user
      facing part, named callback_jobs, and one used internally by kcopyd,
      retaining the name complete_jobs. dm_kcopyd_do_callback() and
      dispatch_job() now push their jobs to the callback_jobs list, which is
      spliced to the complete_jobs list once, every time the kcopyd thread
      wakes up. This prevents kcopyd from hogging the CPU indefinitely and
      causing workqueue stalls.
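
      The kcopyd thread side, in sketch form (simplified from
      drivers/md/dm-kcopyd.c):

        static void do_work(struct work_struct *work)
        {
                struct dm_kcopyd_client *kc = container_of(work,
                                struct dm_kcopyd_client, kcopyd_work);
                struct blk_plug plug;
                unsigned long flags;

                /*
                 * Splice the user-facing callback_jobs list in exactly
                 * once per wakeup; jobs submitted after this point wait
                 * for the next wakeup, so do_work() cannot run forever.
                 */
                spin_lock_irqsave(&kc->job_lock, flags);
                list_splice_tail_init(&kc->callback_jobs, &kc->complete_jobs);
                spin_unlock_irqrestore(&kc->job_lock, flags);

                blk_start_plug(&plug);
                process_jobs(&kc->complete_jobs, kc, run_complete_job);
                process_jobs(&kc->pages_jobs, kc, run_pages_job);
                process_jobs(&kc->io_jobs, kc, run_io_job);
                blk_finish_plug(&plug);
        }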
      
      Re-running the aforementioned test:
      
        * Workqueue stalls are eliminated
        * The maximum writing time among all targets is reduced from 09m37.10s
          to 06m04.85s and the total run time of the test is reduced from
          10m43.591s to 7m19.199s
      
      [1] https://github.com/jthornber/device-mapper-test-suite
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm snapshot: Fix excessive memory usage and workqueue stalls · 721b1d98
      Nikos Tsironis authored
      kcopyd has no upper limit to the number of jobs one can allocate and
      issue. Under certain workloads this can lead to excessive memory usage
      and workqueue stalls. For example, when creating multiple dm-snapshot
      targets with a 4K chunk size and then writing to the origin through the
      page cache. Syncing the page cache causes a large number of BIOs to be
      issued to the dm-snapshot origin target, which itself issues an even
      larger (because of the BIO splitting taking place) number of kcopyd
      jobs.
      
      Running the following test, from the device mapper test suite [1],
      
        dmtest run --suite snapshot -n many_snapshots_of_same_volume_N
      
      , with 8 active snapshots, results in the kcopyd job slab cache growing
      to 10G. Depending on the available system RAM this can lead to the OOM
      killer killing user processes:
      
      [463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP),
                    nodemask=(null), order=1, oom_score_adj=0
      [463.492894] kthreadd cpuset=/ mems_allowed=0
      [463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
      [463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      [463.492952] Call Trace:
      [463.492964]  dump_stack+0x7d/0xbb
      [463.492973]  dump_header+0x6b/0x2fc
      [463.492987]  ? lockdep_hardirqs_on+0xee/0x190
      [463.493012]  oom_kill_process+0x302/0x370
      [463.493021]  out_of_memory+0x113/0x560
      [463.493030]  __alloc_pages_slowpath+0xf40/0x1020
      [463.493055]  __alloc_pages_nodemask+0x348/0x3c0
      [463.493067]  cache_grow_begin+0x81/0x8b0
      [463.493072]  ? cache_grow_begin+0x874/0x8b0
      [463.493078]  fallback_alloc+0x1e4/0x280
      [463.493092]  kmem_cache_alloc_node+0xd6/0x370
      [463.493098]  ? copy_process.part.31+0x1c5/0x20d0
      [463.493105]  copy_process.part.31+0x1c5/0x20d0
      [463.493115]  ? __lock_acquire+0x3cc/0x1550
      [463.493121]  ? __switch_to_asm+0x34/0x70
      [463.493129]  ? kthread_create_worker_on_cpu+0x70/0x70
      [463.493135]  ? finish_task_switch+0x90/0x280
      [463.493165]  _do_fork+0xe0/0x6d0
      [463.493191]  ? kthreadd+0x19f/0x220
      [463.493233]  kernel_thread+0x25/0x30
      [463.493235]  kthreadd+0x1bf/0x220
      [463.493242]  ? kthread_create_on_cpu+0x90/0x90
      [463.493248]  ret_from_fork+0x3a/0x50
      [463.493279] Mem-Info:
      [463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
      [463.493285]  active_file:80216 inactive_file:80107 isolated_file:435
      [463.493285]  unevictable:0 dirty:51266 writeback:109372 unstable:0
      [463.493285]  slab_reclaimable:31191 slab_unreclaimable:3483521
      [463.493285]  mapped:526 shmem:4903 pagetables:1759 bounce:0
      [463.493285]  free:33623 free_pcp:2392 free_cma:0
      ...
      [463.493489] Unreclaimable slab info:
      [463.493513] Name                      Used          Total
      [463.493522] bio-6                   1028KB       1028KB
      [463.493525] bio-5                   1028KB       1028KB
      [463.493528] dm_snap_pending_exception     236783KB     243789KB
      [463.493531] dm_exception              41KB         42KB
      [463.493534] bio-4                   1216KB       1216KB
      [463.493537] bio-3                 439396KB     439396KB
      [463.493539] kcopyd_job           6973427KB    6973427KB
      ...
      [463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
      [463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
      [463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      Moreover, issuing a large number of kcopyd jobs results in kcopyd
      hogging the CPU while processing them. As a result, processing of work
      items, queued for execution on the same CPU as the currently running
      kcopyd thread, is stalled for long periods of time, hurting performance.
      Running the aforementioned test we get, in dmesg, messages like the
      following:
      
      [67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
      [67501.195586] Showing busy workqueues and worker pools:
      [67501.195591] workqueue events: flags=0x0
      [67501.195597]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195611]     pending: cache_reap
      [67501.195641] workqueue mm_percpu_wq: flags=0x8
      [67501.195645]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195656]     pending: vmstat_update
      [67501.195682] workqueue kblockd: flags=0x18
      [67501.195687]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
      [67501.195698]     pending: blk_timeout_work
      [67501.195753] workqueue kcopyd: flags=0x8
      [67501.195757]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195768]     pending: do_work [dm_mod]
      [67501.195802] workqueue kcopyd: flags=0x8
      [67501.195806]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195817]     pending: do_work [dm_mod]
      [67501.195834] workqueue kcopyd: flags=0x8
      [67501.195838]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195848]     pending: do_work [dm_mod]
      [67501.195881] workqueue kcopyd: flags=0x8
      [67501.195885]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195896]     pending: do_work [dm_mod]
      [67501.195920] workqueue kcopyd: flags=0x8
      [67501.195924]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
      [67501.195935]     in-flight: 67:do_work [dm_mod]
      [67501.195945]     pending: do_work [dm_mod]
      [67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765
      
      The root cause for these issues is the way dm-snapshot uses kcopyd. In
      particular, the lack of an explicit or implicit limit to the maximum
      number of in-flight COW jobs. The merging path is not affected because
      it implicitly limits the in-flight kcopyd jobs to one.
      
      Fix these issues by using a semaphore to limit the maximum number of
      in-flight kcopyd jobs. We grab the semaphore before allocating a new
      kcopyd job in start_copy() and start_full_bio() and release it after the
      job finishes in copy_callback().
      
      The initial semaphore value is configurable through a module parameter,
      to allow fine tuning the maximum number of in-flight COW jobs. Setting
      this parameter to zero initializes the semaphore to INT_MAX.
      
      A default value of 2048 maximum in-flight kcopyd jobs was chosen. This
      value was decided experimentally as a trade-off between memory
      consumption, stalling the kernel's workqueues and maintaining a high
      enough throughput.
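
      The throttle, in sketch form (the cow_count semaphore member follows
      the description; I/O region setup is omitted):

        static void copy_callback(int read_err, unsigned long write_err,
                                  void *context)
        {
                struct dm_snap_pending_exception *pe = context;

                up(&pe->snap->cow_count); /* release the in-flight slot */
                /* exception completion continues as before */
        }

        static void start_copy(struct dm_snap_pending_exception *pe)
        {
                struct dm_snapshot *s = pe->snap;
                struct dm_io_region src, dest;

                /* src/dest are filled in from the chunk addresses
                   (omitted) */

                down(&s->cow_count); /* blocks once the configured number
                                        of jobs is already in flight */
                dm_kcopyd_copy(s->kcopyd_client, &src, 1, &dest, 0,
                               copy_callback, pe);
        }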
      
      Re-running the aforementioned test:
      
        * Workqueue stalls are eliminated
        * kcopyd's job slab cache uses a maximum of 130MB
        * The time taken by the test to write to the snapshot-origin target is
          reduced from 05m20.48s to 03m26.38s
      
      [1] https://github.com/jthornber/device-mapper-test-suite
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm bufio: update comment in dm-bufio.c · ef992373
      Shenghui Wang authored
      * The hashtable has been replaced by an rbtree to manage buffers;
        update the comment accordingly.
      * Fix a typo in the comment for dm_bufio_issue_flush.
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm writecache: fix typo in error msg for creating writecache_flush_thread · e8ea141a
      Shenghui Wang authored
      The error message should say "flush thread" rather than "endio
      thread" when creating writecache_flush_thread fails.
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: remove indirect calls from __send_changing_extent_only() · 53b47168
      Mike Snitzer authored
      No need to be so fancy.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm mpath: only flush workqueue when needed · 935fcc56
      wuzhouhui authored
      The workqueues are shared by many multipath devices; only flush a
      whole workqueue when necessary, and otherwise just flush the
      individual work items as needed.
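
      A sketch of the resulting flush path (simplified):

        static void flush_multipath_work(struct multipath *m)
        {
                if (m->hw_handler_name) {
                        /* only devices with a hardware handler ever queue
                           work on the shared kmpath_handlerd workqueue */
                        set_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
                        smp_mb__after_atomic();

                        flush_workqueue(kmpath_handlerd);
                        multipath_wait_for_pg_init_completion(m);

                        clear_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
                        smp_mb__after_atomic();
                }

                /* flush only this device's work item, not a whole shared
                   workqueue */
                flush_work(&m->trigger_event);
        }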
      Signed-off-by: wuzhouhui <wuzhouhui14@mails.ucas.ac.cn>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm rq: remove unused arguments from rq_completed() · 2adc5c55
      Mike Snitzer authored
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: avoid indirect call in __dm_make_request · 24113d48
      Mikulas Patocka authored
      Indirect calls are inefficient because of the retpolines used as a
      Spectre workaround.  This patch replaces an indirect call with a
      condition (which can be predicted by the branch predictor).
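
      In sketch form (simplified):

        /* a direct, predictable branch instead of a function pointer */
        if (dm_get_md_type(md) == DM_TYPE_NVME_BIO_BASED)
                ret = __process_bio(md, map, bio);
        else
                ret = __split_and_process_bio(md, map, bio);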
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight() · 3c94d83c
      Jens Axboe authored
      There's a single user of this function, dm, and dm just wants to
      check whether IO is in flight, not whether requests are merely
      allocated.

      This fixes a hang with srp/002 in blktests with dm, where it tries
      to suspend but waits for in-flight IO to finish first.  As it was
      checking for merely allocated requests, this failed.
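
      The new check, in sketch form (the iterator callback signature is an
      assumption based on the blk-mq API of that era):

        static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx,
                                       struct request *rq,
                                       void *priv, bool reserved)
        {
                bool *busy = priv;

                /* count only requests actually in flight, not merely
                   allocated; one match is enough, so stop iterating */
                if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) {
                        *busy = true;
                        return false;
                }

                return true;
        }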
      Tested-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 13 December 2018 (7 commits)
    • bcache: print number of keys in trace_bcache_journal_write · e78bd0d2
      Guoju Fang authored
      Journal flushes can sometimes be very frequent, so it is useful to
      dump the number of keys every time the journal is written.
      Signed-off-by: Guoju Fang <fangguoju@gmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: set writeback_percent in a flexible range · cc38ca7e
      Coly Li authored
      Because CUTOFF_WRITEBACK is defined as 40, before the change to a
      dynamic cutoff writeback value, writeback_percent was limited to
      [0, CUTOFF_WRITEBACK]; any larger value was fixed up to 40.

      Now the cutoff writeback limit is the dynamic value
      bch_cutoff_writeback, so writeback_percent can take the more
      flexible range [0, bch_cutoff_writeback].  That is, the range can
      expand to be larger, or shrink to be smaller, than [0, 40],
      depending on how bch_cutoff_writeback is specified.

      The default value is still strongly recommended for most users and
      most workloads, but people who want to research bcache writeback
      performance tuning now have the chance to specify a more flexible
      writeback_percent in the range [0, 70].
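
      The sysfs store path, in sketch form (sysfs_strtoul_clamped is
      bcache's own sysfs helper):

        /* clamp to the dynamic cutoff instead of a hard-coded 40 */
        sysfs_strtoul_clamped(writeback_percent, dc->writeback_percent,
                              0, bch_cutoff_writeback);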
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: make cutoff_writeback and cutoff_writeback_sync tunable · 9aaf5165
      Coly Li authored
      Currently the cutoff writeback and cutoff writeback sync thresholds
      are defined by CUTOFF_WRITEBACK (40) and CUTOFF_WRITEBACK_SYNC (70)
      as static values.  Most of the time they work fine, but when people
      want to research bcache writeback-mode performance tuning, there is
      no way to modify the soft and hard cutoff writeback values.

      This patch introduces two module parameters, bch_cutoff_writeback_sync
      and bch_cutoff_writeback, which permit people to tune the values
      when loading bcache.ko.  If they are not specified at module load
      time, the current values CUTOFF_WRITEBACK_SYNC and CUTOFF_WRITEBACK
      are used as defaults and nothing changes.
      
      When people want to tune these two values:
      - cutoff_writeback can be set in range [1, 70]
      - cutoff_writeback_sync can be set in range [1, 90]
      - cutoff_writeback always <= cutoff_writeback_sync
      
      The default values are strongly recommended for most users and most
      workloads.  Still, people willing to take the risk of researching
      new writeback cutoff tuning for their own workload can now do so.
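
      A sketch of the parameter wiring (the range clamping done at module
      init is abridged):

        static unsigned int bch_cutoff_writeback;
        static unsigned int bch_cutoff_writeback_sync;

        module_param_named(cutoff_writeback, bch_cutoff_writeback, uint, 0);
        MODULE_PARM_DESC(cutoff_writeback, "threshold to cutoff writeback");

        module_param_named(cutoff_writeback_sync, bch_cutoff_writeback_sync,
                           uint, 0);
        MODULE_PARM_DESC(cutoff_writeback_sync,
                         "hard threshold to cutoff writeback");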
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: add MODULE_DESCRIPTION information · 009673d0
      Coly Li authored
      This patch moves MODULE_AUTHOR and MODULE_LICENSE to the end of
      super.c, and adds MODULE_DESCRIPTION("Bcache: a Linux block layer cache").
      
      This is preparation for adding module parameters.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: option to automatically run gc thread after writeback · 7a671d8e
      Coly Li authored
      The option gc_after_writeback is disabled by default, because
      garbage collection discards SSD data, which drops cached data.

      Echoing 1 into /sys/fs/bcache/<UUID>/internal/gc_after_writeback
      enables this option, which wakes up the gc thread when writeback
      has finished and all cached data is clean.

      This option is helpful for people who care more about write
      performance.  Under a heavy write workload, all cached data becomes
      clean only when the writeback thread cleans it during I/O idle
      time.  In that situation a following gc run may help to shrink the
      bcache B+ tree and discard more clean data, which may be helpful
      for future write requests.

      If you are not sure whether this helps your own workload, please
      leave it disabled, which is the default.
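
      The trigger, in sketch form (the BCH_* flag names are assumptions
      based on this description):

        struct cache_set *c = dc->disk.c; /* dc: the cached_dev whose
                                             writeback just drained */

        /* writeback found nothing dirty left: hand over to gc once */
        if (c->gc_after_writeback == (BCH_ENABLE_AUTO_GC | BCH_DO_AUTO_GC)) {
                c->gc_after_writeback &= ~BCH_DO_AUTO_GC;
                force_wake_up_gc(c); /* see the following commit */
        }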
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: introduce force_wake_up_gc() · cb07ad63
      Coly Li authored
      The garbage collection thread only starts to work when
      c->sectors_to_gc is a negative value; otherwise nothing happens,
      even if the gc thread is woken up by wake_up_gc().

      force_wake_up_gc() sets c->sectors_to_gc to -1 before calling
      wake_up_gc(), so the gc thread has a chance to run, provided no one
      else sets c->sectors_to_gc to a positive value before
      gc_should_run().

      This routine can be called wherever the gc thread needs to be woken
      up and forced to run.
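
      In sketch form (this mirrors the description directly):

        static inline void force_wake_up_gc(struct cache_set *c)
        {
                /*
                 * The gc thread only proceeds when sectors_to_gc < 0, so
                 * force it negative before waking the thread up.
                 */
                atomic_set(&c->sectors_to_gc, -1);
                wake_up_gc(c);
        }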
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: cannot set writeback_running via sysfs if no writeback kthread created · f383ae30
      Shenghui Wang authored
      "echo 1 > writeback_running" marks writeback_running even if no
      writeback kthread created as "d_strtoul(writeback_running)" will simply
      set dc-> writeback_running without checking the existence of
      dc->writeback_thread.
      
      Add check for setting writeback_running via sysfs: if no writeback
      kthread available, reject setting to 1.
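
      The sysfs check, in sketch form (strtoul_or_return and d_strtoul are
      bcache's sysfs helpers; the message wording is illustrative):

        if (attr == &sysfs_writeback_running) {
                unsigned long v;

                if (IS_ERR_OR_NULL(dc->writeback_thread)) {
                        v = strtoul_or_return(buf);
                        if (v)
                                pr_err("%s: no writeback thread to run",
                                       dc->disk.disk->disk_name);
                } else
                        d_strtoul(writeback_running);
        }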
      
      v2 -> v3:
        * Make message on wrong assignment more clear.
        * Print name of bcache device instead of name of backing device.
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>