1. 02 2月, 2018 1 次提交
    • J
      blk-mq: fix discard merge with scheduler attached · 445251d0
      Jens Axboe 提交于
      I ran into an issue on my laptop that triggered a bug on the
      discard path:
      
      WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3/0x430
       Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd intel_gtt
       CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G     U           4.15.0+ #176
       Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017
       RIP: 0010:nvme_setup_cmd+0x3d3/0x430
       RSP: 0018:ffff880423e9f838 EFLAGS: 00010217
       RAX: 0000000000000000 RBX: ffff880423e9f8c8 RCX: 0000000000010000
       RDX: ffff88022b200010 RSI: 0000000000000002 RDI: 00000000327f0000
       RBP: ffff880421251400 R08: ffff88022b200000 R09: 0000000000000009
       R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000ffff
       R13: ffff88042341e280 R14: 000000000000ffff R15: ffff880421251440
       FS:  0000000000000000(0000) GS:ffff880441500000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055b684795030 CR3: 0000000002e09006 CR4: 00000000001606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        nvme_queue_rq+0x40/0xa00
        ? __sbitmap_queue_get+0x24/0x90
        ? blk_mq_get_tag+0xa3/0x250
        ? wait_woken+0x80/0x80
        ? blk_mq_get_driver_tag+0x97/0xf0
        blk_mq_dispatch_rq_list+0x7b/0x4a0
        ? deadline_remove_request+0x49/0xb0
        blk_mq_do_dispatch_sched+0x4f/0xc0
        blk_mq_sched_dispatch_requests+0x106/0x170
        __blk_mq_run_hw_queue+0x53/0xa0
        __blk_mq_delay_run_hw_queue+0x83/0xa0
        blk_mq_run_hw_queue+0x6c/0xd0
        blk_mq_sched_insert_request+0x96/0x140
        __blk_mq_try_issue_directly+0x3d/0x190
        blk_mq_try_issue_directly+0x30/0x70
        blk_mq_make_request+0x1a4/0x6a0
        generic_make_request+0xfd/0x2f0
        ? submit_bio+0x5c/0x110
        submit_bio+0x5c/0x110
        ? __blkdev_issue_discard+0x152/0x200
        submit_bio_wait+0x43/0x60
        ext4_process_freed_data+0x1cd/0x440
        ? account_page_dirtied+0xe2/0x1a0
        ext4_journal_commit_callback+0x4a/0xc0
        jbd2_journal_commit_transaction+0x17e2/0x19e0
        ? kjournald2+0xb0/0x250
        kjournald2+0xb0/0x250
        ? wait_woken+0x80/0x80
        ? commit_timeout+0x10/0x10
        kthread+0x111/0x130
        ? kthread_create_worker_on_cpu+0x50/0x50
        ? do_group_exit+0x3a/0xa0
        ret_from_fork+0x1f/0x30
       Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff
       ---[ end trace 50d361cc444506c8 ]---
       print_req_error: I/O error, dev nvme0n1, sector 847167488
      
      Decoding the assembly, the request claims to have 0xffff segments,
      while nvme counts two. This turns out to be because we don't check
      for a data carrying request on the mq scheduler path, and since
      blk_phys_contig_segment() returns true for a non-data request,
      we decrement the initial segment count of 0 and end up with
      0xffff in the unsigned short.
      
      There are a few issues here:
      
      1) We should initialize the segment count for a discard to 1.
      2) The discard merging is currently using the data limits for
         segments and sectors.
      
      Fix this up by having attempt_merge() correctly identify the
      request, and by initializing the segment count correctly
      for discards.
      
      This can only be triggered with mq-deadline on discard capable
      devices right now, which isn't a common configuration.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      445251d0
  2. 31 1月, 2018 1 次提交
    • M
      blk-mq: introduce BLK_STS_DEV_RESOURCE · 86ff7c2a
      Ming Lei 提交于
      This status is returned from driver to block layer if device related
      resource is unavailable, but driver can guarantee that IO dispatch
      will be triggered in future when the resource is available.
      
      Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if driver
      returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
      a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.  BLK_MQ_DELAY_QUEUE is
      3 ms because both scsi-mq and nvmefc are using that magic value.
      
      If a driver can make sure there is in-flight IO, it is safe to return
      BLK_STS_DEV_RESOURCE because:
      
      1) If all in-flight IOs complete before examining SCHED_RESTART in
      blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
      is run immediately in this case by blk_mq_dispatch_rq_list();
      
      2) if there is any in-flight IO after/when examining SCHED_RESTART
      in blk_mq_dispatch_rq_list():
      - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
      - otherwise, this request will be dispatched after any in-flight IO is
        completed via blk_mq_sched_restart()
      
      3) if SCHED_RESTART is set concurently in context because of
      BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
      cases and make sure IO hang can be avoided.
      
      One invariant is that queue will be rerun if SCHED_RESTART is set.
      Suggested-by: NJens Axboe <axboe@kernel.dk>
      Tested-by: NLaurence Oberman <loberman@redhat.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      86ff7c2a
  3. 20 1月, 2018 2 次提交
  4. 19 1月, 2018 1 次提交
  5. 18 1月, 2018 1 次提交
    • M
      blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback · 396eaf21
      Ming Lei 提交于
      blk_insert_cloned_request() is called in the fast path of a dm-rq driver
      (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
      blk_mq_request_bypass_insert() to directly append the request to the
      blk-mq hctx->dispatch_list of the underlying queue.
      
      1) This way isn't efficient enough because the hctx spinlock is always
      used.
      
      2) With blk_insert_cloned_request(), we completely bypass underlying
      queue's elevator and depend on the upper-level dm-rq driver's elevator
      to schedule IO.  But dm-rq currently can't get the underlying queue's
      dispatch feedback at all.  Without knowing whether a request was issued
      or not (e.g. due to underlying queue being busy) the dm-rq elevator will
      not be able to provide effective IO merging (as a side-effect of dm-rq
      currently blindly destaging a request from its elevator only to requeue
      it after a delay, which kills any opportunity for merging).  This
      obviously causes very bad sequential IO performance.
      
      Fix this by updating blk_insert_cloned_request() to use
      blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
      request to be issued directly to the underlying queue and returns the
      dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
      returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
      to _not_ destage the request.  Whereby preserving the opportunity to
      merge IO.
      
      With this, request-based DM's blk-mq sequential IO performance is vastly
      improved (as much as 3X in mpath/virtio-scsi testing).
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
      they were refactored to make them less fragile and easier to read/review]
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      396eaf21
  6. 11 1月, 2018 1 次提交
  7. 10 1月, 2018 1 次提交
    • T
      blk-mq: replace timeout synchronization with a RCU and generation based scheme · 1d9bd516
      Tejun Heo 提交于
      Currently, blk-mq timeout path synchronizes against the usual
      issue/completion path using a complex scheme involving atomic
      bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
      rules.  Unfortunately, it contains quite a few holes.
      
      There's a complex dancing around REQ_ATOM_STARTED and
      REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
      they don't have a synchronization point across request recycle
      instances and it isn't clear what the barriers add.
      blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
      deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
      
      In fact, it's pretty easy to make blk_mq_check_expired() terminate a
      later instance of a request.  If we induce 5 sec delay before
      time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
      2s, and issue back-to-back large IOs, blk-mq starts timing out
      requests spuriously pretty quickly.  Nothing actually timed out.  It
      just made the call on a recycle instance of a request and then
      terminated a later instance long after the original instance finished.
      The scenario isn't theoretical either.
      
      This patch replaces the broken synchronization mechanism with a RCU
      and generation number based one.
      
      1. Each request has a u64 generation + state value, which can be
         updated only by the request owner.  Whenever a request becomes
         in-flight, the generation number gets bumped up too.  This provides
         the basis for the timeout path to distinguish different recycle
         instances of the request.
      
         Also, marking a request in-flight and setting its deadline are
         protected with a seqcount so that the timeout path can fetch both
         values coherently.
      
      2. The timeout path fetches the generation, state and deadline.  If
         the verdict is timeout, it records the generation into a dedicated
         request abortion field and does RCU wait.
      
      3. The completion path is also protected by RCU (from the previous
         patch) and checks whether the current generation number and state
         match the abortion field.  If so, it skips completion.
      
      4. The timeout path, after RCU wait, scans requests again and
         terminates the ones whose generation and state still match the ones
         requested for abortion.
      
         By now, the timeout path knows that either the generation number
         and state changed if it lost the race or the completion will yield
         to it and can safely timeout the request.
      
      While it's more lines of code, it's conceptually simpler, doesn't
      depend on direct use of subtle memory ordering or coherence, and
      hopefully doesn't terminate the wrong instance.
      
      While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
      between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
      removed yet as it's still used in other places.  Future patches will
      move all state tracking to the new mechanism and remove all bitops in
      the hot paths.
      
      Note that this patch adds a comment explaining a race condition in
      BLK_EH_RESET_TIMER path.  The race has always been there and this
      patch doesn't change it.  It's just documenting the existing race.
      
      v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
          - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
          - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
      
      v3: - Fixed possible extended seqcount / u64_stats_sync read looping
            spotted by Peter.
          - MQ_RQ_IDLE was incorrectly being set in complete_request instead
            of free_request.  Fixed.
      
      v4: - Rebased on top of hctx_lock() refactoring patch.
          - Added comment explaining the use of hctx_lock() in completion path.
      
      v5: - Added comments requested by Bart.
          - Note the addition of BLK_EH_RESET_TIMER race condition in the
            commit message.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1d9bd516
  8. 07 1月, 2018 1 次提交
    • M
      blk-mq: quiesce queue before freeing queue · c2856ae2
      Ming Lei 提交于
      After queue is frozen, dispatch still may happen, for example:
      
      1) requests are submitted from several contexts
      2) requests from all these contexts are inserted to queue, but may dispatch
      to LLD in one of these paths, but other paths sill need to move on even all
      these requests are completed(that means blk_mq_freeze_queue_wait() returns
      at that time)
      3) dispatch after queue freezing still moves on and causes use-after-free,
      because request queue is freed
      
      This patch quiesces queue after it is frozen, and makes sure all
      in-progress dispatch are completed.
      
      This patch fixes the following kernel crash when running heavy IOs vs.
      deleting device:
      
      [   36.719251] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [   36.720318] IP: kyber_has_work+0x14/0x40
      [   36.720847] PGD 254bf5067 P4D 254bf5067 PUD 255e6a067 PMD 0
      [   36.721584] Oops: 0000 [#1] PREEMPT SMP
      [   36.722105] Dumping ftrace buffer:
      [   36.722570]    (ftrace buffer empty)
      [   36.723057] Modules linked in: scsi_debug ebtable_filter ebtables ip6table_filter ip6_tables tcm_loop iscsi_target_mod target_core_file target_core_iblock target_core_pscsi target_core_mod xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c bridge stp llc fuse iptable_filter ip_tables sd_mod sg btrfs xor zstd_decompress zstd_compress xxhash raid6_pq mptsas mptscsih bcache crc32c_intel ahci mptbase libahci serio_raw scsi_transport_sas nvme libata shpchp lpc_ich virtio_scsi nvme_core binfmt_misc dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi null_blk configs
      [   36.733438] CPU: 2 PID: 2374 Comm: fio Not tainted 4.15.0-rc2.blk_mq_quiesce+ #714
      [   36.735143] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.9.3-1.fc25 04/01/2014
      [   36.736688] RIP: 0010:kyber_has_work+0x14/0x40
      [   36.737515] RSP: 0018:ffffc9000209bca0 EFLAGS: 00010202
      [   36.738431] RAX: 0000000000000008 RBX: ffff88025578bfc8 RCX: ffff880257bf4ed0
      [   36.739581] RDX: 0000000000000038 RSI: ffffffff81a98c6d RDI: ffff88025578bfc8
      [   36.740730] RBP: ffff880253cebfc8 R08: ffffc9000209bda0 R09: ffff8802554f3480
      [   36.741885] R10: ffffc9000209be60 R11: ffff880263f72538 R12: ffff88025573e9e8
      [   36.743036] R13: ffff88025578bfd0 R14: 0000000000000001 R15: 0000000000000000
      [   36.744189] FS:  00007f9b9bee67c0(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
      [   36.746617] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   36.748483] CR2: 0000000000000008 CR3: 0000000254bf4001 CR4: 00000000003606e0
      [   36.750164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   36.751455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   36.752796] Call Trace:
      [   36.753992]  blk_mq_do_dispatch_sched+0x7f/0xe0
      [   36.755110]  blk_mq_sched_dispatch_requests+0x119/0x190
      [   36.756179]  __blk_mq_run_hw_queue+0x83/0x90
      [   36.757144]  __blk_mq_delay_run_hw_queue+0xaf/0x110
      [   36.758046]  blk_mq_run_hw_queue+0x24/0x70
      [   36.758845]  blk_mq_flush_plug_list+0x1e7/0x270
      [   36.759676]  blk_flush_plug_list+0xd6/0x240
      [   36.760463]  blk_finish_plug+0x27/0x40
      [   36.761195]  do_io_submit+0x19b/0x780
      [   36.761921]  ? entry_SYSCALL_64_fastpath+0x1a/0x7d
      [   36.762788]  entry_SYSCALL_64_fastpath+0x1a/0x7d
      [   36.763639] RIP: 0033:0x7f9b9699f697
      [   36.764352] RSP: 002b:00007ffc10f991b8 EFLAGS: 00000206 ORIG_RAX: 00000000000000d1
      [   36.765773] RAX: ffffffffffffffda RBX: 00000000008f6f00 RCX: 00007f9b9699f697
      [   36.766965] RDX: 0000000000a5e6c0 RSI: 0000000000000001 RDI: 00007f9b8462a000
      [   36.768377] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000008f6420
      [   36.769649] R10: 00007f9b846e5000 R11: 0000000000000206 R12: 00007f9b795d6a70
      [   36.770807] R13: 00007f9b795e4140 R14: 00007f9b795e3fe0 R15: 0000000100000000
      [   36.771955] Code: 83 c7 10 e9 3f 68 d1 ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 97 b0 00 00 00 48 8d 42 08 48 83 c2 38 <48> 3b 00 74 06 b8 01 00 00 00 c3 48 3b 40 08 75 f4 48 83 c0 10
      [   36.775004] RIP: kyber_has_work+0x14/0x40 RSP: ffffc9000209bca0
      [   36.776012] CR2: 0000000000000008
      [   36.776690] ---[ end trace 4045cbce364ff2a4 ]---
      [   36.777527] Kernel panic - not syncing: Fatal exception
      [   36.778526] Dumping ftrace buffer:
      [   36.779313]    (ftrace buffer empty)
      [   36.780081] Kernel Offset: disabled
      [   36.780877] ---[ end Kernel panic - not syncing: Fatal exception
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
      Tested-by: NYi Zhang <yi.zhang@redhat.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c2856ae2
  9. 06 1月, 2018 2 次提交
    • C
      block: introduce zoned block devices zone write locking · 6cc77e9c
      Christoph Hellwig 提交于
      Components relying only on the request_queue structure for accessing
      block devices (e.g. I/O schedulers) have a limited knowledged of the
      device characteristics. In particular, the device capacity cannot be
      easily discovered, which for a zoned block device also result in the
      inability to easily know the number of zones of the device (the zone
      size is indicated by the chunk_sectors field of the queue limits).
      
      Introduce the nr_zones field to the request_queue structure to simplify
      access to this information. Also, add the bitmap seq_zone_bitmap which
      indicates which zones of the device are sequential zones (write
      preferred or write required) and the bitmap seq_zones_wlock which
      indicates if a zone is write locked, that is, if a write request
      targeting a zone was dispatched to the device. These fields are
      initialized by the low level block device driver (sd.c for ZBC/ZAC
      disks). They are not initialized by stacking drivers (device mappers)
      handling zoned block devices (e.g. dm-linear).
      
      Using this, I/O schedulers can introduce zone write locking to control
      request dispatching to a zoned block device and avoid write request
      reordering by limiting to at most a single write request per zone
      outside of the scheduler at any time.
      
      Based on previous patches from Damien Le Moal.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      [Damien]
      * Fixed comments and identation in blkdev.h
      * Changed helper functions
      * Fixed this commit message
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6cc77e9c
    • M
      block: drain queue before waiting for q_usage_counter becoming zero · 454be724
      Ming Lei 提交于
      Now we track legacy requests with .q_usage_counter in commit 055f6e18
      ("block: Make q_usage_counter also track legacy requests"), but that
      commit never runs and drains legacy queue before waiting for this counter
      becoming zero, then IO hang is caused in the test of pulling disk during IO.
      
      This patch fixes the issue by draining requests before waiting for
      q_usage_counter becoming zero, both Mauricio and chenxiang reported this
      issue, and observed that it can be fixed by this patch.
      
      Link: https://marc.info/?l=linux-block&m=151192424731797&w=2
      Fixes: 055f6e18("block: Make q_usage_counter also track legacy requests")
      Cc: Wen Xiong <wenxiong@us.ibm.com>
      Tested-by: N"chenxiang (M)" <chenxiang66@hisilicon.com>
      Tested-by: NMauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      454be724
  10. 22 11月, 2017 1 次提交
    • K
      block/laptop_mode: Convert timers to use timer_setup() · bca237a5
      Kees Cook 提交于
      In preparation for unconditionally passing the struct timer_list pointer to
      all timer callbacks, switch to using the new timer_setup() and from_timer()
      to pass the timer pointer explicitly.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: linux-block@vger.kernel.org
      Cc: linux-mm@kvack.org
      Signed-off-by: NKees Cook <keescook@chromium.org>
      bca237a5
  11. 16 11月, 2017 1 次提交
  12. 11 11月, 2017 7 次提交
  13. 05 11月, 2017 1 次提交
  14. 04 11月, 2017 4 次提交
  15. 31 10月, 2017 1 次提交
  16. 11 10月, 2017 1 次提交
  17. 03 10月, 2017 1 次提交
  18. 25 9月, 2017 1 次提交
    • W
      blktrace: Fix potential deadlock between delete & sysfs ops · 5acb3cc2
      Waiman Long 提交于
      The lockdep code had reported the following unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(s_active#228);
                                     lock(&bdev->bd_mutex/1);
                                     lock(s_active#228);
        lock(&bdev->bd_mutex);
      
       *** DEADLOCK ***
      
      The deadlock may happen when one task (CPU1) is trying to delete a
      partition in a block device and another task (CPU0) is accessing
      tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
      partition.
      
      The s_active isn't an actual lock. It is a reference count (kn->count)
      on the sysfs (kernfs) file. Removal of a sysfs file, however, require
      a wait until all the references are gone. The reference count is
      treated like a rwsem using lockdep instrumentation code.
      
      The fact that a thread is in the sysfs callback method or in the
      ioctl call means there is a reference to the opended sysfs or device
      file. That should prevent the underlying block structure from being
      removed.
      
      Instead of using bd_mutex in the block_device structure, a new
      blk_trace_mutex is now added to the request_queue structure to protect
      access to the blk_trace structure.
      Suggested-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      
      Fix typo in patch subject line, and prune a comment detailing how
      the code used to work.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5acb3cc2
  19. 12 9月, 2017 1 次提交
    • J
      block: directly insert blk-mq request from blk_insert_cloned_request() · 157f377b
      Jens Axboe 提交于
      A NULL pointer crash was reported for the case of having the BFQ IO
      scheduler attached to the underlying blk-mq paths of a DM multipath
      device.  The crash occured in blk_mq_sched_insert_request()'s call to
      e->type->ops.mq.insert_requests().
      
      Paolo Valente correctly summarized why the crash occured with:
      "the call chain (dm_mq_queue_rq -> map_request -> setup_clone ->
      blk_rq_prep_clone) creates a cloned request without invoking
      e->type->ops.mq.prepare_request for the target elevator e.  The cloned
      request is therefore not initialized for the scheduler, but it is
      however inserted into the scheduler by blk_mq_sched_insert_request."
      
      All said, a request-based DM multipath device's IO scheduler should be
      the only one used -- when the original requests are issued to the
      underlying paths as cloned requests they are inserted directly in the
      underlying dispatch queue(s) rather than through an additional elevator.
      
      But commit bd166ef1 ("blk-mq-sched: add framework for MQ capable IO
      schedulers") switched blk_insert_cloned_request() from using
      blk_mq_insert_request() to blk_mq_sched_insert_request().  Which
      incorrectly added elevator machinery into a call chain that isn't
      supposed to have any.
      
      To fix this introduce a blk-mq private blk_mq_request_bypass_insert()
      that blk_insert_cloned_request() calls to insert the request without
      involving any elevator that may be attached to the cloned request's
      request_queue.
      
      Fixes: bd166ef1 ("blk-mq-sched: add framework for MQ capable IO schedulers")
      Cc: stable@vger.kernel.org
      Reported-by: NBart Van Assche <Bart.VanAssche@wdc.com>
      Tested-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      157f377b
  20. 29 8月, 2017 1 次提交
  21. 24 8月, 2017 1 次提交
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig 提交于
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74d46992
  22. 18 8月, 2017 1 次提交
    • B
      block: Relax a check in blk_start_queue() · 4ddd56b0
      Bart Van Assche 提交于
      Calling blk_start_queue() from interrupt context with the queue
      lock held and without disabling IRQs, as the skd driver does, is
      safe. This patch avoids that loading the skd driver triggers the
      following warning:
      
      WARNING: CPU: 11 PID: 1348 at block/blk-core.c:283 blk_start_queue+0x84/0xa0
      RIP: 0010:blk_start_queue+0x84/0xa0
      Call Trace:
       skd_unquiesce_dev+0x12a/0x1d0 [skd]
       skd_complete_internal+0x1e7/0x5a0 [skd]
       skd_complete_other+0xc2/0xd0 [skd]
       skd_isr_completion_posted.isra.30+0x2a5/0x470 [skd]
       skd_isr+0x14f/0x180 [skd]
       irq_forced_thread_fn+0x2a/0x70
       irq_thread+0x144/0x1a0
       kthread+0x125/0x140
       ret_from_fork+0x2a/0x40
      
      Fixes: commit a038e253 ("[PATCH] blk_start_queue() must be called with irq disabled - add warning")
      Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Andrew Morton <akpm@osdl.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4ddd56b0
  23. 10 8月, 2017 3 次提交
  24. 24 7月, 2017 1 次提交
  25. 04 7月, 2017 1 次提交
  26. 28 6月, 2017 2 次提交