1. 22 Mar 2018, 1 commit
  2. 20 Mar 2018, 1 commit
    • block: Change a rcu_read_{lock,unlock}_sched() pair into rcu_read_{lock,unlock}() · 818e0fa2
      Bart Van Assche authored
      scsi_device_quiesce() uses synchronize_rcu() to guarantee that the
      effect of blk_set_preempt_only() will be visible to percpu_ref_tryget()
      calls that occur after the queue unfreeze, following the approach
      explained in https://lwn.net/Articles/573497/. The RCU read lock and
      unlock calls in blk_queue_enter() form a pair with the synchronize_rcu()
      call in scsi_device_quiesce(). Both scsi_device_quiesce() and
      blk_queue_enter() must use either regular RCU or RCU-sched.
      Since neither the RCU-protected code in blk_queue_enter() nor
      blk_queue_usage_counter_release() sleeps, regular RCU protection
      is sufficient. Note: scsi_device_quiesce() does not have to be
      modified since it already uses synchronize_rcu().
      Reported-by: Tejun Heo <tj@kernel.org>
      Fixes: 3a0a5299 ("block, scsi: Make SCSI quiesce and resume work reliably")
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
      Cc: Martin Steigerwald <martin@lichtvoll.de>
      Cc: stable@vger.kernel.org # v4.15
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
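      A kernel-style sketch of the pattern this pairing relies on, simplified
      from the functions named above (details elided, not the verbatim code):

          /* Reader side (blk_queue_enter, simplified): the flag test and
           * the reference tryget sit in one RCU read-side critical section. */
          rcu_read_lock();
          if (!blk_queue_preempt_only(q) &&
              percpu_ref_tryget_live(&q->q_usage_counter)) {
                  rcu_read_unlock();
                  return 0;       /* got a reference, queue is usable */
          }
          rcu_read_unlock();

          /* Updater side (scsi_device_quiesce, simplified): after
           * synchronize_rcu() returns, every reader has either seen the
           * flag or already taken its reference. */
          blk_set_preempt_only(q);
          synchronize_rcu();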
  3. 18 Mar 2018, 2 commits
  4. 17 Mar 2018, 2 commits
    • blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir() · 4c699480
      Joseph Qi authored
      We've triggered a WARNING in blk_throtl_bio() when throttling writeback
      I/O: it complains that blkg->refcnt is already 0 when calling blkg_get(),
      and the kernel then crashes with an invalid page request.
      After investigating this issue, we've found it is caused by a race
      between blkcg_bio_issue_check() and cgroup_rmdir(), which is described
      below:
      
      writeback kworker               cgroup_rmdir
                                        cgroup_destroy_locked
                                          kill_css
                                            css_killed_ref_fn
                                              css_killed_work_fn
                                                offline_css
                                                  blkcg_css_offline
        blkcg_bio_issue_check
          rcu_read_lock
          blkg_lookup
                                                    spin_trylock(q->queue_lock)
                                                    blkg_destroy
                                                    spin_unlock(q->queue_lock)
          blk_throtl_bio
          spin_lock_irq(q->queue_lock)
          ...
          spin_unlock_irq(q->queue_lock)
        rcu_read_unlock
      
      Since RCU only prevents the blkg from being released while it is in
      use, blkg->refcnt can drop to 0 during blkg_destroy(), which schedules
      the blkg release.
      The subsequent blkg_get() in blk_throtl_bio() then triggers the
      WARNING, and the matching blkg_put() schedules the blkg release again,
      which results in a double free.
      This race was introduced by commit ae118896 ("blkcg: consolidate blkg
      creation in blkcg_bio_issue_check()"). Before that commit, the code
      looked the blkg up first and then tried to look it up or create it
      again under queue_lock. Since reviving that logic would be a bit
      drastic, fix the race by only offlining the pd during
      blkcg_css_offline(), and move the rest of the destruction (especially
      blkg_put()) into blkcg_css_free(), which should be the right way as
      discussed.
      
      Fixes: ae118896 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()")
      Reported-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
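      A sketch of the resulting split, assuming hypothetical helper names
      blkg_pd_offline() and blkg_destroy_all() (the real patch is larger):

          /* css_offline: only tell the policies to stop; in-flight bios
           * may still reach the blkg via blkcg_bio_issue_check(), so do
           * not drop the blkg references here. */
          static void blkcg_css_offline(struct cgroup_subsys_state *css)
          {
                  struct blkcg *blkcg = css_to_blkcg(css);

                  blkg_pd_offline(blkcg);         /* hypothetical helper */
          }

          /* css_free: runs only once all css references are gone, so the
           * final blkg_put() (and the actual destruction) is safe here. */
          static void blkcg_css_free(struct cgroup_subsys_state *css)
          {
                  struct blkcg *blkcg = css_to_blkcg(css);

                  blkg_destroy_all(blkcg);        /* hypothetical helper */
                  kfree(blkcg);
          }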
    • block: sed-opal: fix u64 short atom length · 5f990d31
      Jonas Rabenstein authored
      The length must be given in bytes, not in 4-bit tuples.
      Reviewed-by: Scott Bauer <scott.bauer@intel.com>
      Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
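      A self-contained sketch of the distinction (hypothetical helper, not
      the kernel function): the payload length stored in a short atom
      header counts bytes, so the bit count is rounded up by 8, not by 4:

          #include <stdint.h>

          /* Bytes needed to encode a non-zero u64 payload. */
          static int u64_payload_bytes(uint64_t number)
          {
                  /* position of the highest set bit, 1-based */
                  int msb = 64 - __builtin_clzll(number);

                  /* Correct: round the bit count up to whole bytes.
                   * The bug rounded up to 4-bit tuples, (msb + 3) / 4,
                   * yielding twice the length the atom header expects. */
                  return (msb + 7) / 8;
          }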
  5. 14 Mar 2018, 3 commits
  6. 09 Mar 2018, 6 commits
  7. 07 Mar 2018, 1 commit
  8. 01 Mar 2018, 6 commits
  9. 27 Feb 2018, 4 commits
    • genhd: Fix BUG in blkdev_open() · 56c0908c
      Jan Kara authored
      When two blkdev_open() calls for a partition race with device removal
      and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
      blkdev_open(). The race can happen as follows:
      
      CPU0				CPU1			CPU2
      							del_gendisk()
      							  bdev_unhash_inode(part1);
      
      blkdev_open(part1, O_EXCL)	blkdev_open(part1, O_EXCL)
        bdev = bd_acquire()		  bdev = bd_acquire()
        blkdev_get(bdev)
          bd_start_claiming(bdev)
            - finds old inode 'whole'
            bd_prepare_to_claim() -> 0
      							  bdev_unhash_inode(whole);
      							<device removed>
      							<new device under same
      							 number created>
      				  blkdev_get(bdev);
      				    bd_start_claiming(bdev)
      				      - finds new inode 'whole'
      				      bd_prepare_to_claim()
      					- this also succeeds as we have
      					  different 'whole' here...
      					- bad things happen now as we
      					  have two exclusive openers of
      					  the same bdev
      
      The problem here is that block device opens can see various intermediate
      states while gendisk is shutting down and then being recreated.
      
      We fix the problem by introducing a new lookup_sem in gendisk that
      synchronizes gendisk deletion with get_gendisk(), and furthermore by
      making sure that get_gendisk() does not return a gendisk that is being
      (or has been) deleted. This makes sure that once we manage to look up
      the newly created bdev inode, we are also guaranteed that the following
      get_gendisk() will either return failure (and we fail the open) or
      return the gendisk for the new device, and the following bdget_disk()
      will return the new bdev inode (i.e., blkdev_open() follows the path as
      if it ran entirely after the new device was created).
      Reported-and-analyzed-by: Hou Tao <houtao1@huawei.com>
      Tested-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
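      A sketch of the synchronization (the field name lookup_sem matches the
      patch; the surrounding code is abridged and simplified):

          /* Deletion side (del_gendisk): mark the disk dead under the
           * write lock, so new lookups can no longer see it as up. */
          down_write(&disk->lookup_sem);
          disk->flags &= ~GENHD_FL_UP;
          up_write(&disk->lookup_sem);

          /* Lookup side (get_gendisk): only hand out disks that are
           * still alive; the caller fails the open otherwise. */
          down_read(&disk->lookup_sem);
          if (!(disk->flags & GENHD_FL_UP)) {
                  up_read(&disk->lookup_sem);
                  return NULL;
          }
          up_read(&disk->lookup_sem);
          return disk;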
    • genhd: Add helper put_disk_and_module() · 9df6c299
      Jan Kara authored
      Add a proper counterpart to get_disk_and_module():
      put_disk_and_module(). Currently it is open-coded in several places.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
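      The helper pairs the disk reference drop with the module reference
      drop; a sketch of its shape (close to the mainline code, simplified):

          void put_disk_and_module(struct gendisk *disk)
          {
                  if (disk) {
                          /* grab the owner before put_disk() can free
                           * the disk and the fops pointer with it */
                          struct module *owner = disk->fops->owner;

                          put_disk(disk);
                          module_put(owner);
                  }
          }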
    • genhd: Rename get_disk() to get_disk_and_module() · 3079c22e
      Jan Kara authored
      Rename get_disk() to get_disk_and_module() to make clear what the
      function does. It's not a great name, but at least it is now obvious
      that put_disk() is not its counterpart.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • genhd: Fix leaked module reference for NVME devices · d52987b5
      Jan Kara authored
      Commit 8ddcd653 ("block: introduce GENHD_FL_HIDDEN") added handling of
      hidden devices to get_gendisk() but forgot to drop the module reference
      that is also acquired by get_disk(). Drop the reference as necessary.
      
      Arguably the function naming here is misleading, as put_disk() is *not*
      the counterpart of get_disk(), but let's fix that in the follow-up
      commit since that will be more intrusive.
      
      Fixes: 8ddcd653
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 25 Feb 2018, 2 commits
  11. 24 Feb 2018, 1 commit
  12. 14 Feb 2018, 1 commit
  13. 12 Feb 2018, 1 commit
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds authored
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 08 Feb 2018, 1 commit
    • block, bfq: add requeue-request hook · a7877390
      Paolo Valente authored
      Commit a6a252e6 ("blk-mq-sched: decide how to handle flush rq via
      RQF_FLUSH_SEQ") makes all non-flush re-prepared requests for a device
      be re-inserted into the active I/O scheduler for that device. As a
      consequence, I/O schedulers may get the same request inserted again,
      even several times, without finish_request being invoked on that
      request before each re-insertion.
      
      This fact is the cause of the failure reported in [1]. For an I/O
      scheduler, every re-insertion of the same re-prepared request is
      equivalent to the insertion of a new request. For schedulers like
      mq-deadline or kyber, this causes no harm. In contrast, it confuses a
      stateful scheduler like BFQ, which keeps state for an I/O request
      until the finish_request hook is invoked on that request. In
      particular, BFQ may get stuck, waiting forever for the number of
      dispatches of the same request to be balanced by an equal number of
      completions (while there will be only one completion for that
      request). In this state, BFQ may refuse to serve I/O requests from
      other bfq_queues. The hang reported in [1] then follows.
      
      However, the above re-prepared requests undergo a requeue, and thus the
      requeue_request hook of the active elevator is invoked for them, if
      set. This commit addresses the above issue by properly implementing
      the requeue_request hook in BFQ.
      
      [1] https://marc.info/?l=linux-block&m=151211117608676
      Reported-by: Ivan Kozik <ivan@ludios.org>
      Reported-by: Alban Browaeys <alban.browaeys@gmail.com>
      Tested-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Serena Ziviani <ziviani.serena@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
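      A sketch of the wiring, assuming BFQ points both hooks at the same
      teardown handler (hook names follow the blk-mq elevator ops; the rest
      of the ops table is elided):

          static struct elevator_type iosched_bfq_mq = {
                  .ops.mq = {
                          /* a requeued request must release scheduler
                           * state exactly as a completed one does */
                          .requeue_request = bfq_finish_requeue_request,
                          .finish_request  = bfq_finish_requeue_request,
                          /* other hooks elided */
                  },
                  .elevator_name = "bfq",
          };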
  15. 07 Feb 2018, 2 commits
    • block: Add should_fail_bio() for bpf error injection · 30abb3a6
      Howard McLauchlan authored
      The classic error injection mechanism, should_fail_request(), does not
      support use cases where more information is required (from the entire
      struct bio, for example).
      
      To that end, this patch introduces should_fail_bio(), which calls
      should_fail_request() under the hood but provides a convenient
      place for kprobes to hook into if they require the entire struct bio.
      This patch also replaces some existing calls to should_fail_request()
      with should_fail_bio(), with no degradation in performance.
      Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
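      A sketch of what such a wrapper looks like (modeled on the mainline
      function; the noinline plus the error-injection annotation are what
      make it a stable hook point for bpf):

          static noinline int should_fail_bio(struct bio *bio)
          {
                  if (should_fail_request(&bio->bi_disk->part0,
                                          bio->bi_iter.bi_size))
                          return -EIO;
                  return 0;
          }
          ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);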
    • blk-wbt: account flush requests correctly · 5235553d
      Jens Axboe authored
      Mikulas reported a workload that saw bad performance, and figured
      out it was due to various other types of requests being accounted
      as reads. Flush requests, for instance. Due to their high latency,
      we heavily throttle the writes to keep the latencies in balance,
      but flushes really should be accounted as writes.
      
      Fix this by checking the exact type of the request: if it's a read,
      account it as a read; if it's a write or a flush, account it as a
      write. Any other request we disregard. Previously, everything was
      mistakenly accounted as reads.
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
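      A sketch of the classification, consistent with the description above
      (the helper name wbt_data_dir follows the mainline code):

          static inline int wbt_data_dir(const struct request *rq)
          {
                  const int op = req_op(rq);

                  if (op == REQ_OP_READ)
                          return READ;
                  else if (op == REQ_OP_WRITE || op == REQ_OP_FLUSH)
                          return WRITE;

                  /* don't account */
                  return -1;
          }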
  16. 02 Feb 2018, 2 commits
    • bea99a50
    • blk-mq: fix discard merge with scheduler attached · 445251d0
      Jens Axboe authored
      I ran into an issue on my laptop that triggered a bug on the
      discard path:
      
      WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3/0x430
       Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd intel_gtt
       CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G     U           4.15.0+ #176
       Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017
       RIP: 0010:nvme_setup_cmd+0x3d3/0x430
       RSP: 0018:ffff880423e9f838 EFLAGS: 00010217
       RAX: 0000000000000000 RBX: ffff880423e9f8c8 RCX: 0000000000010000
       RDX: ffff88022b200010 RSI: 0000000000000002 RDI: 00000000327f0000
       RBP: ffff880421251400 R08: ffff88022b200000 R09: 0000000000000009
       R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000ffff
       R13: ffff88042341e280 R14: 000000000000ffff R15: ffff880421251440
       FS:  0000000000000000(0000) GS:ffff880441500000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055b684795030 CR3: 0000000002e09006 CR4: 00000000001606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        nvme_queue_rq+0x40/0xa00
        ? __sbitmap_queue_get+0x24/0x90
        ? blk_mq_get_tag+0xa3/0x250
        ? wait_woken+0x80/0x80
        ? blk_mq_get_driver_tag+0x97/0xf0
        blk_mq_dispatch_rq_list+0x7b/0x4a0
        ? deadline_remove_request+0x49/0xb0
        blk_mq_do_dispatch_sched+0x4f/0xc0
        blk_mq_sched_dispatch_requests+0x106/0x170
        __blk_mq_run_hw_queue+0x53/0xa0
        __blk_mq_delay_run_hw_queue+0x83/0xa0
        blk_mq_run_hw_queue+0x6c/0xd0
        blk_mq_sched_insert_request+0x96/0x140
        __blk_mq_try_issue_directly+0x3d/0x190
        blk_mq_try_issue_directly+0x30/0x70
        blk_mq_make_request+0x1a4/0x6a0
        generic_make_request+0xfd/0x2f0
        ? submit_bio+0x5c/0x110
        submit_bio+0x5c/0x110
        ? __blkdev_issue_discard+0x152/0x200
        submit_bio_wait+0x43/0x60
        ext4_process_freed_data+0x1cd/0x440
        ? account_page_dirtied+0xe2/0x1a0
        ext4_journal_commit_callback+0x4a/0xc0
        jbd2_journal_commit_transaction+0x17e2/0x19e0
        ? kjournald2+0xb0/0x250
        kjournald2+0xb0/0x250
        ? wait_woken+0x80/0x80
        ? commit_timeout+0x10/0x10
        kthread+0x111/0x130
        ? kthread_create_worker_on_cpu+0x50/0x50
        ? do_group_exit+0x3a/0xa0
        ret_from_fork+0x1f/0x30
       Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff
       ---[ end trace 50d361cc444506c8 ]---
       print_req_error: I/O error, dev nvme0n1, sector 847167488
      
      Decoding the assembly, the request claims to have 0xffff segments,
      while nvme counts two. This turns out to be because we don't check
      for a data carrying request on the mq scheduler path, and since
      blk_phys_contig_segment() returns true for a non-data request,
      we decrement the initial segment count of 0 and end up with
      0xffff in the unsigned short.
      
      There are a few issues here:
      
      1) We should initialize the segment count for a discard to 1.
      2) The discard merging is currently using the data limits for
         segments and sectors.
      
      Fix this up by having attempt_merge() correctly identify the
      request, and by initializing the segment count correctly
      for discards.
      
      This can only be triggered with mq-deadline on discard capable
      devices right now, which isn't a common configuration.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
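      A sketch of the two halves of the fix (simplified; placement and
      surrounding code elided):

          /* In attempt_merge(): requests of different operation types
           * (e.g. a discard and a write) must never be merged, so bail
           * out before any segment/sector accounting runs. */
          if (req_op(req) != req_op(next))
                  return NULL;

          /* On setup (sketch): give a discard a real segment count so
           * later accounting never decrements from zero. */
          if (req_op(rq) == REQ_OP_DISCARD)
                  rq->nr_phys_segments = 1;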
  17. 31 Jan 2018, 1 commit
    • blk-mq: introduce BLK_STS_DEV_RESOURCE · 86ff7c2a
      Ming Lei authored
      This status is returned from the driver to the block layer if a
      device-related resource is unavailable, but the driver can guarantee
      that IO dispatch will be triggered again in the future when the
      resource becomes available.
      
      Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if a
      driver returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the
      queue after a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.
      BLK_MQ_DELAY_QUEUE is 3 ms because both scsi-mq and nvmefc are using
      that magic value.
      
      If a driver can make sure there is in-flight IO, it is safe to return
      BLK_STS_DEV_RESOURCE because:
      
      1) If all in-flight IOs complete before SCHED_RESTART is examined in
      blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so the
      queue is run immediately in this case by blk_mq_dispatch_rq_list();
      
      2) if there is any in-flight IO after/when SCHED_RESTART is examined
      in blk_mq_dispatch_rq_list():
      - if SCHED_RESTART isn't set, the queue is run immediately as handled
        in 1)
      - otherwise, this request will be dispatched after any in-flight IO
        completes, via blk_mq_sched_restart()
      
      3) if SCHED_RESTART is set concurrently in this context because of
      BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
      cases and make sure an IO hang is avoided.
      
      One invariant is that the queue will be rerun if SCHED_RESTART is set.
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Tested-by: Laurence Oberman <loberman@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
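      A condensed sketch of the resulting handling at the end of the
      dispatch loop (names follow blk-mq; the real code handles more cases):

          if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
                  bool needs_restart = blk_mq_sched_needs_restart(hctx);

                  if (!needs_restart)
                          /* restart already cleared: rerun right away */
                          blk_mq_run_hw_queue(hctx, true);
                  else if (ret == BLK_STS_RESOURCE)
                          /* no completion guarantee from the driver:
                           * poll again after the 3 ms delay */
                          blk_mq_delay_run_hw_queue(hctx, BLK_MQ_DELAY_QUEUE);
                  /* else BLK_STS_DEV_RESOURCE: in-flight IO restarts us */
          }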
  18. 25 Jan 2018, 2 commits
  19. 24 Jan 2018, 1 commit