1. 08 Dec 2018 (1 commit)
    • blk-mq: fix corruption with direct issue · 724ff9cb
      Authored by Jens Axboe
      commit ffe81d45 upstream.
      
      If we attempt a direct issue to a SCSI device, and it returns BUSY, then
      we queue the request up normally. However, the SCSI layer may have
      already set up SG tables etc. for this particular command. If we later
      merge with this request, then the old tables are no longer valid. Once
      we issue the IO, we only read/write the original part of the request,
      not the new state of it.
      
      This causes data corruption, and is most often noticed as the file
      system complaining that the data it just read is invalid:
      
      [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
      
      because most of it is garbage...
      
      This doesn't happen from the normal issue path, as we will simply defer
      the request to the hardware queue dispatch list if we fail. Once it's on
      the dispatch list, we never merge with it.
      
      Fix this from the direct issue path by flagging the request as
      REQ_NOMERGE so we don't change the size of it before issue.
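
      The gist of the fix, sketched against the upstream blk-mq
      direct-issue error path (context simplified; stable backports may
      place it slightly differently):

          switch (ret) {
          case BLK_STS_RESOURCE:
          case BLK_STS_DEV_RESOURCE:
                  /*
                   * Direct dispatch failed. The driver (e.g. SCSI) may
                   * already have set up SG tables for this request, so
                   * forbid any later merge that would change its size,
                   * then requeue it.
                   */
                  rq->cmd_flags |= REQ_NOMERGE;
                  blk_mq_update_dispatch_busy(hctx, true);
                  __blk_mq_requeue_request(rq);
                  break;
          default:
                  break;
          }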
      
      See also:
        https://bugzilla.kernel.org/show_bug.cgi?id=201685

      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Fixes: 6ce3dd6e ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 01 Dec 2018 (1 commit)
  3. 27 Nov 2018 (1 commit)
  4. 21 Nov 2018 (1 commit)
    • SCSI: fix queue cleanup race before queue initialization is done · 410306a0
      Authored by Ming Lei
      commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.
      
      c2856ae2 ("blk-mq: quiesce queue before freeing queue") already
      fixed this race. However, the implied synchronize_rcu() in
      blk_mq_quiesce_queue() can slow down LUN probing significantly,
      causing a performance regression.
      
      Then 1311326c ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
      tried to avoid the unnecessary synchronize_rcu() by quiescing the
      queue only when queue initialization is done, because it is common
      to see lots of nonexistent LUNs that need to be probed.
      
      However, it turns out that quiescing the queue only when queue
      initialization is done isn't safe: when a SCSI command completes,
      the submitter of the command can be woken up immediately, and the
      SCSI device may then be removed while the run queue in
      scsi_end_request() is still in progress, which can cause a kernel
      panic.
      
      The Red Hat QE lab has several reports of this kind of kernel
      panic being triggered during boot.
      
      This patch addresses the issue by grabbing a queue usage counter
      reference across freeing the request and the subsequent queue run,
      as sketched below.
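
      The shape of the fix in scsi_end_request(), per the upstream
      commit (surrounding logic elided):

          struct request_queue *q = sdev->request_queue;

          /*
           * Hold a queue usage reference so blk_cleanup_queue() cannot
           * finish while we free the request and re-run the queue.
           */
          percpu_ref_get(&q->q_usage_counter);

          __blk_mq_end_request(req, error);
          blk_mq_run_hw_queues(q, true);

          percpu_ref_put(&q->q_usage_counter);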
      
      Fixes: 1311326c ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
      Cc: Andrew Jones <drjones@redhat.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
      Cc: stable <stable@vger.kernel.org>
      Cc: jianchao.wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 14 Nov 2018 (4 commits)
  6. 18 Oct 2018 (1 commit)
  7. 12 Oct 2018 (1 commit)
  8. 28 Sep 2018 (1 commit)
  9. 27 Sep 2018 (1 commit)
    • block: fix deadline elevator drain for zoned block devices · 854f31cc
      Authored by Damien Le Moal
      When the deadline scheduler is used with a zoned block device, writes
      to a zone will be dispatched one at a time. This causes the warning
      message:
      
      deadline: forced dispatching is broken (nr_sorted=X), please report this
      
      to be displayed when switching to another elevator with the legacy I/O
      path while write requests to a zone are being retained in the scheduler
      queue.
      
      Prevent this message from being displayed when executing
      elv_drain_elevator() for a zoned block device. __blk_drain_queue() will
      loop until all writes are dispatched and completed, resulting in the
      desired elevator queue drain without extensive modifications to the
      deadline code itself to handle forced-dispatch calls.
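
      A sketch of the check in elv_drain_elevator() on the legacy path
      (simplified):

          static void elv_drain_elevator(struct request_queue *q)
          {
                  static int printed;

                  while (q->elevator->type->ops.sq.elevator_dispatch_fn(q, 1))
                          ;
                  /*
                   * A zoned device legitimately holds back writes to a
                   * zone, so leftover sorted requests are not a
                   * scheduler bug there; don't warn in that case.
                   */
                  if (q->nr_sorted && !blk_queue_is_zoned(q) && printed++ < 10)
                          printk(KERN_ERR "%s: forced dispatching is broken "
                                 "(nr_sorted=%u), please report this\n",
                                 q->elevator->type->elevator_name, q->nr_sorted);
          }
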
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Fixes: 8dc8146f ("deadline-iosched: Introduce zone locking support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 26 Sep 2018 (1 commit)
  11. 22 Sep 2018 (1 commit)
    • block: use nanosecond resolution for iostat · b57e99b4
      Authored by Omar Sandoval
      Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
      updating properly on 4.18. This is because we started using ktime to
      track elapsed time, and we convert nanoseconds to jiffies when we update
      the partition counter. However, this gets rounded down, so any I/Os that
      take less than a jiffy are not accounted for. Previously in this case,
      the value of jiffies would sometimes increment while we were doing I/O,
      so at least some I/Os were accounted for.
      
      Let's convert the stats to use nanoseconds internally. We still report
      milliseconds as before, now more accurately than ever. The value is
      still truncated to 32 bits for backwards compatibility.
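
      A sketch of the idea (the stat field and variable names here are
      illustrative, not necessarily the patch's): keep nanoseconds
      internally and divide only when reporting, so sub-jiffy I/Os are
      no longer rounded down to zero:

          /* On I/O completion: accumulate elapsed time in nanoseconds. */
          part_stat_add(cpu, part, nsecs[rw], now_ns - rq->start_time_ns);

          /*
           * Only at reporting time (/proc/diskstats, sysfs): convert to
           * milliseconds, truncated to 32 bits for compatibility.
           */
          msecs = (unsigned int)div_u64(part_stat_read(part, nsecs[rw]),
                                        NSEC_PER_MSEC);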
      
      Fixes: 522a7775 ("block: consolidate struct request timestamp fields")
      Cc: stable@vger.kernel.org
      Reported-by: Klaus Kusche <klaus.kusche@computerix.info>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 12 Sep 2018 (1 commit)
  13. 07 Sep 2018 (1 commit)
  14. 06 Sep 2018 (1 commit)
  15. 01 Sep 2018 (3 commits)
    • blkcg: use tryget logic when associating a blkg with a bio · 31118850
      Authored by Dennis Zhou (Facebook)
      There is a very small chance that a bio gets caught up in a really
      unfortunate race between a task migration, a cgroup exiting, and the
      bio itself trying to associate with a blkg. This is due to css
      offlining being performed after the css->refcnt is killed, which
      triggers removal of blkgs that reach a blkg->refcnt of 0.
      
      To avoid this, association with a blkg should use tryget and fall
      back to using the root_blkg.
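
      Roughly, the association path becomes (a sketch using the
      blkg_try_get() helper of that era; error handling at the caller is
      elided):

          int bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg)
          {
                  if (unlikely(bio->bi_blkg))
                          return -EBUSY;
                  /* Refuse a blkg whose refcount already hit zero. */
                  if (!blkg_try_get(blkg))
                          return -ENODEV;
                  bio->bi_blkg = blkg;
                  return 0;
          }

      with callers falling back to the queue's root_blkg when the lookup
      or the tryget fails.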
      
      Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: delay blkg destruction until after writeback has finished · 59b57717
      Authored by Dennis Zhou (Facebook)
      Currently, blkcg destruction relies on a sequence of events:
        1. Destruction starts. blkcg_css_offline() is called and blkgs
           release their reference to the blkcg. This immediately destroys
           the cgwbs (writeback).
        2. With blkgs giving up their reference, the blkcg ref count should
           become zero and eventually call blkcg_css_free() which finally
           frees the blkcg.
      
      Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
      and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
      on the completion of all writeback associated with the blkcg. A count of
      the number of cgwbs is maintained and once that goes to zero, blkg
      destruction can follow. This should prevent premature blkg destruction
      related to writeback.
      
      The new process for blkcg cleanup is as follows:
        1. Destruction starts. blkcg_css_offline() is called which offlines
           writeback. Blkg destruction is delayed on the cgwb_refcnt count to
           avoid punting potentially large amounts of outstanding writeback
           to root while maintaining any ongoing policies. Here, the base
           cgwb_refcnt is put back.
        2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
           and handles destruction of blkgs. This is where the css reference
           held by each blkg is released.
        3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
           This finally frees the blkg.
      
      It seems that in the past blk-throttle did something not entirely
      transparent: it took data from a blkg while associating with
      current. The simplification and unification of what blk-throttle
      does is what exposed this race.
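
      The cgwb counting reduces to a pair of helpers (a sketch matching
      the sequence above; blkcg_css_offline() holds the base reference
      and puts it once writeback is offlined):

          static inline void blkcg_cgwb_get(struct blkcg *blkcg)
          {
                  refcount_inc(&blkcg->cgwb_refcnt);
          }

          static inline void blkcg_cgwb_put(struct blkcg *blkcg)
          {
                  /* Last cgwb gone: now it is safe to destroy blkgs. */
                  if (refcount_dec_and_test(&blkcg->cgwb_refcnt))
                          blkcg_destroy_blkgs(blkcg);
          }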
      
      Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()" · 6b065462
      Authored by Dennis Zhou (Facebook)
      This reverts commit 4c699480.
      
      Destroying blkgs is tricky because of the nature of the relationship. A
      blkg should go away when either a blkcg or a request_queue goes away.
      However, blkgs pin the blkcg to ensure they remain valid. To break this
      cycle, when a blkcg is offlined, blkgs put back their css ref. This
      eventually lets css_free() get called, which frees the blkcg.
      
      The above commit (4c699480) breaks this order of events by trying to
      destroy blkgs in css_free(). As the blkgs still hold references to the
      blkcg, css_free() is never called.
      
      The race between blkcg_bio_issue_check() and cgroup_rmdir() will be
      addressed in the following patch by delaying destruction of a blkg until
      all writeback associated with the blkcg has been finished.
      
      Fixes: 4c699480 ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  16. 28 Aug 2018 (5 commits)
  17. 23 Aug 2018 (4 commits)
  18. 21 Aug 2018 (2 commits)
    • blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter · f5bbbbe4
      Authored by Jianchao Wang
      For blk-mq, part_in_flight/rw invokes blk_mq_in_flight/rw to
      account for in-flight requests, accessing queue_hw_ctx and
      nr_hw_queues without any protection. When an update of nr_hw_queues
      and blk_mq_in_flight/rw run concurrently, a panic results.
      
      Before nr_hw_queues is updated, the queue is frozen, so we can use
      q_usage_counter to avoid the race. percpu_ref_is_zero is used here
      so that we will not miss any in-flight request. The accesses to
      nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
      under an RCU critical section, and __blk_mq_update_nr_hw_queues can
      use synchronize_rcu to ensure the zeroed q_usage_counter is
      globally visible.
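
      Sketched against blk_mq_queue_tag_busy_iter() (simplified):

          void blk_mq_queue_tag_busy_iter(struct request_queue *q,
                                          busy_iter_fn *fn, void *priv)
          {
                  struct blk_mq_hw_ctx *hctx;
                  int i;

                  /*
                   * Pairs with the synchronize_rcu() in
                   * __blk_mq_update_nr_hw_queues(): while the queue is
                   * frozen for the update, q_usage_counter reads zero
                   * and queue_hw_ctx/nr_hw_queues must not be touched.
                   */
                  rcu_read_lock();
                  if (percpu_ref_is_zero(&q->q_usage_counter)) {
                          rcu_read_unlock();
                          return;
                  }
                  queue_for_each_hw_ctx(q, hctx, i) {
                          /* ... walk the busy tags of this hctx ... */
                  }
                  rcu_read_unlock();
          }
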
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: init hctx sched after update ctx and hctx mapping · d48ece20
      Authored by Jianchao Wang
      Currently, when nr_hw_queues is updated, the I/O scheduler's
      init_hctx is invoked before the mapping between ctx and hctx has
      been adapted by blk_mq_map_swqueue. An I/O scheduler's init_hctx
      (e.g. kyber's) may depend on this mapping, compute a wrong result,
      and finally panic. A simple way to fix this is to switch the I/O
      scheduler to 'none' before updating nr_hw_queues, and then switch
      it back afterwards, as sketched below. blk_mq_sched_init_/exit_hctx
      are removed because nobody uses them any more.
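
      The shape of the fix in __blk_mq_update_nr_hw_queues() (a sketch;
      error handling elided):

          list_for_each_entry(q, &set->tag_list, tag_set_list)
                  blk_mq_freeze_queue(q);

          /*
           * Detach elevators first so that no scheduler init_hctx() can
           * run against a stale ctx <-> hctx mapping.
           */
          list_for_each_entry(q, &set->tag_list, tag_set_list)
                  blk_mq_elv_switch_none(&head, q);

          blk_mq_update_queue_map(set);
          list_for_each_entry(q, &set->tag_list, tag_set_list) {
                  blk_mq_realloc_hw_ctxs(set, q);
                  blk_mq_map_swqueue(q);
          }

          /* Re-attach the saved elevators against the new mapping. */
          list_for_each_entry(q, &set->tag_list, tag_set_list)
                  blk_mq_elv_switch_back(&head, q);

          list_for_each_entry(q, &set->tag_list, tag_set_list)
                  blk_mq_unfreeze_queue(q);
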
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  19. 18 Aug 2018 (1 commit)
  20. 17 Aug 2018 (6 commits)
  21. 15 Aug 2018 (2 commits)