1. 27 Jan 2018, 2 commits
  2. 26 Jan 2018, 5 commits
  3. 25 Jan 2018, 3 commits
  4. 24 Jan 2018, 2 commits
  5. 20 Jan 2018, 6 commits
  6. 19 Jan 2018, 7 commits
  7. 18 Jan 2018, 13 commits
    • block, bfq: limit sectors served with interactive weight raising · 8a8747dc
      Committed by Paolo Valente
      To maximise responsiveness, BFQ raises the weight, and performs device
      idling, for bfq_queues associated with processes deemed as
      interactive. In particular, weight raising has a maximum duration,
      equal to the time needed to start a large application. If a
      weight-raised process goes on doing I/O beyond this maximum duration,
      it loses weight-raising.
      
      This mechanism is evidently vulnerable to the following false
      positives: I/O-bound applications that will go on doing I/O for much
      longer than the duration of weight-raising. These applications have
      basically no benefit from being weight-raised at the beginning of
      their I/O. On the opposite end, while being weight-raised, these
      applications
      a) unjustly steal throughput from applications that may truly need
      low latency;
      b) make BFQ uselessly perform device idling; device idling results
      in loss of device throughput with most flash-based storage, and may
      increase latencies when used purposelessly.
      
      This commit adds a countermeasure to reduce both the above
      problems. To introduce this countermeasure, we provide the following
      extra piece of information (full details in the comments added by this
      commit). During the start-up of the large application used as a
      reference to set the duration of weight-raising, involved processes
      transfer at most ~110K sectors each. Accordingly, a process initially
      deemed interactive no longer deserves to be weight-raised once it has
      transferred 110K sectors or more.
      
      Based on this consideration, this commit early-ends weight-raising
      for a bfq_queue if the latter happens to have received an amount of
      service at least equal to 110K sectors (actually, a little bit more,
      to keep a safety margin). I/O-bound applications that reach a high
      throughput, such as file copy, hit this threshold well before the
      allowed weight-raising period finishes. Thus this early ending of
      weight-raising reduces the amount of time during which these
      applications cause the problems described above.
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8a8747dc
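      For illustration, a minimal C sketch of the sector-cap idea; all names
      are hypothetical stand-ins, not the actual BFQ symbols:

          #include <stdbool.h>
          #include <stdint.h>

          /* ~110K sectors, plus a little headroom as a safety margin */
          #define WR_MAX_SERVICE_SECTORS (110ULL * 1024 + 2048)

          struct wr_queue {
                  uint64_t service_sectors; /* service since wr started */
                  bool     weight_raised;
          };

          /* Called as the queue receives service: an initially
           * "interactive" queue that has already transferred more than a
           * large application start-up needs is really I/O-bound, so end
           * its weight raising early. */
          static void wr_account_service(struct wr_queue *q, uint64_t sectors)
          {
                  q->service_sectors += sectors;
                  if (q->weight_raised &&
                      q->service_sectors > WR_MAX_SERVICE_SECTORS)
                          q->weight_raised = false;
          }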
    • block, bfq: limit tags for writes and async I/O · a52a69ea
      Committed by Paolo Valente
      Asynchronous I/O can easily starve synchronous I/O (both sync reads
      and sync writes), by consuming all request tags. Similarly, storms of
      synchronous writes, such as those that sync(2) may trigger, can starve
      synchronous reads. In their turn, these two problems may also cause
      BFQ to lose control over latency for interactive and soft real-time
      applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice
      Writer takes 0.6 seconds to start if the device is idle, but it takes
      more than 45 seconds (!) if there are sequential writes in the
      background.
      
      This commit addresses this issue by limiting the maximum percentage of
      tags that asynchronous I/O requests and synchronous write requests can
      consume. In particular, this commit grants a higher threshold to
      synchronous writes, to prevent the latter from being starved by
      asynchronous I/O.
      
      According to the above test, LibreOffice Writer now starts in about
      1.2 seconds on average, regardless of the background workload, and
      apart from some rare outlier. To check this improvement, run, e.g.,
      sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init"
      for the comm_startup_lat benchmark in the S suite [1].
      
      [1] https://github.com/Algodev-github/S
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a52a69ea
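      For illustration, a hedged sketch of per-class tag limits; the symbols
      and percentages below are assumptions, not the exact BFQ
      implementation:

          /* Grant async I/O the strictest share of the tag space, sync
           * writes a looser one, and leave sync reads unlimited. */
          struct tag_limits {
                  unsigned int queue_depth;      /* total tags */
                  unsigned int async_depth;      /* e.g. 25% of tags */
                  unsigned int sync_write_depth; /* e.g. 75% of tags */
          };

          static unsigned int max_depth(const struct tag_limits *tl,
                                        int is_sync, int is_write)
          {
                  if (!is_sync)
                          return tl->async_depth;
                  if (is_write)
                          return tl->sync_write_depth;
                  return tl->queue_depth; /* sync reads: no limit */
          }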
    • blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy · 23d4ee19
      Committed by Ming Lei
      When blk_mq_request_direct_issue() finds the queue busy, we don't want
      to dispatch the request onto hctx->dispatch_list; instead we need to
      return the busy status to the caller, so that the caller can handle
      it properly.
      
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Reported-by: Laurence Oberman <loberman@redhat.com>
      Reviewed-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      23d4ee19
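      A self-contained sketch of the intended control flow (illustrative
      names, not the kernel symbols):

          #include <stdbool.h>

          enum issue_status { ISSUE_OK, ISSUE_BUSY };

          /* On a busy queue, a direct-issue caller (dm-rq) must get the
           * busy status back instead of the request being parked on
           * hctx->dispatch_list behind its back. */
          static enum issue_status try_issue(bool queue_busy, bool bypass_insert)
          {
                  if (queue_busy && bypass_insert)
                          return ISSUE_BUSY; /* caller requeues */
                  /* otherwise: dispatch, or park on the dispatch list */
                  return ISSUE_OK;
          }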
    • block: Fix __bio_integrity_endio() documentation · de99a346
      Committed by Bart Van Assche
      Fixes: 4246a0b6 ("block: add a bi_error field to struct bio")
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      de99a346
    • nvme-pci: clean up SMBSZ bit definitions · 88de4598
      Committed by Christoph Hellwig
      Define the bit positions instead of macros built from magic values, and
      move the expanded helpers that calculate the size and size unit into
      the implementation C file.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      88de4598
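      For illustration, the bit-position pattern with helpers derived from
      the CMBSZ layout in the NVMe spec; the macro names here are
      assumptions, not necessarily the kernel's:

          #include <stdint.h>

          #define CMBSZ_SZU_SHIFT 8     /* size units field, bits 11:8 */
          #define CMBSZ_SZU_MASK  0xf
          #define CMBSZ_SZ_SHIFT  12    /* size field, bits 31:12 */

          /* Unit is 4 KiB * 16^SZU = 2^(12 + 4*SZU); the spec only
           * defines small SZU values, so the shift stays in range. */
          static inline uint64_t cmb_size_unit(uint32_t cmbsz)
          {
                  uint8_t szu = (cmbsz >> CMBSZ_SZU_SHIFT) & CMBSZ_SZU_MASK;

                  return 1ULL << (12 + 4 * szu);
          }

          static inline uint64_t cmb_size(uint32_t cmbsz)
          {
                  return cmb_size_unit(cmbsz) * (cmbsz >> CMBSZ_SZ_SHIFT);
          }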
    • nvme-pci: clean up CMB initialization · f65efd6d
      Committed by Christoph Hellwig
      Refactor the call to nvme_map_cmb, and change the conditions for probing
      for the CMB.  First, remove the version check, as NVMe TPs always apply
      to earlier versions of the spec as well.  Second, check the whole CMBSZ
      register for support of the CMB feature, instead of just the size field
      inside it, to simplify the code a bit.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      f65efd6d
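      A minimal sketch of the simplified probe condition (hypothetical
      names): no version gate, and CMB support is inferred from the whole
      register rather than just its size field:

          #include <stdbool.h>
          #include <stdint.h>

          static bool cmb_supported(uint32_t cmbsz)
          {
                  return cmbsz != 0; /* all-zero register: no CMB */
          }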
    • nvme-fc: correct hang in nvme_ns_remove() · 0fd997d3
      Committed by James Smart
      When connectivity is lost to a device, the association is terminated
      and the blk-mq queues are quiesced/stopped. When connectivity is
      re-established, they are resumed.
      
      If connectivity is lost for a sufficient amount of time that the
      controller is then deleted, the delete path starts tearing down queues,
      and eventually calling nvme_ns_remove(). It appears that pending
      commands may cause blk_cleanup_queue() to never complete and the
      teardown stalls.
      
      Correct by starting the ns queues after transitioning to a DELETING
      state, allowing pending commands to be flushed with io failures. Thus
      the delete path is clear when reached.
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      0fd997d3
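      A sketch of the ordering the fix establishes; the types and helpers
      below are illustrative stand-ins for the nvme-fc code:

          enum fc_ctrl_state { FC_LIVE, FC_DELETING };

          struct fc_ns   { struct fc_ns *next; };
          struct fc_ctrl { enum fc_ctrl_state state; struct fc_ns *ns_list; };

          /* Stand-in for unquiescing a namespace queue: pending commands
           * are then flushed as I/O errors. */
          static void start_ns_queue(struct fc_ns *ns) { (void)ns; }

          static void delete_association(struct fc_ctrl *ctrl)
          {
                  struct fc_ns *ns;

                  ctrl->state = FC_DELETING;        /* transition first */
                  for (ns = ctrl->ns_list; ns; ns = ns->next)
                          start_ns_queue(ns);       /* flush pending I/O */
                  /* teardown follows; blk_cleanup_queue() can complete */
          }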
    • nvme-fc: fix rogue admin cmds stalling teardown · d625d05e
      Committed by James Smart
      When connectivity is lost to a device, the association is terminated
      and the blk-mq queues are quiesced/stopped. When connectivity is
      re-established, they are resumed.
      
      If an admin command is received while connectivity is lost, the ioctl
      queues the command on the admin_q and the command stalls (the thread
      issuing the ioctl hangs/waits). If connectivity is lost long enough
      that the controller is then deleted, the delete code
      makes its calls to initiate the delete, which then expects the core
      layer to call the transport when all references are removed and the
      controller can be freed.  Unfortunately, nothing in this path dequeued
      the admin command, so a reference sits outstanding and things stop,
      hanging the delete indefinitely.
      
      Correct by unquiescing the admin queue in the delete association. This
      means any admin command (which should only be from an ioctl) issued
      after connectivity is lost will detect the controller is in a
      reconnecting state and will (fast) fail the command. Thus, a pending
      reference can no longer be created.  Once connectivity is re-established,
      a new ioctl/admin command would see proper device state and function again.
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      d625d05e
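      For illustration, the fast-fail check a late admin command hits once
      the queue is unquiesced; the names are hypothetical:

          #include <errno.h>

          enum adm_ctrl_state { ADM_LIVE, ADM_RECONNECTING, ADM_DELETING };

          /* With the admin queue unquiesced, a queued command is
           * dispatched, sees a non-LIVE controller, and is rejected
           * immediately instead of pinning a reference forever. */
          static int check_admin_cmd(enum adm_ctrl_state state)
          {
                  if (state != ADM_LIVE)
                          return -ENODEV; /* fast failure */
                  return 0;
          }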
    • blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request · 9e97d295
      Committed by Mike Snitzer
      After commit:
      
      923218f6 ("blk-mq: don't allocate driver tag upfront for flush rq")
      
      we no longer use the 'can_block' argument in
      blk_mq_sched_insert_request(). Kill it.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      
      Added actual commit message as to why it's being removed.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9e97d295
    • blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback · 396eaf21
      Committed by Ming Lei
      blk_insert_cloned_request() is called in the fast path of a dm-rq driver
      (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
      blk_mq_request_bypass_insert() to directly append the request to the
      blk-mq hctx->dispatch_list of the underlying queue.
      
      1) This way isn't efficient enough because the hctx spinlock is always
      used.
      
      2) With blk_insert_cloned_request(), we completely bypass underlying
      queue's elevator and depend on the upper-level dm-rq driver's elevator
      to schedule IO.  But dm-rq currently can't get the underlying queue's
      dispatch feedback at all.  Without knowing whether a request was issued
      or not (e.g. due to underlying queue being busy) the dm-rq elevator will
      not be able to provide effective IO merging (as a side-effect of dm-rq
      currently blindly destaging a request from its elevator only to requeue
      it after a delay, which kills any opportunity for merging).  This
      obviously causes very bad sequential IO performance.
      
      Fix this by updating blk_insert_cloned_request() to use
      blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
      request to be issued directly to the underlying queue and returns the
      dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
      returns BLK_STS_RESOURCE, the dm-rq driver will now use DM_MAPIO_REQUEUE
      to _not_ destage the request, thereby preserving the opportunity to
      merge IO.
      
      With this, request-based DM's blk-mq sequential IO performance is vastly
      improved (as much as 3X in mpath/virtio-scsi testing).
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
      they were refactored to make them less fragile and easier to read/review]
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      396eaf21
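      For illustration, a minimal sketch of the feedback loop on the dm-rq
      side; the enum values mirror the kernel names, the function itself is
      hypothetical:

          enum blk_status { BLK_STS_OK, BLK_STS_RESOURCE };
          enum dm_mapio   { DM_MAPIO_SUBMITTED, DM_MAPIO_REQUEUE };

          /* Translate the underlying queue's dispatch result into the
           * dm-rq decision: on a busy queue, keep the original request in
           * the dm-rq elevator so it can still be merged. */
          static enum dm_mapio map_clone_result(enum blk_status sts)
          {
                  if (sts == BLK_STS_RESOURCE)
                          return DM_MAPIO_REQUEUE; /* don't destage */
                  return DM_MAPIO_SUBMITTED;       /* issued downstream */
          }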
    • blk-mq: factor out a few helpers from __blk_mq_try_issue_directly · 0f95549c
      Committed by Mike Snitzer
      No functional change.  Just makes code flow more logically.
      
      In a following commit, __blk_mq_try_issue_directly() will be used to
      return the dispatch result (blk_status_t) to DM.  DM needs this
      information to improve IO merging.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0f95549c
    • blk-mq: turn WARN_ON in __blk_mq_run_hw_queue into printk · 7df938fb
      Committed by Ming Lei
      We know this WARN_ON is harmless and in reality it may be triggered,
      so convert it to printk() and dump_stack() to avoid confusing
      people.
      
      Also add a comment about the two related races here.
      
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7df938fb
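      A tiny sketch of the conversion, with fprintf standing in for printk()
      and the dump_stack() call noted in a comment:

          #include <stdio.h>

          static void warn_wrong_cpu(int cpu)
          {
                  /* known-harmless race: report it, don't WARN_ON() */
                  fprintf(stderr,
                          "run queue from wrong CPU %d (harmless race)\n",
                          cpu);
                  /* in the kernel, dump_stack() would follow here */
          }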
    • blk-mq: make sure hctx->next_cpu is set correctly · 7bed4595
      Committed by Ming Lei
      When hctx->next_cpu is set from possible online CPUs, there is a race
      in which hctx->next_cpu may end up set >= nr_cpu_ids and finally
      break the workqueue.
      
      The race can be triggered in the following two situations:
      
      1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
      to dispatch requests from the DEAD cpu context, but at that
      time this DEAD CPU has already been cleared from 'cpu_online_mask', so
      all CPUs in hctx->cpumask may be offline, causing hctx->next_cpu to be
      set to a bad value.
      
      2) blk_mq_delay_run_hw_queue() is called from CPU B and finds the queue
      should be run on another CPU A; CPU A may then go offline at the same
      time, leaving all CPUs in hctx->cpumask offline.
      
      This patch deals with the issue by re-selecting the next CPU, and making
      sure it is set correctly.
      
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Fixes: 20e4d813 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7bed4595
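      A self-contained sketch of the defensive re-selection; the array
      stands in for cpumask operations and all names are illustrative:

          #define NR_CPU_IDS 8

          static int next_set_cpu(const int *online, int start)
          {
                  int cpu;

                  for (cpu = start; cpu < NR_CPU_IDS; cpu++)
                          if (online[cpu])
                                  return cpu;
                  return NR_CPU_IDS; /* like cpumask_next() past the end */
          }

          static int select_next_cpu(const int *online, int prev, int this_cpu)
          {
                  int cpu = next_set_cpu(online, prev + 1);

                  if (cpu >= NR_CPU_IDS)  /* wrap around and retry once */
                          cpu = next_set_cpu(online, 0);
                  if (cpu >= NR_CPU_IDS)  /* everything went offline: stay
                                             local; never hand a value
                                             >= nr_cpu_ids to the workqueue */
                          cpu = this_cpu;
                  return cpu;
          }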
  8. 17 Jan 2018, 1 commit
    • aoe: use ktime_t instead of timeval · 85cf955d
      Committed by Tina Ruchandani
      'struct frame' uses two variables to store the sent timestamp - 'struct
      timeval' and jiffies. jiffies is used to avoid discrepancies caused by
      updates to system time. 'struct timeval' is deprecated because it uses
      32-bit representation for seconds which will overflow in year 2038.
      
      This patch does the following:
      - Replace the use of 'struct timeval' and jiffies with ktime_t, which
        is the recommended type for timestamping
      - ktime_t provides both long range (like jiffies) and high resolution
        (like timeval). Using ktime_get (monotonic time) instead of wall-clock
        time prevents any discrepancies caused by updates to system time.
      
      [updates by Arnd below]
      The original patch from Tina never went anywhere as we discussed how
      to keep the impact on performance minimal. I've started over now but
      arrived at basically the same patch that she had originally, except for
      a slightly improved tsince_hr() function. I'm making it more robust
      against overflows, and also optimize explicitly for the common case
      in which a frame is less than 4.2 seconds old, using only a 32-bit
      division in that case.
      
      This should make the new version more efficient than the old code,
      since we replace the existing two 32-bit divisions in do_gettimeofday()
      plus one multiplication with a single 32-bit division in
      tsince_hr() and drop the double bookkeeping. It's also more efficient
      than the ktime_get_us() API we discussed before, since that would
      also rely on multiple divisions.
      
      Link: https://lists.linaro.org/pipermail/y2038/2015-May/000276.html
      Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
      Cc: Ed Cashin <ed.cashin@acm.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      85cf955d
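      For illustration, a sketch of the fast-path division described above;
      the function name matches the commit, the body is a simplified
      stand-in:

          #include <stdint.h>

          /* Age of a frame in microseconds, from ktime-style nanosecond
           * timestamps. Frames younger than ~4.29 s (the common case)
           * need only a single 32-bit division. */
          static uint32_t tsince_hr(int64_t sent_ns, int64_t now_ns)
          {
                  int64_t delta = now_ns - sent_ns;

                  if (delta < 0)
                          delta = 0; /* robustness against anomalies */
                  if (delta < (int64_t)UINT32_MAX)
                          return (uint32_t)delta / 1000; /* 32-bit div */
                  return (uint32_t)(delta / 1000);       /* rare 64-bit div */
          }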
  9. 16 Jan 2018, 1 commit
    • blkcg: simplify statistic accumulation code · ddc21231
      Committed by Arnd Bergmann
      Some older compilers (gcc-4.4 through 4.6 in particular) struggle
      with the way that blkg_rwstat_read() returns a structure, leading
      to excessive stack usage and rather inefficient code:
      
      block/blk-cgroup.c: In function 'blkg_destroy':
      block/blk-cgroup.c:354:1: error: the frame size of 1296 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
      block/cfq-iosched.c: In function 'cfqg_stats_add_aux':
      block/cfq-iosched.c:753:1: error: the frame size of 1928 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
      block/bfq-cgroup.c: In function 'bfqg_stats_add_aux':
      block/bfq-cgroup.c:299:1: error: the frame size of 1928 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
      
      I also notice that there is no point in using atomic accesses
      for the local variables, so storing the temporaries in simple 'u64'
      variables not only avoids the stack usage on older compilers but
      also improves the object code on modern versions.
      
      Fixes: e6269c44 ("blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it")
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ddc21231
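      For illustration, the shape of the simplification: sum through plain
      u64 temporaries instead of passing and returning structs of atomics.
      The types and names are stand-ins for the blkg_rwstat code:

          #include <stdint.h>

          #define NR_RWSTATS 5

          struct rwstat { uint64_t cnt[NR_RWSTATS]; };

          /* Read each source counter once into a u64 and add it to the
           * aux counters; no struct is returned by value, so the stack
           * frame stays small even on old compilers. */
          static void rwstat_add_aux(struct rwstat *to, const struct rwstat *from)
          {
                  int i;

                  for (i = 0; i < NR_RWSTATS; i++) {
                          uint64_t v = from->cnt[i];

                          to->cnt[i] += v;
                  }
          }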