1. 09 4月, 2017 5 次提交
  2. 08 4月, 2017 3 次提交
  3. 07 4月, 2017 3 次提交
    • N
      block: trace completion of all bios. · fbbaf700
      NeilBrown 提交于
      Currently only dm and md/raid5 bios trigger
      trace_block_bio_complete().  Now that we have bio_chain() and
      bio_inc_remaining(), it is not possible, in general, for a driver to
      know when the bio is really complete.  Only bio_endio() knows that.
      
      So move the trace_block_bio_complete() call to bio_endio().
      
      Now trace_block_bio_complete() pairs with trace_block_bio_queue().
      Any bio for which a 'queue' event is traced, will subsequently
      generate a 'complete' event.
      
      There are a few cases where completion tracing is not wanted.
      1/ If blk_update_request() has already generated a completion
         trace event at the 'request' level, there is no point generating
         one at the bio level too.  In this case the bi_sector and bi_size
         will have changed, so the bio level event would be wrong
      
      2/ If the bio hasn't actually been queued yet, but is being aborted
         early, then a trace event could be confusing.  Some filesystems
         call bio_endio() but do not want tracing.
      
      3/ The bio_integrity code interposes itself by replacing bi_end_io,
         then restoring it and calling bio_endio() again.  This would produce
         two identical trace events if left like that.
      
      To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
      produce the trace event when this is set.
      We address point 1 above by clearing the flag in blk_update_request().
      We address point 2 above by only setting the flag when
      generic_make_request() is called.
      We address point 3 above by clearing the flag after generating a
      completion event.
      
      When bio_split() is used on a bio, particularly in blk_queue_split(),
      there is an extra complication.  A new bio is split off the front, and
      may be handle directly without going through generic_make_request().
      The old bio, which has been advanced, is passed to
      generic_make_request(), so it will trigger a trace event a second
      time.
      Probably the best result when a split happens is to see a single
      'queue' event for the whole bio, then multiple 'complete' events - one
      for each component.  To achieve this was can:
      - copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
      - avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
      This way, the split-off bio won't create a queue event, the original
      won't either even if it re-submitted to generic_make_request(),
      but both will produce completion events, each for their own range.
      
      So if generic_make_request() is called (which generates a QUEUED
      event), then bi_endio() will create a single COMPLETE event for each
      range that the bio is split into, unless the driver has explicitly
      requested it not to.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      fbbaf700
    • N
      block: simple improvements for bio->flags · dbde775c
      NeilBrown 提交于
      The comment for the 'flags' field of 'bio' mentions
      "command" which is no longer stored there, and doesn't
      mention the bvec pool number, which is.
      
      BIO_RESET_BITS is set in such a way that it would need to be
      updated if new bits were added, which is easy to miss.
      
      BVEC_POOL_BITS is larger than needed.  The BVEC_POOL_IDX()
      ranges from 0 to 6, so 3 bits are sufficient.
      
      This patch make improvements in each of these areas.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dbde775c
    • O
      blk-mq-sched: fix crash in switch error path · 54d5329d
      Omar Sandoval 提交于
      In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
      back to the original scheduler. However, at this point, we've already
      torn down the original scheduler's tags, so this causes a crash. Doing
      the fallback like the legacy elevator path is much harder for mq, so fix
      it by just falling back to none, instead.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      54d5329d
  4. 06 4月, 2017 2 次提交
  5. 05 4月, 2017 1 次提交
  6. 04 4月, 2017 3 次提交
  7. 02 4月, 2017 1 次提交
  8. 30 3月, 2017 1 次提交
  9. 29 3月, 2017 3 次提交
  10. 28 3月, 2017 3 次提交
    • S
      blk-throttle: add a mechanism to estimate IO latency · b9147dd1
      Shaohua Li 提交于
      User configures latency target, but the latency threshold for each
      request size isn't fixed. For a SSD, the IO latency highly depends on
      request size. To calculate latency threshold, we sample some data, eg,
      average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
      threshold of each request size will be the sample latency (I'll call it
      base latency) plus latency target. For example, the base latency for
      request size 4k is 80us and user configures latency target 60us. The 4k
      latency threshold will be 80 + 60 = 140us.
      
      To sample data, we calculate the order base 2 of rounded up IO sectors.
      If the IO size is bigger than 1M, it will be accounted as 1M. Since the
      calculation does round up, the base latency will be slightly smaller
      than actual value. Also if there isn't any IO dispatched for a specific
      IO size, we will use the base latency of smaller IO size for this IO
      size.
      
      But we shouldn't sample data at any time. The base latency is supposed
      to be latency where disk isn't congested, because we use latency
      threshold to schedule IOs between cgroups. If disk is congested, the
      latency is higher, using it for scheduling is meaningless. Hence we only
      do the sampling when block throttling is in the LOW limit, with
      assumption disk isn't congested in such state. If the assumption isn't
      true, eg, low limit is too high, calculated latency threshold will be
      higher.
      
      Hard disk is completely different. Latency depends on spindle seek
      instead of request size. Currently this feature is SSD only, we probably
      can use a fixed threshold like 4ms for hard disk though.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b9147dd1
    • S
      block: track request size in blk_issue_stat · 88eeca49
      Shaohua Li 提交于
      Currently there is no way to know the request size when the request is
      finished. Next patch will need this info. We could add extra field to
      record the size, but blk_issue_stat has enough space to record it, so
      this patch just overloads blk_issue_stat. With this, we will have 49bits
      to track time, which still is very long time.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      88eeca49
    • S
      blk-throttle: add a simple idle detection · 9e234eea
      Shaohua Li 提交于
      A cgroup gets assigned a low limit, but the cgroup could never dispatch
      enough IO to cross the low limit. In such case, the queue state machine
      will remain in LIMIT_LOW state and all other cgroups will be throttled
      according to low limit. This is unfair for other cgroups. We should
      treat the cgroup idle and upgrade the state machine to lower state.
      
      We also have a downgrade logic. If the state machine upgrades because of
      cgroup idle (real idle), the state machine will downgrade soon as the
      cgroup is below its low limit. This isn't what we want. A more
      complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
      when queue gets upgraded to lower state, other cgroups could dispatch
      more IO and this cgroup can't dispatch enough IO, so the cgroup is below
      its low limit and looks like idle (fake idle). In this case, the queue
      should downgrade soon. The key to determine if we should do downgrade is
      to detect if cgroup is truely idle.
      
      Unfortunately it's very hard to determine if a cgroup is real idle. This
      patch uses the 'think time check' idea from CFQ for the purpose. Please
      note, the idea doesn't work for all workloads. For example, a workload
      with io depth 8 has disk utilization 100%, hence think time is 0, eg,
      not idle. But the workload can run higher bandwidth with io depth 16.
      Compared to io depth 16, the io depth 8 workload is idle. We use the
      idea to roughly determine if a cgroup is idle.
      
      We treat a cgroup idle if its think time is above a threshold (by
      default 1ms for SSD and 100ms for HD). The idea is think time above the
      threshold will start to harm performance. HD is much slower so a longer
      think time is ok.
      
      The patch (and the latter patches) uses 'unsigned long' to track time.
      We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
      precision, should not a big deal.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      9e234eea
  11. 25 3月, 2017 3 次提交
  12. 24 3月, 2017 1 次提交
  13. 23 3月, 2017 8 次提交
  14. 22 3月, 2017 3 次提交
    • R
      iommu: Disambiguate MSI region types · 9d3a4de4
      Robin Murphy 提交于
      The introduction of reserved regions has left a couple of rough edges
      which we could do with sorting out sooner rather than later. Since we
      are not yet addressing the potential dynamic aspect of software-managed
      reservations and presenting them at arbitrary fixed addresses, it is
      incongruous that we end up displaying hardware vs. software-managed MSI
      regions to userspace differently, especially since ARM-based systems may
      actually require one or the other, or even potentially both at once,
      (which iommu-dma currently has no hope of dealing with at all). Let's
      resolve the former user-visible inconsistency ASAP before the ABI has
      been baked into a kernel release, in a way that also lays the groundwork
      for the latter shortcoming to be addressed by follow-up patches.
      
      For clarity, rename the software-managed type to IOMMU_RESV_SW_MSI, use
      IOMMU_RESV_MSI to describe the hardware type, and document everything a
      little bit. Since the x86 MSI remapping hardware falls squarely under
      this meaning of IOMMU_RESV_MSI, apply that type to their regions as well,
      so that we tell the same story to userspace across all platforms.
      
      Secondly, as the various region types require quite different handling,
      and it really makes little sense to ever try combining them, convert the
      bitfield-esque #defines to a plain enum in the process before anyone
      gets the wrong impression.
      
      Fixes: d30ddcaa ("iommu: Add a new type field in iommu_resv_region")
      Reviewed-by: NEric Auger <eric.auger@redhat.com>
      CC: Alex Williamson <alex.williamson@redhat.com>
      CC: David Woodhouse <dwmw2@infradead.org>
      CC: kvm@vger.kernel.org
      Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      9d3a4de4
    • P
      hwmon: Add missing HWMON_T_ALARM · a5023a99
      Peter Huewe 提交于
      Unfortunately the HWMON_T_ALARM define was missing,
      although the associated entry was present in hwmon_temp_attributes.
      This is needed to convert drivers to the new interface which use channel
      based alarms.
      Signed-off-by: NPeter Huewe <peterhuewe@gmx.de>
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      a5023a99
    • S
      tcp: mark skbs with SCM_TIMESTAMPING_OPT_STATS · 4ef1b286
      Soheil Hassas Yeganeh 提交于
      SOF_TIMESTAMPING_OPT_STATS can be enabled and disabled
      while packets are collected on the error queue.
      So, checking SOF_TIMESTAMPING_OPT_STATS in sk->sk_tsflags
      is not enough to safely assume that the skb contains
      OPT_STATS data.
      
      Add a bit in sock_exterr_skb to indicate whether the
      skb contains opt_stats data.
      
      Fixes: 1c885808 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
      Reported-by: NJongHwan Kim <zzoru007@gmail.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4ef1b286