1. 01 5月, 2017 1 次提交
  2. 28 4月, 2017 3 次提交
  3. 21 4月, 2017 8 次提交
  4. 20 4月, 2017 3 次提交
  5. 19 4月, 2017 4 次提交
    • P
      block, bfq: improve responsiveness · 44e44a1b
      Paolo Valente 提交于
      This patch introduces a simple heuristic to load applications quickly,
      and to perform the I/O requested by interactive applications just as
      quickly. To this purpose, both a newly-created queue and a queue
      associated with an interactive application (we explain in a moment how
      BFQ decides whether the associated application is interactive),
      receive the following two special treatments:
      
      1) The weight of the queue is raised.
      
      2) The queue unconditionally enjoys device idling when it empties; in
      fact, if the requests of a queue are sync, then performing device
      idling for the queue is a necessary condition to guarantee that the
      queue receives a fraction of the throughput proportional to its weight
      (see [1] for details).
      
      For brevity, we call just weight-raising the combination of these
      two preferential treatments. For a newly-created queue,
      weight-raising starts immediately and lasts for a time interval that:
      1) depends on the device speed and type (rotational or
      non-rotational), and 2) is equal to the time needed to load (start up)
      a large-size application on that device, with cold caches and with no
      additional workload.
      
      Finally, as for guaranteeing a fast execution to interactive,
      I/O-related tasks (such as opening a file), consider that any
      interactive application blocks and waits for user input both after
      starting up and after executing some task. After a while, the user may
      trigger new operations, after which the application stops again, and
      so on. Accordingly, the low-latency heuristic weight-raises again a
      queue in case it becomes backlogged after being idle for a
      sufficiently long (configurable) time. The weight-raising then lasts
      for the same time as for a just-created queue.
      
      According to our experiments, the combination of this low-latency
      heuristic and of the improvements described in the previous patch
      allows BFQ to guarantee a high application responsiveness.
      
      [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
          Scheduler", Proceedings of the First Workshop on Mobile System
          Technologies (MST-2015), May 2015.
          http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdfSigned-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      44e44a1b
    • A
      block, bfq: add full hierarchical scheduling and cgroups support · e21b7a0b
      Arianna Avanzini 提交于
      Add complete support for full hierarchical scheduling, with a cgroups
      interface. Full hierarchical scheduling is implemented through the
      'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
      associated with processes, and groups are represented in general by
      entities. Given the bfq_queues associated with the processes belonging
      to a given group, the entities representing these queues are sons of
      the entity representing the group. At higher levels, if a group, say
      G, contains other groups, then the entity representing G is the parent
      entity of the entities representing the groups in G.
      
      Hierarchical scheduling is performed as follows: if the timestamps of
      a leaf entity (i.e., of a bfq_queue) change, and such a change lets
      the entity become the next-to-serve entity for its parent entity, then
      the timestamps of the parent entity are recomputed as a function of
      the budget of its new next-to-serve leaf entity. If the parent entity
      belongs, in its turn, to a group, and its new timestamps let it become
      the next-to-serve for its parent entity, then the timestamps of the
      latter parent entity are recomputed as well, and so on. When a new
      bfq_queue must be set in service, the reverse path is followed: the
      next-to-serve highest-level entity is chosen, then its next-to-serve
      child entity, and so on, until the next-to-serve leaf entity is
      reached, and the bfq_queue that this entity represents is set in
      service.
      
      Writeback is accounted for on a per-group basis, i.e., for each group,
      the async I/O requests of the processes of the group are enqueued in a
      distinct bfq_queue, and the entity associated with this queue is a
      child of the entity associated with the group.
      
      Weights can be assigned explicitly to groups and processes through the
      cgroups interface, differently from what happens, for single
      processes, if the cgroups interface is not used (as explained in the
      description of the previous patch). In particular, since each node has
      a full scheduler, each group can be assigned its own weight.
      Signed-off-by: NFabio Checconi <fchecconi@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e21b7a0b
    • P
      block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler · aee69d78
      Paolo Valente 提交于
      We tag as v0 the version of BFQ containing only BFQ's engine plus
      hierarchical support. BFQ's engine is introduced by this commit, while
      hierarchical support is added by next commit. We use the v0 tag to
      distinguish this minimal version of BFQ from the versions containing
      also the features and the improvements added by next commits. BFQ-v0
      coincides with the version of BFQ submitted a few years ago [1], apart
      from the introduction of preemption, described below.
      
      BFQ is a proportional-share I/O scheduler, whose general structure,
      plus a lot of code, are borrowed from CFQ.
      
      - Each process doing I/O on a device is associated with a weight and a
        (bfq_)queue.
      
      - BFQ grants exclusive access to the device, for a while, to one queue
        (process) at a time, and implements this service model by
        associating every queue with a budget, measured in number of
        sectors.
      
        - After a queue is granted access to the device, the budget of the
          queue is decremented, on each request dispatch, by the size of the
          request.
      
        - The in-service queue is expired, i.e., its service is suspended,
          only if one of the following events occurs: 1) the queue finishes
          its budget, 2) the queue empties, 3) a "budget timeout" fires.
      
          - The budget timeout prevents processes doing random I/O from
            holding the device for too long and dramatically reducing
            throughput.
      
          - Actually, as in CFQ, a queue associated with a process issuing
            sync requests may not be expired immediately when it empties. In
            contrast, BFQ may idle the device for a short time interval,
            giving the process the chance to go on being served if it issues
            a new request in time. Device idling typically boosts the
            throughput on rotational devices, if processes do synchronous
            and sequential I/O. In addition, under BFQ, device idling is
            also instrumental in guaranteeing the desired throughput
            fraction to processes issuing sync requests (see [2] for
            details).
      
            - With respect to idling for service guarantees, if several
              processes are competing for the device at the same time, but
              all processes (and groups, after the following commit) have
              the same weight, then BFQ guarantees the expected throughput
              distribution without ever idling the device. Throughput is
              thus as high as possible in this common scenario.
      
        - Queues are scheduled according to a variant of WF2Q+, named
          B-WF2Q+, and implemented using an augmented rb-tree to preserve an
          O(log N) overall complexity.  See [2] for more details. B-WF2Q+ is
          also ready for hierarchical scheduling. However, for a cleaner
          logical breakdown, the code that enables and completes
          hierarchical support is provided in the next commit, which focuses
          exactly on this feature.
      
        - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
          perfectly fair, and smooth service. In particular, B-WF2Q+
          guarantees that each queue receives a fraction of the device
          throughput proportional to its weight, even if the throughput
          fluctuates, and regardless of: the device parameters, the current
          workload and the budgets assigned to the queue.
      
        - The last, budget-independence, property (although probably
          counterintuitive in the first place) is definitely beneficial, for
          the following reasons:
      
          - First, with any proportional-share scheduler, the maximum
            deviation with respect to an ideal service is proportional to
            the maximum budget (slice) assigned to queues. As a consequence,
            BFQ can keep this deviation tight not only because of the
            accurate service of B-WF2Q+, but also because BFQ *does not*
            need to assign a larger budget to a queue to let the queue
            receive a higher fraction of the device throughput.
      
          - Second, BFQ is free to choose, for every process (queue), the
            budget that best fits the needs of the process, or best
            leverages the I/O pattern of the process. In particular, BFQ
            updates queue budgets with a simple feedback-loop algorithm that
            allows a high throughput to be achieved, while still providing
            tight latency guarantees to time-sensitive applications. When
            the in-service queue expires, this algorithm computes the next
            budget of the queue so as to:
      
            - Let large budgets be eventually assigned to the queues
              associated with I/O-bound applications performing sequential
              I/O: in fact, the longer these applications are served once
              got access to the device, the higher the throughput is.
      
            - Let small budgets be eventually assigned to the queues
              associated with time-sensitive applications (which typically
              perform sporadic and short I/O), because, the smaller the
              budget assigned to a queue waiting for service is, the sooner
              B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
      
      - Weights can be assigned to processes only indirectly, through I/O
        priorities, and according to the relation:
        weight = 10 * (IOPRIO_BE_NR - ioprio).
        The next patch provides, instead, a cgroups interface through which
        weights can be assigned explicitly.
      
      - If several processes are competing for the device at the same time,
        but all processes and groups have the same weight, then BFQ
        guarantees the expected throughput distribution without ever idling
        the device. It uses preemption instead. Throughput is then much
        higher in this common scenario.
      
      - ioprio classes are served in strict priority order, i.e.,
        lower-priority queues are not served as long as there are
        higher-priority queues.  Among queues in the same class, the
        bandwidth is distributed in proportion to the weight of each
        queue. A very thin extra bandwidth is however guaranteed to the Idle
        class, to prevent it from starving.
      
      - If the strict_guarantees parameter is set (default: unset), then BFQ
           - always performs idling when the in-service queue becomes empty;
           - forces the device to serve one I/O request at a time, by
             dispatching a new request only if there is no outstanding
             request.
        In the presence of differentiated weights or I/O-request sizes,
        both the above conditions are needed to guarantee that every
        queue receives its allotted share of the bandwidth (see
        Documentation/block/bfq-iosched.txt for more details). Setting
        strict_guarantees may evidently affect throughput.
      
      [1] https://lkml.org/lkml/2008/4/1/234
          https://lkml.org/lkml/2008/11/11/148
      
      [2] P. Valente and M. Andreolini, "Improving Application
          Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
          the 5th Annual International Systems and Storage Conference
          (SYSTOR '12), June 2012.
          Slightly extended version:
          http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
      							results.pdf
      Signed-off-by: NFabio Checconi <fchecconi@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      aee69d78
    • C
      ACPI / doc: linuxized-acpica.txt: fix typos · bf4f5bf1
      Cao jin 提交于
      Fix some typos in the linuxized-acpica.txt document.
      Signed-off-by: NCao jin <caoj.fnst@cn.fujitsu.com>
      [ rjw: Subject / changelog ]
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      bf4f5bf1
  6. 17 4月, 2017 1 次提交
    • J
      lightnvm: physical block device (pblk) target · a4bd217b
      Javier González 提交于
      This patch introduces pblk, a host-side translation layer for
      Open-Channel SSDs to expose them like block devices. The translation
      layer allows data placement decisions, and I/O scheduling to be
      managed by the host, enabling users to optimize the SSD for their
      specific workloads.
      
      An open-channel SSD has a set of LUNs (parallel units) and a
      collection of blocks. Each block can be read in any order, but
      writes must be sequential. Writes may also fail, and if a block
      requires it, must also be reset before new writes can be
      applied.
      
      To manage the constraints, pblk maintains a logical to
      physical address (L2P) table,  write cache, garbage
      collection logic, recovery scheme, and logic to rate-limit
      user I/Os versus garbage collection I/Os.
      
      The L2P table is fully-associative and manages sectors at a
      4KB granularity. Pblk stores the L2P table in two places, in
      the out-of-band area of the media and on the last page of a
      line. In the cause of a power failure, pblk will perform a
      scan to recover the L2P table.
      
      The user data is organized into lines. A line is data
      striped across blocks and LUNs. The lines enable the host to
      reduce the amount of metadata to maintain besides the user
      data and makes it easier to implement RAID or erasure coding
      in the future.
      
      pblk implements multi-tenant support and can be instantiated
      multiple times on the same drive. Each instance owns a
      portion of the SSD - both regarding I/O bandwidth and
      capacity - providing I/O isolation for each case.
      
      Finally, pblk also exposes a sysfs interface that allows
      user-space to peek into the internals of pblk. The interface
      is available at /dev/block/*/pblk/ where * is the block
      device name exposed.
      
      This work also contains contributions from:
        Matias Bjørling <matias@cnexlabs.com>
        Simon A. F. Lund <slund@cnexlabs.com>
        Young Tack Jin <youngtack.jin@gmail.com>
        Huaicheng Li <huaicheng@cs.uchicago.edu>
      Signed-off-by: NJavier González <javier@cnexlabs.com>
      Signed-off-by: NMatias Bjørling <matias@cnexlabs.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      a4bd217b
  7. 15 4月, 2017 2 次提交
    • O
      blk-mq: introduce Kyber multiqueue I/O scheduler · 00e04393
      Omar Sandoval 提交于
      The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
      scale to multiple queues. Users configure only two knobs, the target
      read and synchronous write latencies, and the scheduler tunes itself to
      achieve that latency goal.
      
      The implementation is based on "tokens", built on top of the scalable
      bitmap library. Tokens serve as a mechanism for limiting requests. There
      are two tiers of tokens: queueing tokens and dispatch tokens.
      
      A queueing token is required to allocate a request. In fact, these
      tokens are actually the blk-mq internal scheduler tags, but the
      scheduler manages the allocation directly in order to implement its
      policy.
      
      Dispatch tokens are device-wide and split up into two scheduling
      domains: reads vs. writes. Each hardware queue dispatches batches
      round-robin between the scheduling domains as long as tokens are
      available for that domain.
      
      These tokens can be used as the mechanism to enable various policies.
      The policy Kyber uses is inspired by active queue management techniques
      for network routing, similar to blk-wbt. The scheduler monitors
      latencies and scales the number of dispatch tokens accordingly. Queueing
      tokens are used to prevent starvation of synchronous requests by
      asynchronous requests.
      
      Various extensions are possible, including better heuristics and ionice
      support. The new scheduler isn't set as the default yet.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      00e04393
    • C
      remove the mg_disk driver · 84253394
      Christoph Hellwig 提交于
      This drivers was added in 2008, but as far as a I can tell we never had a
      single platform that actually registered resources for the platform driver.
      
      It's also been unmaintained for a long time and apparently has a ATA mode
      that can be driven using the IDE/libata subsystem.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      84253394
  8. 14 4月, 2017 8 次提交
  9. 13 4月, 2017 1 次提交
    • M
      ACPI / scan: Drop support for force_remove · ffc10d82
      Michal Hocko 提交于
      /sys/firmware/acpi/hotplug/force_remove was presumably added to support
      auto offlining in the past. This is, however, inherently dangerous for
      some hotplugable resources like memory. The memory offlining fails when
      the memory is still in use and cannot be dropped or migrated. If we
      ignore the failure we are basically allowing for subtle memory
      corruption or a crash.
      
      We have actually noticed the later while hitting BUG() during the memory
      hotremove (remove_memory):
      	ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
      			check_memblock_offlined_cb);
      	if (ret)
      		BUG();
      
      it took us quite non-trivial time realize that the customer had
      force_remove enabled. Even if the BUG was removed here and we could
      propagate the error up the call chain it wouldn't help at all because
      then we would hit a crash or a memory corruption later and harder to
      debug. So force_remove is unfixable for the memory hotremove. We haven't
      checked other hotplugable resources to be prone to a similar problems.
      
      Remove the force_remove functionality because it is not fixable currently.
      Keep the sysfs file and report an error if somebody tries to enable it.
      Encourage users to report about the missing functionality and work with
      them with an alternative solution.
      Reviewed-by: NLee, Chun-Yi <jlee@suse.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      ffc10d82
  10. 12 4月, 2017 9 次提交