1. 03 Aug 2018, 1 commit
  2. 02 Aug 2018, 1 commit
  3. 01 Aug 2018, 3 commits
  4. 30 Jul 2018, 2 commits
  5. 27 Jul 2018, 3 commits
  6. 25 Jul 2018, 4 commits
  7. 23 Jul 2018, 2 commits
  8. 18 Jul 2018, 5 commits
    • blkcg: Track DISCARD statistics and output them in cgroup io.stat · 636620b6
      Committed by Tejun Heo
      Add tracking of REQ_OP_DISCARD ios to the per-cgroup io.stat.  Two
      fields, dbytes and dios, are added to count the total bytes and
      number of discards, respectively.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Cc: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Track DISCARD statistics and output them in stat and diskstat · bdca3c87
      Committed by Michael Callahan
      Add tracking of REQ_OP_DISCARD ios to the partition statistics and
      append them to the various stat files in /sys as well as
      /proc/diskstats.  These are tracked with the same four stats as reads
      and writes:
      
      Number of discard ios completed.
      Number of discard ios merged
      Number of discard sectors completed
      Milliseconds spent on discard requests
      
      This is done via adding a new STAT_DISCARD define to genhd.h and then
      using it to index that stat field for discard requests.
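
      The shape of the change can be sketched as follows; this is an
      illustrative fragment rather than the literal patch, and it assumes the
      stat_group enum and part_stat_* accessors described in the entries
      further down:

          enum stat_group {
                  STAT_READ,
                  STAT_WRITE,
                  STAT_DISCARD,   /* new: indexes discard ios/merges/sectors/ticks */

                  NR_STAT_GROUPS
          };

          /* Hypothetical helper, for illustration only: on completion, discards
           * are accounted like reads and writes, just under the new index. */
          static void account_discard_completion(int cpu, struct hd_struct *part,
                                                 unsigned int nr_sectors,
                                                 unsigned long duration)
          {
                  part_stat_inc(cpu, part, ios[STAT_DISCARD]);
                  part_stat_add(cpu, part, sectors[STAT_DISCARD], nr_sectors);
                  part_stat_add(cpu, part, ticks[STAT_DISCARD], duration);
          }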
      
      tj: Refreshed on top of v4.17 and other previous updates.
      Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Add and use op_stat_group() for indexing disk_stat fields. · ddcf35d3
      Committed by Michael Callahan
      Add and use a new op_stat_group() function for indexing partition stat
      fields rather than indexing them by rq_data_dir() or bio_data_dir().
      This function works similarly to op_is_sync() in that it takes the
      request::cmd_flags or bio::bi_opf flags and determines which stats
      should get updated.
      
      In addition, the second parameter to generic_start_io_acct() and
      generic_end_io_acct() is now a REQ_OP rather than simply a read or
      write bit and it uses op_stat_group() on the parameter to determine
      the stat group.
      
      Note that the partition in_flight counts are not part of the per-cpu
      statistics and as such are not indexed via this function.  They are
      now indexed by op_is_write().
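
      A minimal sketch of what such a helper can look like (illustrative,
      shown here with the discard group added by the commits above; the exact
      mapping lives in the block headers, and op_is_discard()/op_is_write()
      style helpers are assumed):

          /* Map a request/bio operation to the stat group it is counted in. */
          static inline int op_stat_group(unsigned int op)
          {
                  if (op_is_discard(op))
                          return STAT_DISCARD;
                  return op_is_write(op) ? STAT_WRITE : STAT_READ;
          }

      Callers then index the per-partition arrays with it, e.g.
      part_stat_inc(cpu, part, ios[op_stat_group(req_op(req))]).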
      
      tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.
      Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Matias Bjorling <mb@lightnvm.io>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Define and use STAT_READ and STAT_WRITE · dbae2c55
      Committed by Michael Callahan
      Add defines for STAT_READ and STAT_WRITE for indexing the partition
      stat entries. This clarifies some fs/ code which has hardcoded 1 for
      STAT_WRITE and will make it easier to extend the stats with additional
      fields.
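
      As a hedged illustration of the kind of cleanup this enables (a generic
      fragment, not a literal hunk from fs/):

          /* Before: the write slot of the per-partition counters is hardcoded. */
          sectors_written = part_stat_read(part, sectors[1]);

          /* After: the same read, with the intent spelled out. */
          sectors_written = part_stat_read(part, sectors[STAT_WRITE]);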
      
      tj: Refreshed on top of v4.17.
      Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: issue directly if hw queue isn't busy in case of 'none' · 6ce3dd6e
      Committed by Ming Lei
      In case of the 'none' io scheduler, when the hw queue isn't busy, it
      isn't necessary to enqueue the request to the sw queue and dequeue it
      again, because the request can be submitted to the hw queue directly
      without extra cost.  Meanwhile there shouldn't be many requests in the
      sw queue, so we don't need to worry about the effect on IO merging.
      
      There are still some single hw queue SCSI HBAs (HPSA, megaraid_sas, ...)
      which may be connected to high-performance devices, so 'none' is often
      required for obtaining good performance.
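
      The decision point can be sketched roughly like this (simplified logic
      of the submission path, not the literal patch; dispatch_busy is the
      busy-detection state added by the sibling commit further down):

          if (!q->elevator && !data.hctx->dispatch_busy) {
                  /* hw queue looks idle: skip the per-cpu sw queue and hand
                   * the request to the driver right away */
                  blk_mq_try_issue_directly(data.hctx, rq, &cookie);
          } else {
                  /* busy, or a real scheduler is attached: insert into the
                   * sw queue / scheduler as before so requests can merge */
                  blk_mq_sched_insert_request(rq, false, true, true);
          }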
      
      This patch improves IOPS and decreases CPU utilization on megaraid_sas,
      per Kashyap's test.
      
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Laurence Oberman <loberman@redhat.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
      Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 17 Jul 2018, 2 commits
    • blk-iolatency: truncate our current time · 71e9690b
      Committed by Josef Bacik
      In our longer tests we noticed that some boxes would degrade to the
      point of uselessness.  This is because we truncate the current time when
      saving it in our bio, but I was using the raw current time to subtract
      from.  So once the box had been up a certain amount of time it would
      appear as if our IOs were taking several years to complete.  Fix this
      by truncating the current time so it matches the issue time.  Verified
      this worked by running with this patch for a week on our test tier.
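
      A sketch of the idea behind the fix: mask "now" with the same truncation
      applied when the issue time was stored in the bio, so the subtraction
      compares like with like (helper names per blk_types.h; treat the exact
      identifiers as assumptions):

          u64 start = bio_issue_time(issue);      /* truncated when stored */
          u64 now = ktime_to_ns(ktime_get());
          u64 lat = 0;

          /* Truncate "now" the same way, otherwise the difference is dominated
           * by the bits dropped from the stored issue time. */
          now = __bio_issue_time(now);
          if (now > start)
                  lat = now - start;
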
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iolatency: don't change the latency window · d607eefa
      Committed by Josef Bacik
      Early versions of these patches had us waiting for seconds at a time
      during submission, so we had to adjust the timing window we monitored
      for latency.  Now we don't do things like that, so this is unnecessary
      code.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 12 Jul 2018, 1 commit
    • bsg: remove read/write support · 28519c89
      Committed by Christoph Hellwig
      The code poses a security risk due to user memory access in ->release
      and had an API that can't be used reliably.  As far as we know it was
      never used for real, but if that turns out wrong we'll have to revert
      this commit and come up with a band aid.
      
      Jann Horn did look through software archives for users of this interface,
      and the only users found were example code in sg3_utils, and optional
      support in an optional module of the tgt user space iscsi target,
      which looks like a proof of concept extension of the /dev/sg
      read/write support.
      
      Tony Battersby chimes in that the code is basically unsafe to use in
      general:
      
        The read/write interface on /dev/bsg is impossible to use safely
        because the list of completed commands is per-device (bd->done_list)
        rather than per-fd like it is with /dev/sg.  So if program A and
        program B are both using the write/read interface on the same bsg
        device, then their command responses will get mixed up, and program
        A will read() some command results from program B and vice versa.
        So no, I don't use read/write on /dev/bsg.  From a security standpoint,
        it should definitely be fixed or removed.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 11 Jul 2018, 2 commits
  12. 09 Jul 2018, 14 commits
    • block: Add default switch case to blk_pm_allow_request() to kill warning · e9a83853
      Committed by Geert Uytterhoeven
      With gcc 4.9.0 and 7.3.0:
      
          block/blk-core.c: In function 'blk_pm_allow_request':
          block/blk-core.c:2747:2: warning: enumeration value 'RPM_ACTIVE' not handled in switch [-Wswitch]
            switch (rq->q->rpm_status) {
            ^
      
      Convert the return statement below the switch() block into a default
      case to fix this.
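
      In other words, something along these lines (a sketch of the shape of
      the fix, not the exact diff):

          static bool blk_pm_allow_request(struct request *rq)
          {
                  switch (rq->q->rpm_status) {
                  case RPM_SUSPENDED:
                          return false;
                  case RPM_SUSPENDING:
                  case RPM_RESUMING:
                          return rq->rq_flags & RQF_PM;
                  default:
                          /* RPM_ACTIVE and anything else: allow the request.
                           * Having a default arm also silences -Wswitch. */
                          return true;
                  }
          }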
      
      Fixes: e4f36b24 ("block: fix peeking requests during PM")
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix infinite loop if the device loses discard capability · b88aef36
      Committed by Mikulas Patocka
      If __blkdev_issue_discard is in progress and a device mapper device is
      reloaded with a table that doesn't support discard,
      q->limits.max_discard_sectors is set to zero.  This results in an
      infinite loop in __blkdev_issue_discard.
      
      This patch checks if max_discard_sectors is zero and aborts with
      -EOPNOTSUPP.
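
      Conceptually the guard looks like this (a sketch, not the literal hunk;
      the real check sits inside __blkdev_issue_discard's splitting loop):

          /* The discard range is split into chunks bounded by
           * q->limits.max_discard_sectors.  If that limit drops to 0 while the
           * loop runs (e.g. a dm table reload without discard support), each
           * iteration makes zero progress, so abort instead of spinning. */
          if (!q->limits.max_discard_sectors)
                  return -EOPNOTSUPP;
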
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Tested-by: Zdenek Kabelac <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, mm: remove unnecessary __GFP_HIGH flag · c137969b
      Committed by Shakeel Butt
      The flag GFP_ATOMIC already contains __GFP_HIGH, so there is no need to
      explicitly OR in __GFP_HIGH again.  Just remove the unnecessary __GFP_HIGH.
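
      For example, an allocation of the form below simplifies to plain
      GFP_ATOMIC (illustrative call site, not the specific line the patch
      touches):

          /* __GFP_HIGH is already part of GFP_ATOMIC, so OR-ing it in is a no-op. */
          buf = kmalloc(len, GFP_ATOMIC | __GFP_HIGH);    /* before */
          buf = kmalloc(len, GFP_ATOMIC);                 /* after: equivalent */
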
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: introduce blk-iolatency io controller · d7067512
      Committed by Josef Bacik
      Current IO controllers for the block layer are less than ideal for our
      use case.  The io.max controller is great at hard limiting, but it is
      not work conserving.  This patch introduces io.latency.  You provide a
      latency target for your group and we monitor the io in short windows to
      make sure we are not exceeding those latency targets.  This makes use of
      the rq-qos infrastructure and works much like the wbt stuff.  There are
      a few differences from wbt; a rough sketch of the control loop follows
      the list:
      
       - It's bio based, so the latency covers the whole block layer in addition to
         the actual io.
       - We will throttle all IO types that come in here if we need to.
       - We use the mean latency over the 100ms window.  This is because writes can
         be particularly fast, which could give us a false sense of the impact of
         other workloads on our protected workload.
       - By default there's no throttling, we set the queue_depth to INT_MAX so that
         we can have as many outstanding bios as we're allowed to.  Only at
         throttle time do we pay attention to the actual queue depth.
       - We backcharge cgroups for root cg issued IO and induce artificial
         delays in order to deal with cases like metadata only or swap heavy
         workloads.
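
      The control loop can be summarized roughly as follows; this is a
      simplified illustration of the behaviour described above, not
      blk-iolatency's actual code, and all names in it are hypothetical:

          struct iolat_grp_sketch {
                  u64 target_ns;              /* per-cgroup latency target */
                  u64 window_lat_sum;         /* latency accumulated this window */
                  u64 window_samples;
                  unsigned int queue_depth;   /* INT_MAX by default: unthrottled */
          };

          /* Once per (e.g. 100ms) window, compare the observed mean latency
           * against the target and scale the group's allowed queue depth. */
          static void iolat_window_check(struct iolat_grp_sketch *grp)
          {
                  u64 mean;

                  if (!grp->window_samples)
                          return;
                  mean = grp->window_lat_sum / grp->window_samples;

                  if (mean > grp->target_ns) {
                          /* missing the target: throttle harder */
                          if (grp->queue_depth > 1)
                                  grp->queue_depth /= 2;
                  } else if (grp->queue_depth < INT_MAX / 2) {
                          /* meeting the target: relax the throttling */
                          grp->queue_depth *= 2;
                  }

                  grp->window_lat_sum = 0;
                  grp->window_samples = 0;
          }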
      
      In testing this has worked out relatively well.  Protected workloads
      will throttle noisy workloads down to 1 io at a time if they are doing
      normal IO on their own, or induce up to a 1 second delay per syscall if
      they are doing a lot of root issued IO (metadata/swap IO).
      
      Our testing has revolved mostly around our production web servers where
      we have hhvm (the web server application) in a protected group and
      everything else in another group.  We see slightly higher requests per
      second (RPS) on the test tier vs the control tier, and much more stable
      RPS across all machines in the test tier vs the control tier.
      
      Another test we run is a slow memory allocator in the unprotected group.
      Before this would eventually push us into swap and cause the whole box
      to die and not recover at all.  With these patches we see slight RPS
      drops (usually 10-15%) before the memory consumer is properly killed and
      things recover within seconds.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • rq-qos: introduce dio_bio callback · 67b42d0b
      Committed by Josef Bacik
      wbt cares only about request completion time, but controllers may need
      information that is on the bio itself, so add a done_bio callback for
      rq-qos so things like blk-iolatency can use it to have the bio when it
      completes.
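
      A sketch of where the new hook sits in the rq-qos callback table (the
      field layout here is indicative only; blk-rq-qos.h has the authoritative
      definition):

          struct rq_qos_ops {
                  void (*throttle)(struct rq_qos *, struct bio *, spinlock_t *);
                  void (*issue)(struct rq_qos *, struct request *);
                  void (*requeue)(struct rq_qos *, struct request *);
                  void (*done)(struct rq_qos *, struct request *);   /* request completion: enough for wbt */
                  void (*done_bio)(struct rq_qos *, struct bio *);   /* new: hands blk-iolatency the bio at completion */
                  void (*cleanup)(struct rq_qos *, struct bio *);
                  void (*exit)(struct rq_qos *);
          };
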
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: remove external dependency on wbt_flags · c1c80384
      Committed by Josef Bacik
      We don't really need to save this stuff in the core block code; we can
      just pass the bio back into the helpers later on to derive the same
      flags and update the rq->wbt_flags appropriately.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-rq-qos: refactor out common elements of blk-wbt · a7905043
      Committed by Josef Bacik
      blkcg-qos is going to do essentially what wbt does, only on a cgroup
      basis.  Break out the common code that will be shared between blkcg-qos
      and wbt into blk-rq-qos.* so they can both utilize the same
      infrastructure.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-stat: export helpers for modifying blk_rq_stat · 2ecbf456
      Committed by Josef Bacik
      We need to use blk_rq_stat in the blkcg qos stuff, so export some of
      these helpers so they can be used by other things.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: add generic throttling mechanism · d09d8df3
      Committed by Josef Bacik
      Since IO can be issued from literally anywhere it's almost impossible to
      do throttling without having some sort of adverse effect somewhere else
      in the system because of locking or other dependencies.  The best way to
      solve this is to do the throttling when we know we aren't holding any
      other kernel resources.  Do this by tracking throttling on a per-blkg
      basis, and if throttling is required, flag the task so that it checks
      before returning to user space and possibly sleeps there.
      
      This is to address the case where a process is doing work that is
      generating IO that can't be throttled, whether that is directly with a
      lot of REQ_META IO, or indirectly by allocating so much memory that it
      is swamping the disk with REQ_SWAP.  We can't use task_work_add() as we
      don't want to induce a memory allocation in the IO path, so simply
      saving the request queue in the task and flagging it to do the
      notify_resume thing achieves the same result without the overhead of a
      memory allocation.
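
      A rough sketch of that flow (identifiers follow the description above
      and should be treated as assumptions, not the exact interface):

          /* Flag the current task: record which queue throttled it and arrange
           * for a check on the way back to user space.  The real code also
           * takes a reference on the queue. */
          void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay)
          {
                  if (current->throttle_queue)
                          return;
                  current->throttle_queue = q;
                  current->use_memdelay = use_memdelay;
                  set_notify_resume(current);
          }

          /* Runs from the return-to-userspace (notify_resume) path, where no
           * kernel locks are held, so it is safe to sleep off the delay here. */
          void blkcg_maybe_throttle_current(void)
          {
                  struct request_queue *q = current->throttle_queue;

                  if (!q)
                          return;
                  current->throttle_queue = NULL;
                  /* ... look up the task's blkg on q and sleep for its delay ... */
          }
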
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • swap,blkcg: issue swap io with the appropriate context · 0d3bd88d
      Committed by Tejun Heo
      For backcharging we need to know who the page belongs to when swapping
      it out.  We don't worry about things that do ->rw_page (zram etc) at the
      moment; we're only worried about pages that actually go to a block
      device.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-cgroup: allow controllers to output their own stats · 903d23f0
      Committed by Josef Bacik
      blk-iolatency has a few stats that it would like to print out, and
      instead of adding a bunch of crap to the generic code, just provide a
      helper so that controllers can add stuff to the stat line if they want
      to.
      
      Hide it behind a boot option since it changes the output of io.stat from
      normal, and these stats are only interesting to developers.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add bi_blkg to the bio for cgroups · 08e18eab
      Committed by Josef Bacik
      Currently io.low uses a bi_cg_private to stash its private data for the
      blkg, however other blkcg policies may want to use this as well.  Since
      we can get the private data out of the blkg, move this to bi_blkg in the
      bio and make it generic, then we can use bio_associate_blkg() to attach
      the blkg to the bio.
      
      Theoretically we could simply replace the bi_css with this since we can
      get to all the same information from the blkg, however you have to
      lookup the blkg, so for example wbc_init_bio() would have to lookup and
      possibly allocate the blkg for the css it was trying to attach to the
      bio.  This could be problematic and result in us either not attaching
      the css at all to the bio, or falling back to the root blkcg if we are
      unable to allocate the corresponding blkg.
      
      So for now do this, and in the future if possible we could just replace
      the bi_css with bi_blkg and update the helpers to do the correct
      translation.
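
      A sketch of the resulting association (field placement and helper usage
      are indicative, based on the description above):

          struct bio {
                  /* ... */
                  struct blkcg_gq  *bi_blkg;  /* generic per-policy blkcg data,
                                                 replaces io.low's bi_cg_private */
                  /* ... */
          };

          /* A policy that already holds a blkg attaches it directly: */
          bio_associate_blkg(bio, blkg);
          /* code that used bi_css can still reach the css via bio->bi_blkg */
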
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: dequeue request one by one from sw queue if hctx is busy · 6e768717
      Committed by Ming Lei
      It isn't efficient to dequeue requests one by one from the sw queue,
      but we have to do that when the queue is busy, for better merge
      performance.
      
      This patch uses an Exponential Weighted Moving Average (EWMA) to figure
      out whether the queue is busy, and only dequeues requests one by one
      from the sw queue when it is.
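
      The busy detection can be sketched as a small fixed-point EWMA update
      along these lines (the weight, shift, field and helper names are
      illustrative):

          #define DISPATCH_BUSY_EWMA_WEIGHT   8   /* decay weight */
          #define DISPATCH_BUSY_EWMA_FACTOR   4   /* fixed-point shift */

          /* Called after each dispatch attempt; busy means the driver pushed
           * back (e.g. BLK_STS_RESOURCE) and requests had to be re-queued. */
          static void update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
          {
                  unsigned int ewma = hctx->dispatch_busy;

                  if (!ewma && !busy)
                          return;

                  ewma *= DISPATCH_BUSY_EWMA_WEIGHT - 1;
                  if (busy)
                          ewma += 1 << DISPATCH_BUSY_EWMA_FACTOR;
                  ewma /= DISPATCH_BUSY_EWMA_WEIGHT;

                  /* non-zero means "busy": dequeue from the sw queue one by one */
                  hctx->dispatch_busy = ewma;
          }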
      
      Fixes: b347689f ("blk-mq-sched: improve dispatching from sw queue")
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Laurence Oberman <loberman@redhat.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
      Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: only attempt to merge bio if there is rq in sw queue · b04f50ab
      Committed by Ming Lei
      Only attempt to merge bio iff the ctx->rq_list isn't empty, because:
      
      1) for a high-performance SSD, dispatch usually succeeds, so there is
      often nothing left in ctx->rq_list; skipping the merge attempt when the
      sw queue is empty saves one acquisition of ctx->lock
      
      2) we can't expect good merge performance on the per-cpu sw queue anyway,
      and missing one merge there isn't a big deal since tasks can be scheduled
      from one CPU to another.
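
      The guard amounts to something like this in the sw-queue merge path
      (sketch; the surrounding names are as in the per-cpu merge code of that
      era and should be treated as indicative):

          /* Only take ctx->lock and scan the sw queue when there is actually
           * something in it to merge against. */
          if ((hctx->flags & BLK_MQ_F_SHOULD_MERGE) &&
              !list_empty_careful(&ctx->rq_list)) {
                  spin_lock(&ctx->lock);
                  merged = blk_mq_attempt_merge(q, ctx, bio);
                  spin_unlock(&ctx->lock);
          }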
      
      Cc: Laurence Oberman <loberman@redhat.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
      Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>