1. 26 Aug, 2017 1 commit
  2. 11 Jul, 2017 1 commit
    • block: call bio_uninit in bio_endio · b222dd2f
      Shaohua Li committed
      bio_free() isn't a good place to free cgroup info. There are many
      cases where a bio is allocated in a special way (for example, on the
      stack) and never goes through bio_put(), and hence bio_free(), so we
      leak memory. This patch moves the free to bio_endio(), which should be
      called anyway. The bio_uninit() call in bio_free() is kept, in case
      bio_endio() is never called for the bio.
      
      This assumes ->bi_end_io() doesn't access cgroup info, which seems true
      in my audit.
      
      This along with Christoph's integrity patch should fix the memory leak
      issue.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b222dd2f
  3. 04 Jul, 2017 3 commits
  4. 29 Jun, 2017 1 commit
    • block: provide bio_uninit() free freeing integrity/task associations · 9ae3b3f5
      Jens Axboe committed
      Wen reports significant memory leaks with DIF and O_DIRECT:
      
      "With nvme devive + T10 enabled, On a system it has 256GB and started
      logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
      it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
      leaking.
      
      /proc/meminfo | grep SUnreclaim...
      
      SUnreclaim:      6752128 kB
      SUnreclaim:      6874880 kB
      SUnreclaim:      7238080 kB
      ....
      SUnreclaim:     22307264 kB
      SUnreclaim:     22485888 kB
      SUnreclaim:     22720256 kB
      
      When testcases with T10 enabled call into __blkdev_direct_IO_simple,
      code doesn't free memory allocated by bio_integrity_alloc. The patch
      fixes the issue. HTX has been run with +60 hours without failure."
      
      Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
      doesn't go through the regular bio free. This means that any ancillary
      data allocated with the bio through the stack is not freed. Hence, we
      can leak the integrity data associated with the bio, if the device is
      using DIF/DIX.
      
      Fix this by providing a bio_uninit() and export it, so that we can use
      it to free this data. Note that this is a minimal fix for this issue.
      Any current user of bio's that are allocated outside of
      bio_alloc_bioset() suffers from this issue, most notably some drivers.
      We will fix those in a more comprehensive patch for 4.13. This also
      means that the commit marked as being fixed by this isn't the real
      culprit, it's just the most obvious one out there.
      
      Fixes: 542ff7bf ("block: new direct I/O implementation")
      Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9ae3b3f5
  5. 28 Jun, 2017 1 commit
  6. 19 Jun, 2017 3 commits
  7. 16 Jun, 2017 1 commit
  8. 09 Jun, 2017 1 commit
  9. 12 Apr, 2017 1 commit
  10. 07 Apr, 2017 1 commit
    • block: trace completion of all bios. · fbbaf700
      NeilBrown committed
      Currently only dm and md/raid5 bios trigger
      trace_block_bio_complete().  Now that we have bio_chain() and
      bio_inc_remaining(), it is not possible, in general, for a driver to
      know when the bio is really complete.  Only bio_endio() knows that.
      
      So move the trace_block_bio_complete() call to bio_endio().
      
      Now trace_block_bio_complete() pairs with trace_block_bio_queue().
      Any bio for which a 'queue' event is traced, will subsequently
      generate a 'complete' event.
      
      There are a few cases where completion tracing is not wanted.
      1/ If blk_update_request() has already generated a completion
         trace event at the 'request' level, there is no point generating
         one at the bio level too.  In this case the bi_sector and bi_size
         will have changed, so the bio level event would be wrong
      
      2/ If the bio hasn't actually been queued yet, but is being aborted
         early, then a trace event could be confusing.  Some filesystems
         call bio_endio() but do not want tracing.
      
      3/ The bio_integrity code interposes itself by replacing bi_end_io,
         then restoring it and calling bio_endio() again.  This would produce
         two identical trace events if left like that.
      
      To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
      produce the trace event when this is set.
      We address point 1 above by clearing the flag in blk_update_request().
      We address point 2 above by only setting the flag when
      generic_make_request() is called.
      We address point 3 above by clearing the flag after generating a
      completion event.
      
      When bio_split() is used on a bio, particularly in blk_queue_split(),
      there is an extra complication.  A new bio is split off the front, and
      may be handled directly without going through generic_make_request().
      The old bio, which has been advanced, is passed to
      generic_make_request(), so it will trigger a trace event a second
      time.
      Probably the best result when a split happens is to see a single
      'queue' event for the whole bio, then multiple 'complete' events - one
      for each component.  To achieve this we can:
      - copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
      - avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
      This way, the split-off bio won't create a queue event, the original
      won't either even if it is re-submitted to generic_make_request(),
      but both will produce completion events, each for their own range.
      
      So if generic_make_request() is called (which generates a QUEUED
      event), then bio_endio() will create a single COMPLETE event for each
      range that the bio is split into, unless the driver has explicitly
      requested it not to.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      fbbaf700
  11. 28 Mar, 2017 1 commit
    • blk-throttle: add a simple idle detection · 9e234eea
      Shaohua Li committed
      A cgroup gets assigned a low limit, but the cgroup could never dispatch
      enough IO to cross the low limit. In such case, the queue state machine
      will remain in LIMIT_LOW state and all other cgroups will be throttled
      according to low limit. This is unfair for other cgroups. We should
      treat the cgroup idle and upgrade the state machine to lower state.
      
      We also have a downgrade logic. If the state machine upgrades because of
      cgroup idle (real idle), the state machine will downgrade soon as the
      cgroup is below its low limit. This isn't what we want. A more
      complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
      when queue gets upgraded to lower state, other cgroups could dispatch
      more IO and this cgroup can't dispatch enough IO, so the cgroup is below
      its low limit and looks like idle (fake idle). In this case, the queue
      should downgrade soon. The key to determine if we should do downgrade is
      to detect if cgroup is truly idle.
      
      Unfortunately it's very hard to determine if a cgroup is really idle. This
      patch uses the 'think time check' idea from CFQ for the purpose. Please
      note, the idea doesn't work for all workloads. For example, a workload
      with io depth 8 has disk utilization 100%, hence think time is 0, i.e.,
      not idle. But the workload can run higher bandwidth with io depth 16.
      Compared to io depth 16, the io depth 8 workload is idle. We use the
      idea to roughly determine if a cgroup is idle.
      
      We treat a cgroup idle if its think time is above a threshold (by
      default 1ms for SSD and 100ms for HD). The idea is think time above the
      threshold will start to harm performance. HD is much slower so a longer
      think time is ok.
      
      The patch (and the later patches) uses 'unsigned long' to track time.
      We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
      precision, which should not be a big deal.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9e234eea
  12. 26 Mar, 2017 1 commit
  13. 25 Mar, 2017 1 commit
  14. 23 Mar, 2017 1 commit
  15. 12 Mar, 2017 1 commit
    • blk: Ensure users for current->bio_list can see the full list. · f5fe1b51
      NeilBrown committed
      Commit 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      changed current->bio_list so that it did not contain *all* of the
      queued bios, but only those submitted by the currently running
      make_request_fn.
      
      There are two places which walk the list and requeue selected bios,
      and others that check if the list is empty.  These are no longer
      correct.
      
      So redefine current->bio_list to point to an array of two lists, which
      contain all queued bios, and adjust various code to test or walk both
      lists.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Fixes: 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f5fe1b51
  16. 16 Feb, 2017 1 commit
  17. 02 Feb, 2017 1 commit
  18. 01 Feb, 2017 1 commit
    • block: fold cmd_type into the REQ_OP_ space · aebf526b
      Christoph Hellwig committed
      Instead of keeping two levels of indirection for request types, fold it
      all into the operations.  The little caveat here is that previously
      cmd_type only applied to struct request, while the request and bio op
      fields were set to plain REQ_OP_READ/WRITE even for passthrough
      operations.
      
      Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
      private requests, although it has to add two for each so that we
      can communicate the data in/out nature of the request.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      aebf526b
  19. 09 Dec, 2016 1 commit
    • block: improve handling of the magic discard payload · f9d03f96
      Christoph Hellwig committed
      Instead of allocating a single unused biovec for discard requests, send
      them down without any payload.  Instead we allow the driver to add a
      "special" payload using a biovec embedded into struct request (unioned
      over other fields never used while in the driver), and overloading
      the number of segments for this case.
      
      This has a couple of advantages:
      
       - we don't have to allocate the bio_vec
       - the amount of special casing for discard requests in the block
         layer is significantly reduced
       - using this same scheme for other request types is trivial,
         which will be important for implementing the new WRITE_ZEROES
         op on devices where it actually requires a payload (e.g. SCSI)
       - we can get rid of playing games with the request length, as
         we'll never touch it and completions will work just fine
       - it will allow us to support ranged discard operations in the
         future by merging non-contiguous discard bios into a single
         request
       - last but not least it removes a lot of code
      
      This patch is the common base for my WIP series for ranged discards and to
      remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
      so it would be good to get it in quickly.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f9d03f96
  20. 01 Dec, 2016 1 commit
  21. 30 Nov, 2016 1 commit
    • block: add bio_iov_iter_get_pages() · 38161995
      Kent Overstreet committed
      This is a helper that pins down a range from an iov_iter and adds it to
      a bio without requiring a separate memory allocation for the page array.
      It will be used for upcoming direct I/O implementations for block devices
      and iomap based file systems.
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [hch: ported to the iov_iter interface, renamed and added comments.
            All blame should be directed to me and all fame should go to Kent
            after this!]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      
      (cherry picked from commit 9cd56d916aa481ce8f56d9c5302a6ed90c2e0b5f)
      38161995
  22. 22 Nov, 2016 1 commit
  23. 03 Nov, 2016 1 commit
  24. 22 Sep, 2016 1 commit
  25. 14 Sep, 2016 1 commit
  26. 16 Aug, 2016 1 commit
  27. 08 Aug, 2016 1 commit
    • block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe committed
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokenness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1eff9d32
  28. 05 Aug, 2016 1 commit
  29. 21 Jul, 2016 2 commits
  30. 08 Jun, 2016 2 commits
  31. 06 May, 2016 1 commit
    • block: make bio_inc_remaining() interface accessible again · 0ef5a50c
      Mike Snitzer committed
      Commit 326e1dbb ("block: remove management of bi_remaining when
      restoring original bi_end_io") made bio_inc_remaining() private to bio.c
      because the only use-case that made sense was confined to the
      bio_chain() interface.
      
      Since that time DM thinp went on to use bio_chain() in its relatively
      complex implementation of async discard support.  That implementation,
      even when converted over to use the new async __blkdev_issue_discard()
      interface, depends on deferred completion of the original discard bio --
      which is most appropriately implemented using bio_inc_remaining().
      
      DM thinp foolishly duplicated bio_inc_remaining(), local to dm-thin.c as
      __bio_inc_remaining(), so re-exporting bio_inc_remaining() allows us to
      put an end to that foolishness.
      
      All said, bio_inc_remaining() should really only be used in conjunction
      with bio_chain().  It isn't intended for generic bio reference counting.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0ef5a50c
  32. 05 Apr, 2016 2 commits
    • mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage · ea1754a0
      Kirill A. Shutemov committed
      Mostly direct substitution with occasional adjustment or removing
      outdated comments.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea1754a0
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov committed
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and is unlikely to.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  33. 14 Mar, 2016 1 commit