1. 11 1月, 2018 3 次提交
  2. 10 1月, 2018 2 次提交
    • T
      blk-mq: remove REQ_ATOM_STARTED · 5a61c363
      Tejun Heo 提交于
      After the recent updates to use generation number and state based
      synchronization, we can easily replace REQ_ATOM_STARTED usages by
      adding an extra state to distinguish completed but not yet freed
      state.
      
      Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
      blk_mq_rq_state() tests.  REQ_ATOM_STARTED no longer has any users
      left and is removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5a61c363
    • T
      blk-mq: replace timeout synchronization with a RCU and generation based scheme · 1d9bd516
      Tejun Heo 提交于
      Currently, blk-mq timeout path synchronizes against the usual
      issue/completion path using a complex scheme involving atomic
      bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
      rules.  Unfortunately, it contains quite a few holes.
      
      There's a complex dancing around REQ_ATOM_STARTED and
      REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
      they don't have a synchronization point across request recycle
      instances and it isn't clear what the barriers add.
      blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
      deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
      
      In fact, it's pretty easy to make blk_mq_check_expired() terminate a
      later instance of a request.  If we induce 5 sec delay before
      time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
      2s, and issue back-to-back large IOs, blk-mq starts timing out
      requests spuriously pretty quickly.  Nothing actually timed out.  It
      just made the call on a recycle instance of a request and then
      terminated a later instance long after the original instance finished.
      The scenario isn't theoretical either.
      
      This patch replaces the broken synchronization mechanism with a RCU
      and generation number based one.
      
      1. Each request has a u64 generation + state value, which can be
         updated only by the request owner.  Whenever a request becomes
         in-flight, the generation number gets bumped up too.  This provides
         the basis for the timeout path to distinguish different recycle
         instances of the request.
      
         Also, marking a request in-flight and setting its deadline are
         protected with a seqcount so that the timeout path can fetch both
         values coherently.
      
      2. The timeout path fetches the generation, state and deadline.  If
         the verdict is timeout, it records the generation into a dedicated
         request abortion field and does RCU wait.
      
      3. The completion path is also protected by RCU (from the previous
         patch) and checks whether the current generation number and state
         match the abortion field.  If so, it skips completion.
      
      4. The timeout path, after RCU wait, scans requests again and
         terminates the ones whose generation and state still match the ones
         requested for abortion.
      
         By now, the timeout path knows that either the generation number
         and state changed if it lost the race or the completion will yield
         to it and can safely timeout the request.
      
      While it's more lines of code, it's conceptually simpler, doesn't
      depend on direct use of subtle memory ordering or coherence, and
      hopefully doesn't terminate the wrong instance.
      
      While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
      between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
      removed yet as it's still used in other places.  Future patches will
      move all state tracking to the new mechanism and remove all bitops in
      the hot paths.
      
      Note that this patch adds a comment explaining a race condition in
      BLK_EH_RESET_TIMER path.  The race has always been there and this
      patch doesn't change it.  It's just documenting the existing race.
      
      v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
          - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
          - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
      
      v3: - Fixed possible extended seqcount / u64_stats_sync read looping
            spotted by Peter.
          - MQ_RQ_IDLE was incorrectly being set in complete_request instead
            of free_request.  Fixed.
      
      v4: - Rebased on top of hctx_lock() refactoring patch.
          - Added comment explaining the use of hctx_lock() in completion path.
      
      v5: - Added comments requested by Bart.
          - Note the addition of BLK_EH_RESET_TIMER race condition in the
            commit message.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1d9bd516
  3. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  4. 05 10月, 2017 1 次提交
  5. 03 10月, 2017 1 次提交
  6. 29 8月, 2017 1 次提交
  7. 24 8月, 2017 1 次提交
  8. 04 7月, 2017 1 次提交
  9. 28 6月, 2017 1 次提交
  10. 21 6月, 2017 1 次提交
  11. 02 6月, 2017 1 次提交
    • B
      block: Avoid that blk_exit_rl() triggers a use-after-free · b425e504
      Bart Van Assche 提交于
      Since the introduction of .init_rq_fn() and .exit_rq_fn() it is
      essential that the memory allocated for struct request_queue
      stays around until all blk_exit_rl() calls have finished. Hence
      make blk_init_rl() take a reference on struct request_queue.
      
      This patch fixes the following crash:
      
      general protection fault: 0000 [#2] SMP
      CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G      D         4.12.0-rc2-dbg+ #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
      task: ffff88013a108040 task.stack: ffffc9000071c000
      RIP: 0010:free_request_size+0x1a/0x30
      RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202
      RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003
      RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418
      RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009
      R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8
      R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a
      FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0
      Call Trace:
       mempool_destroy.part.10+0x21/0x40
       mempool_destroy+0xe/0x10
       blk_exit_rl+0x12/0x20
       blkg_free+0x4d/0xa0
       __blkg_release_rcu+0x59/0x170
       rcu_process_callbacks+0x260/0x4e0
       __do_softirq+0x116/0x250
       smpboot_thread_fn+0x123/0x1e0
       kthread+0x109/0x140
       ret_from_fork+0x31/0x40
      
      Fixes: commit e9c787e6 ("scsi: allocate scsi_cmnd structures as part of struct request")
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org> # v4.11+
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b425e504
  12. 20 4月, 2017 2 次提交
  13. 28 3月, 2017 4 次提交
    • S
      blk-throttle: add a mechanism to estimate IO latency · b9147dd1
      Shaohua Li 提交于
      User configures latency target, but the latency threshold for each
      request size isn't fixed. For a SSD, the IO latency highly depends on
      request size. To calculate latency threshold, we sample some data, eg,
      average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
      threshold of each request size will be the sample latency (I'll call it
      base latency) plus latency target. For example, the base latency for
      request size 4k is 80us and user configures latency target 60us. The 4k
      latency threshold will be 80 + 60 = 140us.
      
      To sample data, we calculate the order base 2 of rounded up IO sectors.
      If the IO size is bigger than 1M, it will be accounted as 1M. Since the
      calculation does round up, the base latency will be slightly smaller
      than actual value. Also if there isn't any IO dispatched for a specific
      IO size, we will use the base latency of smaller IO size for this IO
      size.
      
      But we shouldn't sample data at any time. The base latency is supposed
      to be latency where disk isn't congested, because we use latency
      threshold to schedule IOs between cgroups. If disk is congested, the
      latency is higher, using it for scheduling is meaningless. Hence we only
      do the sampling when block throttling is in the LOW limit, with
      assumption disk isn't congested in such state. If the assumption isn't
      true, eg, low limit is too high, calculated latency threshold will be
      higher.
      
      Hard disk is completely different. Latency depends on spindle seek
      instead of request size. Currently this feature is SSD only, we probably
      can use a fixed threshold like 4ms for hard disk though.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b9147dd1
    • S
      blk-throttle: add a simple idle detection · 9e234eea
      Shaohua Li 提交于
      A cgroup gets assigned a low limit, but the cgroup could never dispatch
      enough IO to cross the low limit. In such case, the queue state machine
      will remain in LIMIT_LOW state and all other cgroups will be throttled
      according to low limit. This is unfair for other cgroups. We should
      treat the cgroup idle and upgrade the state machine to lower state.
      
      We also have a downgrade logic. If the state machine upgrades because of
      cgroup idle (real idle), the state machine will downgrade soon as the
      cgroup is below its low limit. This isn't what we want. A more
      complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
      when queue gets upgraded to lower state, other cgroups could dispatch
      more IO and this cgroup can't dispatch enough IO, so the cgroup is below
      its low limit and looks like idle (fake idle). In this case, the queue
      should downgrade soon. The key to determine if we should do downgrade is
      to detect if cgroup is truely idle.
      
      Unfortunately it's very hard to determine if a cgroup is real idle. This
      patch uses the 'think time check' idea from CFQ for the purpose. Please
      note, the idea doesn't work for all workloads. For example, a workload
      with io depth 8 has disk utilization 100%, hence think time is 0, eg,
      not idle. But the workload can run higher bandwidth with io depth 16.
      Compared to io depth 16, the io depth 8 workload is idle. We use the
      idea to roughly determine if a cgroup is idle.
      
      We treat a cgroup idle if its think time is above a threshold (by
      default 1ms for SSD and 100ms for HD). The idea is think time above the
      threshold will start to harm performance. HD is much slower so a longer
      think time is ok.
      
      The patch (and the latter patches) uses 'unsigned long' to track time.
      We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
      precision, should not a big deal.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      9e234eea
    • S
      blk-throttle: choose a small throtl_slice for SSD · d61fcfa4
      Shaohua Li 提交于
      The throtl_slice is 100ms by default. This is a long time for SSD, a lot
      of IO can run. To make cgroups have smoother throughput, we choose a
      small value (20ms) for SSD.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      d61fcfa4
    • S
      blk-throttle: make throtl_slice tunable · 297e3d85
      Shaohua Li 提交于
      throtl_slice is important for blk-throttling. It's called slice
      internally but it really is a time window blk-throttling samples data.
      blk-throttling will make decision based on the samplings. An example is
      bandwidth measurement. A cgroup's bandwidth is measured in the time
      interval of throtl_slice.
      
      A small throtl_slice meanse cgroups have smoother throughput but burn
      more CPUs. It has 100ms default value, which is not appropriate for all
      disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes
      it tunable.
      
      Since throtl_slice isn't a time slice, the sysfs name
      'throttle_sample_time' reflects its character better.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      297e3d85
  14. 09 2月, 2017 3 次提交
  15. 04 2月, 2017 1 次提交
    • J
      blk-merge: return the merged request · b973cb7e
      Jens Axboe 提交于
      When we attempt to merge request-to-request, we return a 0/1 if we
      ended up merging or not. Change that to return the pointer to the
      request that we freed. We will use this to move the freeing of
      that request out of the merge logic, so that callers can drop
      locks before freeing the request.
      
      There should be no functional changes in this patch.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      b973cb7e
  16. 03 2月, 2017 1 次提交
  17. 01 2月, 2017 1 次提交
  18. 18 1月, 2017 2 次提交
  19. 18 11月, 2016 1 次提交
    • J
      blk-mq: implement hybrid poll mode for sync O_DIRECT · 06426adf
      Jens Axboe 提交于
      This patch enables a hybrid polling mode. Instead of polling after IO
      submission, we can induce an artificial delay, and then poll after that.
      For example, if the IO is presumed to complete in 8 usecs from now, we
      can sleep for 4 usecs, wake up, and then do our polling. This still puts
      a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
      after the IO has completed, it'll happen before. With this hybrid
      scheme, we can achieve big latency reductions while still using the same
      (or less) amount of CPU.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Tested-By: NStephen Bates <sbates@raithlin.com>
      Reviewed-By: NStephen Bates <sbates@raithlin.com>
      06426adf
  20. 28 10月, 2016 1 次提交
  21. 15 9月, 2016 1 次提交
  22. 21 7月, 2016 1 次提交
  23. 23 12月, 2015 1 次提交
    • C
      block: defer timeouts to a workqueue · 287922eb
      Christoph Hellwig 提交于
      Timer context is not very useful for drivers to perform any meaningful abort
      action from.  So instead of calling the driver from this useless context
      defer it to a workqueue as soon as possible.
      
      Note that while a delayed_work item would seem the right thing here I didn't
      dare to use it due to the magic in blk_add_timer that pokes deep into timer
      internals.  But maybe this encourages Tejun to add a sensible API for that to
      the workqueue API and we'll all be fine in the end :)
      
      Contains a major update from Keith Bush:
      
      "This patch removes synchronizing the timeout work so that the timer can
       start a freeze on its own queue. The timer enters the queue, so timer
       context can only start a freeze, but not wait for frozen."
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Acked-by: NKeith Busch <keith.busch@intel.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      287922eb
  24. 20 11月, 2015 1 次提交
    • D
      block: protect rw_page against device teardown · 2e6edc95
      Dan Williams 提交于
      Fix use after free crashes like the following:
      
       general protection fault: 0000 [#1] SMP
       Call Trace:
        [<ffffffffa0050216>] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
        [<ffffffffa0050ba2>] pmem_rw_page+0x42/0x80 [nd_pmem]
        [<ffffffff8128fd90>] bdev_read_page+0x50/0x60
        [<ffffffff812972f0>] do_mpage_readpage+0x510/0x770
        [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
        [<ffffffff811d86dc>] ? lru_cache_add+0x1c/0x50
        [<ffffffff81297657>] mpage_readpages+0x107/0x170
        [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
        [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
        [<ffffffff8129058d>] blkdev_readpages+0x1d/0x20
        [<ffffffff811d615f>] __do_page_cache_readahead+0x28f/0x310
        [<ffffffff811d6039>] ? __do_page_cache_readahead+0x169/0x310
        [<ffffffff811c5abd>] ? pagecache_get_page+0x2d/0x1d0
        [<ffffffff811c76f6>] filemap_fault+0x396/0x530
        [<ffffffff811f816e>] __do_fault+0x4e/0xf0
        [<ffffffff811fce7d>] handle_mm_fault+0x11bd/0x1b50
      
      Cc: <stable@vger.kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMatthew Wilcox <willy@linux.intel.com>
      [willy: symmetry fixups]
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      2e6edc95
  25. 22 10月, 2015 3 次提交
    • J
      block: fix plug list flushing for nomerge queues · 0809e3ac
      Jeff Moyer 提交于
      Request queues with merging disabled will not flush the plug list after
      BLK_MAX_REQUEST_COUNT requests have been queued, since the code relies
      on blk_attempt_plug_merge to compute the request_count.  Fix this by
      computing the number of queued requests even for nomerge queues.
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      0809e3ac
    • D
      block: blk_flush_integrity() for bio-based drivers · 5a48fc14
      Dan Williams 提交于
      Since they lack requests to pin the request_queue active, synchronous
      bio-based drivers may have in-flight integrity work from
      bio_integrity_endio() that is not flushed by blk_freeze_queue().  Flush
      that work to prevent races to free the queue and the final usage of the
      blk_integrity profile.
      
      This is temporary unless/until bio-based drivers start to generically
      take a q_usage_counter reference while a bio is in-flight.
      
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case]
      Tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      5a48fc14
    • D
      block: generic request_queue reference counting · 3ef28e83
      Dan Williams 提交于
      Allow pmem, and other synchronous/bio-based block drivers, to fallback
      on a per-cpu reference count managed by the core for tracking queue
      live/dead state.
      
      The existing per-cpu reference count for the blk_mq case is promoted to
      be used in all block i/o scenarios.  This involves initializing it by
      default, waiting for it to drop to zero at exit, and holding a live
      reference over the invocation of q->make_request_fn() in
      generic_make_request().  The blk_mq code continues to take its own
      reference per blk_mq request and retains the ability to freeze the
      queue, but the check that the queue is frozen is moved to
      generic_make_request().
      
      This fixes crash signatures like the following:
      
       BUG: unable to handle kernel paging request at ffff880140000000
       [..]
       Call Trace:
        [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70
        [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
        [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem]
        [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0
        [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110
        [<ffffffff8141f966>] submit_bio+0x76/0x170
        [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160
        [<ffffffff81286e62>] submit_bh+0x12/0x20
        [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170
        [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90
        [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270
        [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30
        [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250
        [<ffffffff81303494>] ext4_put_super+0x64/0x360
        [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      3ef28e83
  26. 19 8月, 2015 1 次提交
    • T
      blkcg: consolidate blkg creation in blkcg_bio_issue_check() · ae118896
      Tejun Heo 提交于
      blkg (blkcg_gq) currently is created by blkcg policies invoking
      blkg_lookup_create() which ends up repeating about the same code in
      different policies.  Theoretically, this can avoid the overhead of
      looking and/or creating blkg's if blkcg is enabled but no policy is in
      use; however, the cost of blkg lookup / creation is very low
      especially if only the root blkcg is in use which is highly likely if
      no blkcg policy is in active use - it boils down to a single very
      predictable conditional and surrounding RCU protection.
      
      This patch consolidates blkg creation to a new function
      blkcg_bio_issue_check() which is called during bio issue from
      generic_make_request_checks().  blkcg_bio_issue_check() is now the
      only function which tries to create missing blkg's.  The subsequent
      policy and request_list operations just perform blkg_lookup() and if
      missing falls back to the root.
      
      * blk_get_rl() no longer tries to create blkg.  It uses blkg_lookup()
        instead of blkg_lookup_create().
      
      * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
        read locked and blkg already looked up.  Both throtl_lookup_tg() and
        throtl_lookup_create_tg() are dropped.
      
      * cfq is similarly updated.  cfq_lookup_create_cfqg() is replaced with
        cfq_lookup_cfqg()which uses blkg_lookup().
      
      This consolidates blkg handling and avoids unnecessary blkg creation
      retries under memory pressure.  In addition, this provides a common
      bio entry point into blkcg where things like common accounting can be
      performed.
      
      v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
          !CONFIG_BLK_DEV_THROTTLING.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      ae118896
  27. 15 8月, 2015 1 次提交
    • M
      blk-mq: fix race between timeout and freeing request · 0048b483
      Ming Lei 提交于
      Inside timeout handler, blk_mq_tag_to_rq() is called
      to retrieve the request from one tag. This way is obviously
      wrong because the request can be freed any time and some
      fiedds of the request can't be trusted, then kernel oops
      might be triggered[1].
      
      Currently wrt. blk_mq_tag_to_rq(), the only special case is
      that the flush request can share same tag with the request
      cloned from, and the two requests can't be active at the same
      time, so this patch fixes the above issue by updating tags->rqs[tag]
      with the active request(either flush rq or the request cloned
      from) of the tag.
      
      Also blk_mq_tag_to_rq() gets much simplified with this patch.
      
      Given blk_mq_tag_to_rq() is mainly for drivers and the caller must
      make sure the request can't be freed, so in bt_for_each() this
      helper is replaced with tags->rqs[tag].
      
      [1] kernel oops log
      [  439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158^M
      [  439.697162] IP: [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
      [  439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0 ^M
      [  439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
      [  439.700653] Dumping ftrace buffer:^M
      [  439.700653]    (ftrace buffer empty)^M
      [  439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
      [  439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265^M
      [  439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
      [  439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000^M
      [  439.730500] RIP: 0010:[<ffffffff812d89ba>]  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
      [  439.730500] RSP: 0018:ffff880819203da0  EFLAGS: 00010283^M
      [  439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002^M
      [  439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000^M
      [  439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000^M
      [  439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202^M
      [  439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00^M
      [  439.730500] FS:  00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000^M
      [  439.730500] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
      [  439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0^M
      [  439.730500] Stack:^M
      [  439.730500]  0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104^M
      [  439.755663]  ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80^M
      [  439.755663] Call Trace:^M
      [  439.755663]  <IRQ> ^M
      [  439.755663]  [<ffffffff812dc104>] bt_for_each+0x6e/0xc8^M
      [  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
      [  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
      [  439.755663]  [<ffffffff812dc1b3>] blk_mq_tag_busy_iter+0x55/0x5e^M
      [  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
      [  439.755663]  [<ffffffff812d8911>] blk_mq_rq_timer+0x5d/0xd4^M
      [  439.755663]  [<ffffffff810a3e10>] call_timer_fn+0xf7/0x284^M
      [  439.755663]  [<ffffffff810a3d1e>] ? call_timer_fn+0x5/0x284^M
      [  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
      [  439.755663]  [<ffffffff810a46d6>] run_timer_softirq+0x1ce/0x1f8^M
      [  439.755663]  [<ffffffff8104c367>] __do_softirq+0x181/0x3a4^M
      [  439.755663]  [<ffffffff8104c76e>] irq_exit+0x40/0x94^M
      [  439.755663]  [<ffffffff81031482>] smp_apic_timer_interrupt+0x33/0x3e^M
      [  439.755663]  [<ffffffff815559a4>] apic_timer_interrupt+0x84/0x90^M
      [  439.755663]  <EOI> ^M
      [  439.755663]  [<ffffffff81554350>] ? _raw_spin_unlock_irq+0x32/0x4a^M
      [  439.755663]  [<ffffffff8106a98b>] finish_task_switch+0xe0/0x163^M
      [  439.755663]  [<ffffffff8106a94d>] ? finish_task_switch+0xa2/0x163^M
      [  439.755663]  [<ffffffff81550066>] __schedule+0x469/0x6cd^M
      [  439.755663]  [<ffffffff8155039b>] schedule+0x82/0x9a^M
      [  439.789267]  [<ffffffff8119b28b>] signalfd_read+0x186/0x49a^M
      [  439.790911]  [<ffffffff8106d86a>] ? wake_up_q+0x47/0x47^M
      [  439.790911]  [<ffffffff811618c2>] __vfs_read+0x28/0x9f^M
      [  439.790911]  [<ffffffff8117a289>] ? __fget_light+0x4d/0x74^M
      [  439.790911]  [<ffffffff811620a7>] vfs_read+0x7a/0xc6^M
      [  439.790911]  [<ffffffff8116292b>] SyS_read+0x49/0x7f^M
      [  439.790911]  [<ffffffff81554c17>] entry_SYSCALL_64_fastpath+0x12/0x6f^M
      [  439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
      f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
      53 38 <48> 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
      ^M
      [  439.790911] RIP  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
      [  439.790911]  RSP <ffff880819203da0>^M
      [  439.790911] CR2: 0000000000000158^M
      [  439.790911] ---[ end trace d40af58949325661 ]---^M
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      0048b483
  28. 09 5月, 2015 1 次提交
    • S
      blk-mq: make plug work for mutiple disks and queues · 5b3f341f
      Shaohua Li 提交于
      Last patch makes plug work for multiple queue case. However it only
      works for single disk case, because it assumes only one request in the
      plug list. If a task is accessing multiple disks, eg MD/DM, the
      assumption is wrong. Let blk_attempt_plug_merge() record request from
      the same queue.
      
      V2: use NULL parameter in !mq case. Fix a bug. Add comments in
      blk_attempt_plug_merge to make it less (hopefully) confusion.
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      5b3f341f