1. 20 Jan, 2018 2 commits
    • block: Remove kblockd_schedule_delayed_work{,_on}() · f5ced52a
      Bart Van Assche committed
      The previous patch removed all users of these two functions. Hence
      also remove the functions themselves.
      Reviewed-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • lib/scatterlist: Fix chaining support in sgl_alloc_order() · 8c7a8d1c
      Bart Van Assche committed
      This patch prevents workloads with large block sizes (megabytes)
      from triggering the following call stack with the ib_srpt driver
      (the only driver that chains scatterlists allocated by
      sgl_alloc_order()):
      
      BUG: Bad page state in process kworker/0:1H  pfn:2423a78
      page:fffffb03d08e9e00 count:-3 mapcount:0 mapping:          (null) index:0x0
      flags: 0x57ffffc0000000()
      raw: 0057ffffc0000000 0000000000000000 0000000000000000 fffffffdffffffff
      raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
      page dumped because: nonzero _count
      CPU: 0 PID: 733 Comm: kworker/0:1H Tainted: G          I      4.15.0-rc7.bart+ #1
      Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
      Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
      Call Trace:
       dump_stack+0x5c/0x83
       bad_page+0xf5/0x10f
       get_page_from_freelist+0xa46/0x11b0
       __alloc_pages_nodemask+0x103/0x290
       sgl_alloc_order+0x101/0x180
       target_alloc_sgl+0x2c/0x40 [target_core_mod]
       srpt_alloc_rw_ctxs+0x173/0x2d0 [ib_srpt]
       srpt_handle_new_iu+0x61e/0x7f0 [ib_srpt]
       __ib_process_cq+0x55/0xa0 [ib_core]
       ib_cq_poll_work+0x1b/0x60 [ib_core]
       process_one_work+0x141/0x340
       worker_thread+0x47/0x3e0
       kthread+0xf5/0x130
       ret_from_fork+0x1f/0x30
      
      Fixes: e80a0af4 ("lib/scatterlist: Introduce sgl_alloc() and sgl_free()")
      Reported-by: Laurence Oberman <loberman@redhat.com>
      Tested-by: Laurence Oberman <loberman@redhat.com>
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Cc: Laurence Oberman <loberman@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 19 Jan, 2018 1 commit
  3. 18 Jan, 2018 1 commit
  4. 16 Jan, 2018 1 commit
    • blkcg: simplify statistic accumulation code · ddc21231
      Arnd Bergmann committed
      Some older compilers (gcc-4.4 through 4.6 in particular) struggle
      with the way that blkg_rwstat_read() returns a structure, leading
      to excessive stack usage and rather inefficient code:
      
      block/blk-cgroup.c: In function 'blkg_destroy':
      block/blk-cgroup.c:354:1: error: the frame size of 1296 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
      block/cfq-iosched.c: In function 'cfqg_stats_add_aux':
      block/cfq-iosched.c:753:1: error: the frame size of 1928 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
      block/bfq-cgroup.c: In function 'bfqg_stats_add_aux':
      block/bfq-cgroup.c:299:1: error: the frame size of 1928 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
      
      I also notice that there is no point in using atomic accesses
      for the local variables, so storing the temporaries in simple 'u64'
      variables not only avoids the stack usage on older compilers but
      also improves the object code on modern versions.
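
      The idea of accumulating into plain u64 temporaries can be sketched in
      userspace as follows. This is a minimal illustration, not the patch
      itself: struct rwstat, RWSTAT_NR and rwstat_add_aux() are hypothetical
      stand-ins, per-CPU counters are modelled as a plain array, and C11
      atomics replace the kernel's atomic64_t helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's read/write statistics structure. */
enum { RWSTAT_NR = 4 };

struct rwstat {
	uint64_t cpu_cnt[RWSTAT_NR];          /* models the per-CPU counters */
	_Atomic uint64_t aux_cnt[RWSTAT_NR];  /* shared, updated atomically */
};

/*
 * Fold the counters of 'from' into the auxiliary counters of 'to'.
 * The temporaries live in a small local array of plain integers, so no
 * structure is returned by value and no atomic type is used for locals.
 */
static void rwstat_add_aux(struct rwstat *to, struct rwstat *from)
{
	uint64_t sum[RWSTAT_NR];
	int i;

	assert(to && from);

	for (i = 0; i < RWSTAT_NR; i++)
		sum[i] = from->cpu_cnt[i] + atomic_load(&from->aux_cnt[i]);

	/* Only the shared destination needs an atomic update. */
	for (i = 0; i < RWSTAT_NR; i++)
		atomic_fetch_add(&to->aux_cnt[i], sum[i]);
}
```

      The local array costs RWSTAT_NR * 8 bytes of stack, independent of how
      large the full statistics structure is.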
      
      Fixes: e6269c44 ("blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it")
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 15 Jan, 2018 1 commit
    • block: allow gendisk's request_queue registration to be deferred · fa70d2e2
      Mike Snitzer committed
      For as long as I can remember, DM has forced the block layer to allow
      the allocation and initialization of the request_queue to be distinct
      operations.  The reason is that block/genhd.c:add_disk() requires
      that the request_queue (and associated bdi) be tied to the gendisk
      before add_disk() is called -- because add_disk() also deals with
      exposing the request_queue via blk_register_queue().
      
      DM's dynamic creation of arbitrary device types (and associated
      request_queue types) requires the DM device's gendisk be available so
      that DM table loads can establish a master/slave relationship with
      subordinate devices that are referenced by loaded DM tables -- using
      bd_link_disk_holder().  But until these DM tables, and their associated
      subordinate devices, are known DM cannot know what type of request_queue
      it needs -- nor what its queue_limits should be.
      
      This chicken and egg scenario has created all manner of problems for DM
      and, at times, the block layer.
      
      Summary of changes:
      
      - Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
        that drivers may use to add a disk without also calling
        blk_register_queue().  Driver must call blk_register_queue() once its
        request_queue is fully initialized.
      
      - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
        is not set.  It won't be set if driver used add_disk_no_queue_reg()
        but driver encounters an error and must del_gendisk() before calling
        blk_register_queue().
      
      - Export blk_register_queue().
      
      These changes allow DM to use add_disk_no_queue_reg() to anchor its
      gendisk as the "master" for master/slave relationships DM must establish
      with subordinate devices referenced in DM tables that get loaded.  Once
      all "slave" devices for a DM device are known its request_queue can be
      properly initialized and then advertised via sysfs -- important
      improvement being that no request_queue resource initialization
      performed by blk_register_queue() is missed for DM devices anymore.
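
      The ordering described above (add the disk, learn the queue type,
      register the queue, with a safe teardown at any point) can be modelled
      in a few lines of userspace C. This is a toy model only: every
      model_* name is hypothetical and none of it is the kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model; every name below is hypothetical. */
struct model_queue {
	bool initialized;  /* queue type and limits are known */
	bool registered;   /* stand-in for QUEUE_FLAG_REGISTERED */
};

struct model_disk {
	struct model_queue *queue;
	bool added;
};

/* Add the disk without exposing its queue yet. */
static void model_add_disk_no_queue_reg(struct model_disk *disk)
{
	disk->added = true;
}

/* Expose the queue; only valid once it is fully initialized. */
static void model_register_queue(struct model_disk *disk)
{
	assert(disk->added && disk->queue->initialized);
	disk->queue->registered = true;
}

/* Returns false when there was nothing to undo (the early-return case). */
static bool model_unregister_queue(struct model_disk *disk)
{
	if (!disk->queue->registered)
		return false;
	disk->queue->registered = false;
	return true;
}

/* Tear the disk down; safe whether or not the queue was ever registered. */
static bool model_del_disk(struct model_disk *disk)
{
	bool was_registered = model_unregister_queue(disk);

	disk->added = false;
	return was_registered;
}
```

      The early return in model_unregister_queue() is what lets an error
      path delete a disk whose queue was never registered.
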
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 11 Jan, 2018 5 commits
  7. 10 Jan, 2018 3 commits
    • blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcu · 05707b64
      Tejun Heo committed
      The RCU protection has been expanded to cover both queueing and
      completion paths making ->queue_rq_srcu a misnomer.  Rename it to
      ->srcu as suggested by Bart.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq · 634f9e46
      Tejun Heo committed
      After the recent updates to use generation number and state based
      synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
      to avoid firing the same timeout multiple times.
      
      Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
      RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
      times.  This removes atomic bitops from hot paths too.
      
      v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().
      
      v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: replace timeout synchronization with a RCU and generation based scheme · 1d9bd516
      Tejun Heo committed
      Currently, blk-mq timeout path synchronizes against the usual
      issue/completion path using a complex scheme involving atomic
      bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
      rules.  Unfortunately, it contains quite a few holes.
      
      There's a complex dance around REQ_ATOM_STARTED and
      REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
      they don't have a synchronization point across request recycle
      instances and it isn't clear what the barriers add.
      blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
      deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
      
      In fact, it's pretty easy to make blk_mq_check_expired() terminate a
      later instance of a request.  If we induce 5 sec delay before
      time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
      2s, and issue back-to-back large IOs, blk-mq starts timing out
      requests spuriously pretty quickly.  Nothing actually timed out.  It
      just made the call on a recycled instance of a request and then
      terminated a later instance long after the original instance finished.
      The scenario isn't theoretical either.
      
      This patch replaces the broken synchronization mechanism with an RCU
      and generation number based one.
      
      1. Each request has a u64 generation + state value, which can be
         updated only by the request owner.  Whenever a request becomes
         in-flight, the generation number gets bumped up too.  This provides
         the basis for the timeout path to distinguish different recycle
         instances of the request.
      
         Also, marking a request in-flight and setting its deadline are
         protected with a seqcount so that the timeout path can fetch both
         values coherently.
      
      2. The timeout path fetches the generation, state and deadline.  If
         the verdict is timeout, it records the generation into a dedicated
         request abortion field and does RCU wait.
      
      3. The completion path is also protected by RCU (from the previous
         patch) and checks whether the current generation number and state
         match the abortion field.  If so, it skips completion.
      
      4. The timeout path, after RCU wait, scans requests again and
         terminates the ones whose generation and state still match the ones
         requested for abortion.
      
         By now, the timeout path knows that either the generation number
         and state changed if it lost the race or the completion will yield
         to it and can safely timeout the request.
      
      While it's more lines of code, it's conceptually simpler, doesn't
      depend on direct use of subtle memory ordering or coherence, and
      hopefully doesn't terminate the wrong instance.
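
      The four steps above can be sketched as a single-threaded userspace
      model. It is an illustration under stated assumptions, not the kernel
      code: the struct layout, the two-bit state encoding and all function
      names are hypothetical, and the seqcount and RCU waits that make the
      real scheme safe across CPUs are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low bits hold the state, the remaining bits hold the generation. */
enum { ST_IDLE = 0, ST_IN_FLIGHT = 1, ST_COMPLETE = 2 };
#define STATE_BITS 2
#define STATE_MASK ((UINT64_C(1) << STATE_BITS) - 1)

struct request {
	uint64_t gstate;          /* generation + state, written by the owner */
	uint64_t aborted_gstate;  /* instance the timeout path wants to abort */
};

static void rq_init(struct request *rq)
{
	rq->gstate = ST_IDLE;
	/* State bits of all ones are not a valid state, so nothing matches. */
	rq->aborted_gstate = UINT64_MAX;
}

/* Step 1: becoming in-flight bumps the generation. */
static void rq_start(struct request *rq)
{
	uint64_t gen = (rq->gstate >> STATE_BITS) + 1;

	rq->gstate = (gen << STATE_BITS) | ST_IN_FLIGHT;
}

/* Step 3: completion yields if the timeout path claimed this instance. */
static bool rq_complete(struct request *rq)
{
	if (rq->gstate == rq->aborted_gstate)
		return false;
	rq->gstate = (rq->gstate & ~STATE_MASK) | ST_COMPLETE;
	return true;
}

/* The request goes back to the pool; the generation is preserved. */
static void rq_free(struct request *rq)
{
	rq->gstate = (rq->gstate & ~STATE_MASK) | ST_IDLE;
}

/* Step 2: record the generation + state that looked expired. */
static void timeout_mark(struct request *rq, uint64_t seen_gstate)
{
	assert((seen_gstate & STATE_MASK) == ST_IN_FLIGHT);
	rq->aborted_gstate = seen_gstate;
}

/* Step 4: terminate only if the very same instance is still in flight. */
static bool timeout_should_terminate(const struct request *rq)
{
	return rq->gstate == rq->aborted_gstate;
}
```

      A verdict made on a stale instance no longer matches once the request
      has been recycled, because rq_start() has moved the generation on.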
      
      While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
      between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
      removed yet as it's still used in other places.  Future patches will
      move all state tracking to the new mechanism and remove all bitops in
      the hot paths.
      
      Note that this patch adds a comment explaining a race condition in
      BLK_EH_RESET_TIMER path.  The race has always been there and this
      patch doesn't change it.  It's just documenting the existing race.
      
      v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
          - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
          - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
      
      v3: - Fixed possible extended seqcount / u64_stats_sync read looping
            spotted by Peter.
          - MQ_RQ_IDLE was incorrectly being set in complete_request instead
            of free_request.  Fixed.
      
      v4: - Rebased on top of hctx_lock() refactoring patch.
          - Added comment explaining the use of hctx_lock() in completion path.
      
      v5: - Added comments requested by Bart.
          - Note the addition of BLK_EH_RESET_TIMER race condition in the
            commit message.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 07 Jan, 2018 4 commits
  9. 06 Jan, 2018 1 commit
    • block: introduce zoned block devices zone write locking · 6cc77e9c
      Christoph Hellwig committed
      Components relying only on the request_queue structure for accessing
      block devices (e.g. I/O schedulers) have limited knowledge of the
      device characteristics. In particular, the device capacity cannot be
      easily discovered, which for a zoned block device also results in the
      inability to easily know the number of zones of the device (the zone
      size is indicated by the chunk_sectors field of the queue limits).
      
      Introduce the nr_zones field to the request_queue structure to simplify
      access to this information. Also, add the bitmap seq_zone_bitmap which
      indicates which zones of the device are sequential zones (write
      preferred or write required) and the bitmap seq_zones_wlock which
      indicates if a zone is write locked, that is, if a write request
      targeting a zone was dispatched to the device. These fields are
      initialized by the low level block device driver (sd.c for ZBC/ZAC
      disks). They are not initialized by stacking drivers (device mappers)
      handling zoned block devices (e.g. dm-linear).
      
      Using this, I/O schedulers can introduce zone write locking to control
      request dispatching to a zoned block device and avoid write request
      reordering by limiting to at most a single write request per zone
      outside of the scheduler at any time.
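
      How a scheduler might use the two bitmaps can be sketched in userspace
      C. The field names follow the commit message, but the struct, the
      fixed MAX_ZONES bound and the helper functions are hypothetical, and
      the bit operations here are not atomic, unlike what concurrent kernel
      code would need.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MAX_ZONES     256  /* hypothetical fixed upper bound for the sketch */
#define BITMAP_WORDS  ((MAX_ZONES + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct zoned_queue {
	uint64_t chunk_sectors;                       /* zone size in sectors */
	unsigned int nr_zones;
	unsigned long seq_zone_bitmap[BITMAP_WORDS];  /* 1 = sequential zone */
	unsigned long seq_zones_wlock[BITMAP_WORDS];  /* 1 = write locked */
};

static bool bitmap_test(const unsigned long *map, unsigned int nr)
{
	return (map[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}

static void bitmap_set(unsigned long *map, unsigned int nr)
{
	map[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

static void bitmap_clear(unsigned long *map, unsigned int nr)
{
	map[nr / BITS_PER_WORD] &= ~(1UL << (nr % BITS_PER_WORD));
}

/* The zone number follows from the sector and the zone size. */
static unsigned int zone_no(const struct zoned_queue *q, uint64_t sector)
{
	return (unsigned int)(sector / q->chunk_sectors);
}

/* Returns true if a write to 'sector' may be dispatched now. */
static bool zone_write_trylock(struct zoned_queue *q, uint64_t sector)
{
	unsigned int z = zone_no(q, sector);

	assert(z < q->nr_zones);
	if (!bitmap_test(q->seq_zone_bitmap, z))
		return true;   /* conventional zone: no ordering constraint */
	if (bitmap_test(q->seq_zones_wlock, z))
		return false;  /* another write to this zone is outstanding */
	bitmap_set(q->seq_zones_wlock, z);
	return true;
}

/* Called when the write that holds the zone lock completes. */
static void zone_write_unlock(struct zoned_queue *q, uint64_t sector)
{
	bitmap_clear(q->seq_zones_wlock, zone_no(q, sector));
}
```

      With at most one write per sequential zone outside the scheduler, the
      device cannot observe two writes to the same zone out of order.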
      
      Based on previous patches from Damien Le Moal.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      [Damien]
      * Fixed comments and indentation in blkdev.h
      * Changed helper functions
      * Fixed this commit message
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 05 Jan, 2018 6 commits
  11. 17 Dec, 2017 3 commits
  12. 15 Dec, 2017 6 commits
  13. 12 Dec, 2017 3 commits
    • compiler.h: Remove ACCESS_ONCE() · b899a850
      Mark Rutland committed
      There are no longer any kernelspace uses of ACCESS_ONCE(), so we can
      remove the definition from <linux/compiler.h>.
      
      This patch removes the ACCESS_ONCE() definition, and updates comments
      which referred to it. At the same time, some inconsistent and redundant
      whitespace is removed from comments.
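
      For context, kernel code uses READ_ONCE() and WRITE_ONCE() where
      ACCESS_ONCE() used to appear. The sketch below shows that usage
      pattern with simplified userspace stand-ins: the macro bodies are
      assumptions that only handle scalar types and rely on the GCC/Clang
      __typeof__ extension, and the variables and functions are
      hypothetical.

```c
#include <assert.h>

/*
 * Simplified stand-ins for scalar types only. They force a single
 * volatile access and imply no memory ordering between CPUs.
 */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

static int ready;    /* hypothetical flag shared with another context */
static int payload;  /* hypothetical data published along with the flag */

/* Writer side: the flag store cannot be split or dropped by the compiler. */
static void publish(int value)
{
	payload = value;
	WRITE_ONCE(ready, 1);
}

/* Reader side: each call performs exactly one load of the flag. */
static int try_consume(int *out)
{
	assert(out);
	if (!READ_ONCE(ready))
		return 0;
	*out = payload;
	return 1;
}
```

      Splitting the old macro into a read form and a write form makes the
      direction of each access explicit at the call site.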
      Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: apw@canonical.com
      Link: http://lkml.kernel.org/r/20171127103824.36526-4-mark.rutland@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/lockdep: Remove the cross-release locking checks · e966eaee
      Ingo Molnar committed
      This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y),
      while it found a number of old bugs initially, was also causing too many
      false positives that caused people to disable lockdep - which is arguably
      a worse overall outcome.
      
      If we disable cross-release by default but keep the code upstream then
      in practice the most likely outcome is that we'll allow the situation
      to degrade gradually, by allowing entropy to introduce more and more
      false positives, until it overwhelms maintenance capacity.
      
      Another bad side effect was that people were trying to work around
      the false positives by uglifying/complicating unrelated code. There's
      a marked difference between annotating locking operations and
      uglifying good code just due to bad lock debugging code ...
      
      This gradual decrease in quality happened to a number of debugging
      facilities in the kernel, and lockdep is pretty complex already,
      so we cannot risk this outcome.
      
      Either cross-release checking can be done right with no false positives,
      or it should not be included in the upstream kernel.
      
      ( Note that it might make sense to maintain it out of tree and go through
        the false positives every now and then and see whether new bugs were
        introduced. )
      
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/core: Remove break_lock field when CONFIG_GENERIC_LOCKBREAK=y · d89c7035
      Will Deacon committed
      When CONFIG_GENERIC_LOCKBREAK=y, locking structures grow an extra int ->break_lock
      field which is used to implement raw_spin_is_contended() by setting the field
      to 1 when waiting on a lock and clearing it to zero when holding a lock.
      However, there are a few problems with this approach:
      
        - There is a write-write race between a CPU successfully taking the lock
          (and subsequently writing break_lock = 0) and a waiter waiting on
          the lock (and subsequently writing break_lock = 1). This could result
          in a contended lock being reported as uncontended and vice-versa.
      
        - On machines with store buffers, nothing guarantees that the writes
          to break_lock are visible to other CPUs at any particular time.
      
        - READ_ONCE/WRITE_ONCE are not used, so the field is potentially
          susceptible to harmful compiler optimisations.
      
      Consequently, the usefulness of this field is unclear and we'd be better off
      removing it and allowing architectures to implement raw_spin_is_contended() by
      providing a definition of arch_spin_is_contended(), as they can when
      CONFIG_GENERIC_LOCKBREAK=n.
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1511894539-7988-3-git-send-email-will.deacon@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  14. 11 Dec, 2017 2 commits
  15. 09 Dec, 2017 1 commit