1. 10 January 2018 (1 commit)
    • blk-mq: replace timeout synchronization with a RCU and generation based scheme · 1d9bd516
      Committed by Tejun Heo
      Currently, the blk-mq timeout path synchronizes against the usual
      issue/completion path using a complex scheme involving atomic
      bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
      rules.  Unfortunately, it contains quite a few holes.
      
      There's a complex dance around REQ_ATOM_STARTED and
      REQ_ATOM_COMPLETE between the issue/completion and timeout paths;
      however, they don't have a synchronization point across request
      recycle instances and it isn't clear what the barriers add.
      blk_mq_check_expired() can easily read STARTED from the N-2'th
      recycle instance, the deadline from the N-1'th, and run
      blk_mark_rq_complete() against the Nth instance.
      
      In fact, it's pretty easy to make blk_mq_check_expired() terminate a
      later instance of a request.  If we induce a 5 second delay before the
      time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
      2s, and issue back-to-back large IOs, blk-mq starts spuriously timing
      out requests pretty quickly.  Nothing actually timed out; the code
      just made the call on a recycled instance of a request and then
      terminated a later instance long after the original instance finished.
      The scenario isn't theoretical either.
      
      This patch replaces the broken synchronization mechanism with an RCU
      and generation number based one.
      
      1. Each request has a u64 generation + state value, which can be
         updated only by the request owner.  Whenever a request becomes
         in-flight, the generation number gets bumped up too.  This provides
         the basis for the timeout path to distinguish different recycle
         instances of the request.
      
         Also, marking a request in-flight and setting its deadline are
         protected with a seqcount so that the timeout path can fetch both
         values coherently (see the sketch after this list).
      
      2. The timeout path fetches the generation, state and deadline.  If
         the verdict is timeout, it records the generation into a dedicated
         request abortion field and then waits for an RCU grace period.
      
      3. The completion path is also protected by RCU (from the previous
         patch) and checks whether the current generation number and state
         match the abortion field.  If so, it skips completion.
      
      4. The timeout path, after RCU wait, scans requests again and
         terminates the ones whose generation and state still match the ones
         requested for abortion.
      
         By now, the timeout path knows that either the generation number
         and state changed (it lost the race) or the completion path will
         yield to it, so it can safely time out the request.
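      
      The following is a minimal sketch of the generation/seqcount pairing
      described in points 1 and 2.  The field and constant names here are
      illustrative (the patch itself uses rq->gstate and rq->gstate_seq),
      and state handling is reduced to a single in-flight state:
      
          #include <linux/jiffies.h>
          #include <linux/seqlock.h>
      
          #define RQ_STATE_BITS       2
          #define RQ_STATE_MASK       ((1ULL << RQ_STATE_BITS) - 1)
          #define RQ_GEN_INC          (1ULL << RQ_STATE_BITS)
          #define RQ_STATE_IN_FLIGHT  1ULL
      
          struct rq_example {
              seqcount_t    gstate_seq;  /* init with seqcount_init() */
              u64           gstate;      /* generation in high bits, state in low */
              unsigned long deadline;
          };
      
          /* Issue path (request owner only): start a new recycle instance
           * by bumping the generation and setting the deadline together. */
          static void rq_mark_in_flight(struct rq_example *rq,
                                        unsigned long timeout)
          {
              write_seqcount_begin(&rq->gstate_seq);
              rq->gstate = ((rq->gstate + RQ_GEN_INC) & ~RQ_STATE_MASK)
                           | RQ_STATE_IN_FLIGHT;
              rq->deadline = jiffies + timeout;
              write_seqcount_end(&rq->gstate_seq);
          }
      
          /* Timeout path: retry until gstate and deadline form one
           * coherent snapshot. */
          static void rq_read_gstate(struct rq_example *rq, u64 *gstate,
                                     unsigned long *deadline)
          {
              unsigned int start;
      
              do {
                  start = read_seqcount_begin(&rq->gstate_seq);
                  *gstate = rq->gstate;
                  *deadline = rq->deadline;
              } while (read_seqcount_retry(&rq->gstate_seq, start));
          }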
      
      While it's more lines of code, it's conceptually simpler, doesn't
      depend on direct use of subtle memory ordering or coherence, and
      hopefully doesn't terminate the wrong instance.
      
      While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
      between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
      removed yet as it's still used in other places.  Future patches will
      move all state tracking to the new mechanism and remove all bitops in
      the hot paths.
      
      Note that this patch adds a comment explaining a race condition in
      the BLK_EH_RESET_TIMER path.  The race has always been there and this
      patch doesn't change it; it's just documenting the existing race.
      
      v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
          - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
          - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
      
      v3: - Fixed possible extended seqcount / u64_stats_sync read looping
            spotted by Peter.
          - MQ_RQ_IDLE was incorrectly being set in complete_request instead
            of free_request.  Fixed.
      
      v4: - Rebased on top of hctx_lock() refactoring patch.
          - Added comment explaining the use of hctx_lock() in completion path.
      
      v5: - Added comments requested by Bart.
          - Note the addition of BLK_EH_RESET_TIMER race condition in the
            commit message.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 07 January 2018 (4 commits)
  3. 06 January 2018 (1 commit)
    • block: introduce zoned block devices zone write locking · 6cc77e9c
      Committed by Christoph Hellwig
      Components relying only on the request_queue structure for accessing
      block devices (e.g. I/O schedulers) have limited knowledge of the
      device characteristics. In particular, the device capacity cannot be
      easily discovered, which for a zoned block device also results in the
      inability to easily know the number of zones of the device (the zone
      size is indicated by the chunk_sectors field of the queue limits).
      
      Introduce the nr_zones field to the request_queue structure to simplify
      access to this information. Also, add the bitmap seq_zones_bitmap, which
      indicates which zones of the device are sequential zones (write
      preferred or write required) and the bitmap seq_zones_wlock which
      indicates if a zone is write locked, that is, if a write request
      targeting a zone was dispatched to the device. These fields are
      initialized by the low level block device driver (sd.c for ZBC/ZAC
      disks). They are not initialized by stacking drivers (device mappers)
      handling zoned block devices (e.g. dm-linear).
      
      Using this, I/O schedulers can implement zone write locking to control
      request dispatching to a zoned block device and avoid write request
      reordering, by allowing at most a single write request per zone outside
      of the scheduler at any time (see the sketch below).
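      
      For illustration, an I/O scheduler could take and test the zone write
      lock with plain bitmap operations along these lines.  The helper names
      are made up for this sketch; only nr_zones, seq_zones_bitmap and
      seq_zones_wlock come from this patch, and a power-of-two zone size
      (chunk_sectors) is assumed:
      
          #include <linux/bitops.h>
          #include <linux/blkdev.h>
          #include <linux/log2.h>
      
          /* Zone number of the zone containing @sector. */
          static inline unsigned int example_zone_no(struct request_queue *q,
                                                     sector_t sector)
          {
              return sector >> ilog2(q->limits.chunk_sectors);
          }
      
          /* Try to write-lock the zone containing @sector.  Conventional
           * zones (bit clear in seq_zones_bitmap) need no serialization. */
          static inline bool example_try_zone_write_lock(struct request_queue *q,
                                                         sector_t sector)
          {
              unsigned int zno = example_zone_no(q, sector);
      
              if (!test_bit(zno, q->seq_zones_bitmap))
                  return true;
              return !test_and_set_bit(zno, q->seq_zones_wlock);
          }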
      
      Based on previous patches from Damien Le Moal.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      [Damien]
      * Fixed comments and indentation in blkdev.h
      * Changed helper functions
      * Fixed this commit message
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 05 January 2018 (6 commits)
  5. 17 December 2017 (3 commits)
  6. 16 December 2017 (2 commits)
  7. 15 December 2017 (6 commits)
  8. 14 December 2017 (2 commits)
    • drm: rework delayed connector cleanup in connector_iter · ea497bb9
      Committed by Daniel Vetter
      PROBE_DEFER also uses system_wq to reprobe drivers, which means that
      when that reprobe fails again and we try to flush the overall
      system_wq (to get all the delayed connector cleanup work_structs
      completed), we deadlock.
      
      Fix this by using just a single cleanup work, so that we can only
      flush that one and don't block on anything else. That means a free
      list plus locking, a standard pattern (sketched below).
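      
      A hedged sketch of that free-list-plus-worker pattern, with invented
      names (the actual patch keeps its list and work item in struct
      drm_mode_config):
      
          #include <linux/llist.h>
          #include <linux/slab.h>
          #include <linux/workqueue.h>
      
          struct conn_example {
              struct llist_node free_node;
              /* ... the object being freed ... */
          };
      
          static LLIST_HEAD(pending_free);
      
          static void conn_free_work(struct work_struct *work)
          {
              struct llist_node *list = llist_del_all(&pending_free);
              struct conn_example *c, *tmp;
      
              llist_for_each_entry_safe(c, tmp, list, free_node)
                  kfree(c);  /* process context: sleeping/locking is fine */
          }
          static DECLARE_WORK(free_work, conn_free_work);
      
          /* Final-put path; may run in irq/atomic context, so it only does
           * a lockless llist_add() and queues the one dedicated work. */
          static void conn_put_deferred(struct conn_example *c)
          {
              llist_add(&c->free_node, &pending_free);
              schedule_work(&free_work);
          }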
      
      v2:
      - Correctly free connectors only on last ref. Oops (Chris).
      - use llist_head/node (Chris).
      
      v3:
      - Add init_llist_head (Chris).
      
      Fixes: a703c550 ("drm: safely free connectors from connector_iter")
      Fixes: 613051da ("drm: locking&new iterators for connector_list")
      Cc: Ben Widawsky <ben@bwidawsk.net>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Sean Paul <seanpaul@chromium.org>
      Cc: <stable@vger.kernel.org> # v4.11+: 613051da ("drm: locking&new iterators for connector_list")
      Cc: <stable@vger.kernel.org> # v4.11+
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Gustavo Padovan <gustavo@padovan.org>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Javier Martinez Canillas <javier@dowhile0.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Guillaume Tucker <guillaume.tucker@collabora.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Kevin Hilman <khilman@baylibre.com>
      Cc: Matt Hart <matthew.hart@linaro.org>
      Cc: Thierry Escande <thierry.escande@collabora.co.uk>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Enric Balletbo i Serra <enric.balletbo@collabora.com>
      Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20171213124936.17914-1-daniel.vetter@ffwll.ch
    • ipv4: igmp: guard against silly MTU values · b5476022
      Committed by Eric Dumazet
      The IPv4 stack reacts to a change to a too-small MTU by disabling
      itself under RTNL.
      
      But there is a window where threads not using RTNL can see a wrong
      device MTU. This can lead to surprises in the igmp code, where it is
      assumed the MTU is suitable.
      
      Fix this by reading the device MTU once and checking it against the
      IPv4 minimal MTU (see the sketch below).
      
      This patch adds the missing IPV4_MIN_MTU define, so that ETH_MIN_MTU
      is no longer abused.
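      
      A sketch of the read-once-and-validate pattern.  IPV4_MIN_MTU is the
      define this patch adds (68 bytes, the RFC 791 minimum); the helper
      name is invented for illustration:
      
          #include <linux/netdevice.h>
      
          #define IPV4_MIN_MTU 68  /* RFC 791 */
      
          /* Read dev->mtu exactly once so a concurrent (non-RTNL) change
           * cannot yield two different values within one computation. */
          static bool example_igmp_mtu(const struct net_device *dev,
                                       unsigned int *mtu)
          {
              unsigned int m = READ_ONCE(dev->mtu);
      
              if (m < IPV4_MIN_MTU)
                  return false;
              *mtu = m;
              return true;
          }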
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 13 December 2017 (1 commit)
    • drm: Update edid-derived drm_display_info fields at edid property set [v2] · 4b4df570
      Committed by Keith Packard
      There are a set of values in the drm_display_info structure for each
      connector which hold information derived from EDID. These are computed
      in drm_add_display_info. Before this patch, that was only called in
      drm_add_edid_modes. This meant that they were only set when EDID was
      present and never reset when EDID was not, as happened when the
      display was disconnected.
      
      One of these fields, non_desktop, is used from
      drm_mode_connector_update_edid_property, the function responsible for
      assigning the new edid value to the application-visible property.
      
      Various drivers call these two functions (drm_add_edid_modes and
      drm_mode_connector_update_edid_property) in different orders. This
      means that even when EDID is present, the drm_display_info fields may
      not have been computed at the time that
      drm_mode_connector_update_edid_property used the non_desktop value to
      set the non_desktop property.
      
      I've added a public function (drm_reset_display_info) that resets the
      drm_display_info field values to default values and then made the
      drm_add_display_info function public. These two functions are now
      called directly from drm_mode_connector_update_edid_property so that
      the drm_display_info fields are always computed from the current EDID
      information before being used in that function.
      
      This means that the drm_display_info values are often computed twice,
      once when the EDID property is set and a second time when EDID is used
      to compute modes for the device. The alternative would be to uniformly
      ensure that the values were computed once before being used, which
      would require that all drivers reliably invoke the two paths in the
      same order. The computation is inexpensive enough that it seems more
      maintainable in the long term to simply compute them in both paths.
      
      The API to drm_add_display_info has been changed so that it no longer
      takes the set of edid-based quirks as a parameter. Rather, it now
      computes those quirks itself and returns them for further use by
      drm_add_edid_modes.
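      
      In sketch form, the property-update path now looks roughly like the
      following.  The names and signatures reflect this commit message's
      description (drm_reset_display_info resetting to defaults,
      drm_add_display_info returning the quirks) and are otherwise
      unverified simplifications:
      
          /* Called with the connector's current EDID (possibly NULL). */
          static void example_update_edid_property(struct drm_connector *connector,
                                                   const struct edid *edid)
          {
              u32 quirks;
      
              /* Start from defaults so stale EDID-derived values do not
               * survive a disconnect. */
              drm_reset_display_info(connector);
              quirks = drm_add_display_info(connector, edid);
      
              /* ... set the EDID property; info fields such as non_desktop
               * are now current.  quirks would be handed back to the mode
               * parsing done by drm_add_edid_modes. */
              (void)quirks;
          }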
      
      This patch also includes a number of 'const' additions caused by
      drm_mode_connector_update_edid_property taking a 'const struct edid *'
      parameter and wanting to pass that along to drm_add_display_info.
      
      v2: after review by Daniel Vetter <daniel.vetter@ffwll.ch>
      
      	Removed EXPORT_SYMBOL_GPL for drm_reset_display_info and
      	drm_add_display_info.
      
      	Added FIXME in drm_mode_connector_update_edid_property about
      	potentially merging that with drm_add_edid_modes to avoid
      	the need for two driver calls.
      Signed-off-by: Keith Packard <keithp@keithp.com>
      Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/20171213084427.31199-1-keithp@keithp.com
      (danvet: cherry picked from commit 12a889bf4bca from drm-misc-next
      since functional conflict with changes in -next and we need to make
      sure both have the right version and nothing gets lost.)
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
  10. 12 December 2017 (4 commits)
    • compiler.h: Remove ACCESS_ONCE() · b899a850
      Committed by Mark Rutland
      There are no longer any kernelspace uses of ACCESS_ONCE(), so we can
      remove the definition from <linux/compiler.h>.
      
      This patch removes the ACCESS_ONCE() definition, and updates comments
      which referred to it. At the same time, some inconsistent and redundant
      whitespace is removed from comments.
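      
      For reference (a generic before/after, not a hunk from this patch),
      the long-standing replacements for the removed macro are READ_ONCE()
      and WRITE_ONCE():
      
          /* Before (now removed): */
          val = ACCESS_ONCE(shared);
          ACCESS_ONCE(shared) = val;
      
          /* After: */
          val = READ_ONCE(shared);
          WRITE_ONCE(shared, val);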
      Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: apw@canonical.com
      Link: http://lkml.kernel.org/r/20171127103824.36526-4-mark.rutland@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/lockdep: Remove the cross-release locking checks · e966eaee
      Committed by Ingo Molnar
      This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y),
      while it found a number of old bugs initially, was also producing too
      many false positives, which caused people to disable lockdep - arguably
      a worse overall outcome.
      
      If we disable cross-release by default but keep the code upstream then
      in practice the most likely outcome is that we'll allow the situation
      to degrade gradually, by allowing entropy to introduce more and more
      false positives, until it overwhelms maintenance capacity.
      
      Another bad side effect was that people were trying to work around
      the false positives by uglifying/complicating unrelated code. There's
      a marked difference between annotating locking operations and
      uglifying good code just due to bad lock debugging code ...
      
      This gradual decrease in quality happened to a number of debugging
      facilities in the kernel, and lockdep is pretty complex already,
      so we cannot risk this outcome.
      
      Either cross-release checking can be done right with no false positives,
      or it should not be included in the upstream kernel.
      
      ( Note that it might make sense to maintain it out of tree and go through
        the false positives every now and then and see whether new bugs were
        introduced. )
      
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/core: Remove break_lock field when CONFIG_GENERIC_LOCKBREAK=y · d89c7035
      Committed by Will Deacon
      When CONFIG_GENERIC_LOCKBREAK=y, locking structures grow an extra int ->break_lock
      field, which is used to implement raw_spin_is_contended() by setting the field
      to 1 when waiting on a lock and clearing it to zero when holding a lock.
      However, there are a few problems with this approach:
      
        - There is a write-write race between a CPU successfully taking the lock
          (and subsequently writing break_lock = 0) and a waiter waiting on
          the lock (and subsequently writing break_lock = 1). This could result
          in a contended lock being reported as uncontended and vice versa
          (illustrated in the sketch after this list).
      
        - On machines with store buffers, nothing guarantees that the writes
          to break_lock are visible to other CPUs at any particular time.
      
        - READ_ONCE/WRITE_ONCE are not used, so the field is potentially
          susceptible to harmful compiler optimisations.
      
      Consequently, the usefulness of this field is unclear and we'd be better off
      removing it and allowing architectures to implement raw_spin_is_contended() by
      providing a definition of arch_spin_is_contended(), as they can when
      CONFIG_GENERIC_LOCKBREAK=n.
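      
      To make the first problem concrete, here is a simplified illustration
      of the racy interleaving (a pseudo-timeline, not the actual kernel
      code):
      
          /*
           *   CPU0 (acquirer)                CPU1 (waiter)
           *   ---------------                -------------
           *   arch_spin_lock(&l->raw_lock)
           *                                  sees the lock held, spins:
           *   l->break_lock = 0;             l->break_lock = 1;
           *
           * The two plain stores race: whichever lands last wins, so a
           * third CPU calling raw_spin_is_contended() can read 0 while
           * CPU1 is still spinning, or read 1 long after CPU1 acquired
           * the lock.  Store buffering and the missing READ_ONCE/
           * WRITE_ONCE make the window even less predictable.
           */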
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1511894539-7988-3-git-send-email-will.deacon@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • fou: fix some member types in guehdr · 20080971
      Committed by Xin Long
      The guehdr struct is used to build or parse gue packets, which
      are always in big endian. It's better to define the multi-byte
      guehdr members as __beXX types.
      
      Also, in validate_gue_flags it's not good to use a __be32
      variable for both the standard flags (__be16) and the private flags
      (__be32), and then pass it to other functions.
      
      This patch could fix a bunch of sparse warnings from fou.
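      
      After the change the flag members carry fixed-endian types, roughly
      as below (a simplified rendering of struct guehdr from
      include/net/gue.h; the real header also handles the missing-bitfield
      case with #error):
      
          #include <linux/types.h>
          #include <asm/byteorder.h>
      
          struct guehdr_example {
              union {
                  struct {
          #if defined(__LITTLE_ENDIAN_BITFIELD)
                      __u8   hlen:5, control:1, version:2;
          #else
                      __u8   version:2, control:1, hlen:5;
          #endif
                      __u8   proto_ctype;
                      __be16 flags;   /* standard flags: big endian on the wire */
                  };
                  __be32 word;        /* the whole first 32-bit word */
              };
          };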
      
      Fixes: 5024c33a ("gue: Add infrastructure for flags and options")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 11 December 2017 (2 commits)
  12. 09 December 2017 (1 commit)
  13. 08 December 2017 (2 commits)
  14. 07 December 2017 (2 commits)
  15. 06 December 2017 (3 commits)
    • drm: safely free connectors from connector_iter · a703c550
      Committed by Daniel Vetter
      In
      
      commit 613051da
      Author: Daniel Vetter <daniel.vetter@ffwll.ch>
      Date:   Wed Dec 14 00:08:06 2016 +0100
      
          drm: locking&new iterators for connector_list
      
      we went to extreme lengths to make sure connector iteration works
      in any context, without introducing any additional locking context.
      This worked, except for a small fumble in the implementation:
      
      When we actually race with a concurrent connector unplug event, and
      our temporary connector reference turns out to be the final one, then
      everything breaks: We call the connector release function from
      whatever context we happen to be in, which can be an irq/atomic
      context. And connector freeing grabs all kinds of locks and stuff.
      
      Fix this by creating a specially safe put function for connector_iter,
      which (in this rare case) punts the cleanup to a worker.
      Reported-by: Ben Widawsky <ben@bwidawsk.net>
      Cc: Ben Widawsky <ben@bwidawsk.net>
      Fixes: 613051da ("drm: locking&new iterators for connector_list")
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Sean Paul <seanpaul@chromium.org>
      Cc: <stable@vger.kernel.org> # v4.11+
      Reviewed-by: Dave Airlie <airlied@gmail.com>
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20171204204818.24745-1-daniel.vetter@ffwll.ch
    • KVM: s390: mark irq_state.flags as non-usable · bb64da9a
      Committed by Christian Borntraeger
      Old kernels did not check for zero in the irq_state.flags field and old
      QEMUs did not zero the flag/reserved fields when calling
      KVM_S390_*_IRQ_STATE.  Let's add comments to prevent future uses of
      these fields.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: Thomas Huth <thuth@redhat.com>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
    • net: remove hlist_nulls_add_tail_rcu() · d7efc6c1
      Committed by Eric Dumazet
      Alexander Potapenko reported use of uninitialized memory [1]
      
      This happens when inserting a request socket into TCP ehash,
      in __sk_nulls_add_node_rcu(), since sk_reuseport is not initialized.
      
      The bug was added by commit d894ba18 ("soreuseport: fix ordering for
      mixed v4/v6 sockets").
      
      Note that d296ba60 ("soreuseport: Resolve merge conflict for v4/v6
      ordering fix") missed the opportunity to get rid of
      hlist_nulls_add_tail_rcu():
      
      Both UDP sockets and TCP/DCCP listeners no longer use
      __sk_nulls_add_node_rcu() for their hash insertion.
      
      Since all other sockets have a unique 4-tuple, the reuseport status
      has no special meaning, so we can always use hlist_nulls_add_head_rcu()
      for them and save a few cycles/instructions (see the sketch below).
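      
      After this change every nulls-hash insertion takes the head-insert
      path, i.e. something like the following illustrative wrapper (not the
      exact in-tree helper):
      
          #include <linux/rculist_nulls.h>
          #include <net/sock.h>
      
          static inline void example_sk_nulls_add(struct sock *sk,
                                                  struct hlist_nulls_head *list)
          {
              /* Head insertion is always correct now that reuseport
               * ordering no longer matters for these hash tables. */
              hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
          }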
      
      [1]
      
      ==================================================================
      BUG: KMSAN: use of uninitialized memory in inet_ehash_insert+0xd40/0x1050
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #3288
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:16
       dump_stack+0x185/0x1d0 lib/dump_stack.c:52
       kmsan_report+0x13f/0x1c0 mm/kmsan/kmsan.c:1016
       __msan_warning_32+0x69/0xb0 mm/kmsan/kmsan_instr.c:766
       __sk_nulls_add_node_rcu ./include/net/sock.h:684
       inet_ehash_insert+0xd40/0x1050 net/ipv4/inet_hashtables.c:413
       reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:754
       inet_csk_reqsk_queue_hash_add+0x1cc/0x300 net/ipv4/inet_connection_sock.c:765
       tcp_conn_request+0x31e7/0x36f0 net/ipv4/tcp_input.c:6414
       tcp_v4_conn_request+0x16d/0x220 net/ipv4/tcp_ipv4.c:1314
       tcp_rcv_state_process+0x42a/0x7210 net/ipv4/tcp_input.c:5917
       tcp_v4_do_rcv+0xa6a/0xcd0 net/ipv4/tcp_ipv4.c:1483
       tcp_v4_rcv+0x3de0/0x4ab0 net/ipv4/tcp_ipv4.c:1763
       ip_local_deliver_finish+0x6bb/0xcb0 net/ipv4/ip_input.c:216
       NF_HOOK ./include/linux/netfilter.h:248
       ip_local_deliver+0x3fa/0x480 net/ipv4/ip_input.c:257
       dst_input ./include/net/dst.h:477
       ip_rcv_finish+0x6fb/0x1540 net/ipv4/ip_input.c:397
       NF_HOOK ./include/linux/netfilter.h:248
       ip_rcv+0x10f6/0x15c0 net/ipv4/ip_input.c:488
       __netif_receive_skb_core+0x36f6/0x3f60 net/core/dev.c:4298
       __netif_receive_skb net/core/dev.c:4336
       netif_receive_skb_internal+0x63c/0x19c0 net/core/dev.c:4497
       napi_skb_finish net/core/dev.c:4858
       napi_gro_receive+0x629/0xa50 net/core/dev.c:4889
       e1000_receive_skb drivers/net/ethernet/intel/e1000/e1000_main.c:4018
       e1000_clean_rx_irq+0x1492/0x1d30
      drivers/net/ethernet/intel/e1000/e1000_main.c:4474
       e1000_clean+0x43aa/0x5970 drivers/net/ethernet/intel/e1000/e1000_main.c:3819
       napi_poll net/core/dev.c:5500
       net_rx_action+0x73c/0x1820 net/core/dev.c:5566
       __do_softirq+0x4b4/0x8dd kernel/softirq.c:284
       invoke_softirq kernel/softirq.c:364
       irq_exit+0x203/0x240 kernel/softirq.c:405
       exiting_irq+0xe/0x10 ./arch/x86/include/asm/apic.h:638
       do_IRQ+0x15e/0x1a0 arch/x86/kernel/irq.c:263
       common_interrupt+0x86/0x86
      
      Fixes: d894ba18 ("soreuseport: fix ordering for mixed v4/v6 sockets")
      Fixes: d296ba60 ("soreuseport: Resolve merge conflict for v4/v6 ordering fix")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Alexander Potapenko <glider@google.com>
      Acked-by: Craig Gallek <kraig@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>