1. 10 Aug, 2017 34 commits
    • B
      locking/lockdep: Handle non(or multi)-acquisition of a crosslock · 28a903f6
      Byungchul Park committed
      It is possible that no acquisition is in progress when a crosslock is
      committed. Completion operations with crossrelease enabled are one such case:
      
         CONTEXT X                         CONTEXT Y
         ---------                         ---------
         trigger completion context
                                           complete AX
                                              commit AX
         wait_for_complete AX
            acquire AX
            wait
      
         where AX is a crosslock.
      
      When no acquisition is in progress, we should not perform commit because
      the lock does not exist, which might cause incorrect memory access. So
      we have to track the number of acquisitions of a crosslock to handle it.
      
      Moreover, more than one acquisition of a crosslock can overlap,
      like:
      
         CONTEXT W        CONTEXT X        CONTEXT Y        CONTEXT Z
         ---------        ---------        ---------        ---------
         acquire AX (gen_id: 1)
                                           acquire A
                          acquire AX (gen_id: 10)
                                           acquire B
                                           commit AX
                                                            acquire C
                                                            commit AX
      
         where A, B and C are typical locks and AX is a crosslock.
      
      Current crossrelease code performs commits in Y and Z with gen_id = 10.
      However, we can use gen_id = 1 to do it, since not only 'acquire AX in X'
      but 'acquire AX in W' also depends on each acquisition in Y and Z until
      their commits. So make it use gen_id = 1 instead of 10 on their commits,
      which adds an additional dependency 'AX -> A' in the example above.
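
      The two rules described above (skip the commit when no acquisition is
      in progress, and commit against the earliest overlapping gen_id) can be
      sketched as a small user-space model. The struct and function names
      below are hypothetical and are not taken from the lockdep code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a crosslock; field names are illustrative only. */
struct cross_lock_model {
    int nr_acquire;        /* acquisitions currently in progress */
    unsigned int gen_id;   /* gen_id of the earliest overlapping acquisition */
};

/* Record an acquisition; only the first of an overlapping set fixes gen_id. */
static void model_acquire(struct cross_lock_model *xl, unsigned int cur_gen_id)
{
    if (xl->nr_acquire++ == 0)
        xl->gen_id = cur_gen_id;
}

/*
 * Decide whether a commit may proceed. Returns false when no acquisition is
 * in progress; otherwise stores the gen_id to commit against and returns true.
 */
static bool model_commit_gen_id(const struct cross_lock_model *xl,
                                unsigned int *gen_id)
{
    if (xl->nr_acquire == 0)
        return false;
    *gen_id = xl->gen_id;
    return true;
}

/* Drop one in-progress acquisition. */
static void model_release(struct cross_lock_model *xl)
{
    assert(xl->nr_acquire > 0);
    xl->nr_acquire--;
}
```

      With acquisitions at gen_id 1 and then 10, as in the W/X example, the
      model commits against gen_id 1, and it refuses to commit before any
      acquisition has been recorded.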
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Cc: willy@infradead.org
      Link: http://lkml.kernel.org/r/1502089981-21272-8-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • B
      locking/lockdep: Detect and handle hist_lock ring buffer overwrite · 23f873d8
      Byungchul Park committed
      The ring buffer can be overwritten by hardirq/softirq/work contexts.
      Those cases must be considered on rollback or commit. For example,
      
                |<------ hist_lock ring buffer size ----->|
                ppppppppppppiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
      wrapped > iiiiiiiiiiiiiiiiiiiiiii....................
      
                where 'p' represents an acquisition in process context,
                'i' represents an acquisition in irq context.
      
      On irq exit, crossrelease tries to roll idx back to its original position,
      but it should not, because the entry has already been invalidated by the
      overwriting 'i' entries. Avoid rollback or commit for overwritten entries.
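
      One way to detect such an overwrite is to tag every entry with a
      monotonically increasing id and compare the tag on context exit. The
      sketch below is a user-space model under that assumption; the names
      (hist_ring, ring_enter, ring_exit) are hypothetical and the actual
      lockdep implementation may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Must be a power of two so unsigned index arithmetic wraps consistently. */
#define RING_SIZE 8u

struct hist_entry {
    unsigned int id;           /* unique id stamped when the slot is written */
};

struct hist_ring {
    struct hist_entry e[RING_SIZE];
    unsigned int idx;          /* next slot to write (free-running counter) */
    unsigned int next_id;      /* last id handed out */
};

/* What a nested context remembers on entry. */
struct hist_snapshot {
    unsigned int idx;          /* write position at entry */
    unsigned int last_id;      /* id of the entry just before that position */
};

/* Record one acquisition in the ring. */
static void ring_add(struct hist_ring *r)
{
    r->e[r->idx % RING_SIZE].id = ++r->next_id;
    r->idx++;
}

/* Called on entry to a nested (e.g. irq) context. */
static struct hist_snapshot ring_enter(const struct hist_ring *r)
{
    struct hist_snapshot s = { r->idx, r->e[(r->idx - 1) % RING_SIZE].id };
    return s;
}

/*
 * Called on exit from the nested context. Rolls idx back only if the entry
 * preceding the saved position was not overwritten in the meantime.
 */
static bool ring_exit(struct hist_ring *r, struct hist_snapshot s)
{
    if (r->e[(s.idx - 1) % RING_SIZE].id != s.last_id)
        return false;          /* wrapped and overwritten: skip the rollback */
    r->idx = s.idx;
    return true;
}
```

      In this model a nested context that adds fewer than RING_SIZE entries is
      rolled back normally, while one that wraps the whole ring is detected
      and left alone.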
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Cc: willy@infradead.org
      Link: http://lkml.kernel.org/r/1502089981-21272-7-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • B
      locking/lockdep: Implement the 'crossrelease' feature · b09be676
      Byungchul Park committed
      Lockdep is a runtime locking correctness validator that detects and
      reports a deadlock or its possibility by checking dependencies between
      locks. It's useful since it does not report just an actual deadlock but
      also the possibility of a deadlock that has not actually happened yet.
      That enables problems to be fixed before they affect real systems.
      
      However, this facility is only applicable to typical locks, such as
      spinlocks and mutexes, which are normally released within the context in
      which they were acquired. Synchronization primitives like page
      locks or completions, which are allowed to be released in any context,
      also create dependencies and can cause a deadlock.
      
      So lockdep should track these locks to do a better job. The 'crossrelease'
      implementation makes lockdep track these primitives as well.
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Cc: willy@infradead.org
      Link: http://lkml.kernel.org/r/1502089981-21272-6-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • B
      locking/lockdep: Make check_prev_add() able to handle external stack_trace · ce07a941
      Byungchul Park committed
      Currently, the space for stack_trace is fixed inside check_prev_add(),
      which prevents callers from using an external stack_trace. The simplest
      way to allow that is to pass an external stack_trace as an argument.
      
      A more suitable solution is to additionally pass a callback along with
      the stack_trace so that callers can decide how to save it, or whether to
      save it at all. In fact, crossrelease needs to do more than just save a
      stack_trace. So pass a stack_trace and a callback to handle it to
      check_prev_add().
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Cc: willy@infradead.org
      Link: http://lkml.kernel.org/r/1502089981-21272-5-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • B
      locking/lockdep: Change the meaning of check_prev_add()'s return value · 70911fdc
      Byungchul Park committed
      Firstly, return 1 instead of 2 when 'prev -> next' dependency already
      exists. Since the value 2 is not referenced anywhere, just return 1
      indicating success in this case.
      
      Secondly, return 2 instead of 1 when a lock_list entry was successfully
      added and its stack_trace saved. With that, a caller can decide whether
      to avoid a redundant save_trace() at the call site.
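
      How a caller might use the new return values can be sketched with a
      simplified model. Everything here is illustrative: the *_model
      functions, the save callback parameter and the dependency array are
      assumptions made for the sketch, not the kernel's actual signatures.

```c
#include <assert.h>
#include <stddef.h>

static int trace_saves;    /* counts how often the expensive save ran */

/* Stand-in for save_trace(); returns non-zero on success. */
static int save_trace_model(void)
{
    trace_saves++;
    return 1;
}

/*
 * Model of the return convention described above:
 *   0 - failure
 *   1 - the dependency already existed, nothing was added
 *   2 - a new entry was added (saving a trace if a callback was given)
 */
static int check_prev_add_model(int *dep_exists, int (*save)(void))
{
    if (*dep_exists)
        return 1;
    if (save && !save())
        return 0;
    *dep_exists = 1;
    return 2;
}

/* Caller: once one trace has been saved, stop saving for later entries. */
static int add_all_model(int *deps, size_t n)
{
    int (*save)(void) = save_trace_model;

    for (size_t i = 0; i < n; i++) {
        int ret = check_prev_add_model(&deps[i], save);

        if (!ret)
            return 0;
        if (ret == 2 && save)
            save = NULL;   /* the trace is already saved; reuse it */
    }
    return 1;
}
```

      In this model, adding several new dependencies in one pass saves the
      stack trace only once, which is the redundancy the return value lets
      the caller avoid.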
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Cc: willy@infradead.org
      Link: http://lkml.kernel.org/r/1502089981-21272-4-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • B
      locking/lockdep: Add a function building a chain between two classes · 49347a98
      Byungchul Park committed
      Crossrelease needs to build a chain between two classes regardless of
      their contexts. However, add_chain_cache() cannot be used for that
      purpose since it assumes that it's called in the acquisition context
      of the hlock. So this patch introduces a new function doing it.
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Cc: willy@infradead.org
      Link: http://lkml.kernel.org/r/1502089981-21272-3-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • B
      locking/lockdep: Refactor lookup_chain_cache() · 545c23f2
      Byungchul Park committed
      Currently, lookup_chain_cache() provides both 'lookup' and 'add'
      functionality in a single function. However, each is useful on its
      own. So this patch makes lookup_chain_cache() only do the 'lookup' and
      makes add_chain_cache() only do the 'add'. This also makes the code
      more readable than before.
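
      The general shape of such a split, one function that only looks up
      and one that only adds, can be sketched on a small hash table. The
      table layout and the names lookup_chain() and add_chain() are
      hypothetical and much simpler than lockdep's chain cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 16u
#define MAX_CHAINS 32u

struct chain {
    uint64_t key;
    struct chain *next;        /* next chain in the same hash bucket */
};

static struct chain *table[TABLE_SIZE];
static struct chain pool[MAX_CHAINS];
static size_t pool_used;

/* Lookup only: never modifies the table. */
static struct chain *lookup_chain(uint64_t key)
{
    for (struct chain *c = table[key % TABLE_SIZE]; c; c = c->next)
        if (c->key == key)
            return c;
    return NULL;
}

/* Add only: the caller is expected to have looked the key up first. */
static struct chain *add_chain(uint64_t key)
{
    struct chain *c;

    if (pool_used == MAX_CHAINS)
        return NULL;           /* out of preallocated entries */
    c = &pool[pool_used++];
    c->key = key;
    c->next = table[key % TABLE_SIZE];
    table[key % TABLE_SIZE] = c;
    return c;
}
```

      A caller that wants the old combined behaviour calls lookup_chain()
      and falls back to add_chain() on a miss, while a caller that needs
      only one of the two operations can now use it directly.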
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Cc: willy@infradead.org
      Link: http://lkml.kernel.org/r/1502089981-21272-2-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • P
      locking/lockdep: Avoid creating redundant links · ae813308
      Peter Zijlstra committed
      Two boots + a make defconfig, the first didn't have the redundant bit
      in, the second did:
      
       lock-classes:                         1168       1169 [max: 8191]
       direct dependencies:                  7688       5812 [max: 32768]
       indirect dependencies:               25492      25937
       all direct dependencies:            220113     217512
       dependency chains:                    9005       9008 [max: 65536]
       dependency chain hlocks:             34450      34366 [max: 327680]
       in-hardirq chains:                      55         51
       in-softirq chains:                     371        378
       in-process chains:                    8579       8579
       stack-trace entries:                108073      88474 [max: 524288]
       combined max dependencies:       178738560  169094640
      
       max locking depth:                      15         15
       max bfs queue depth:                   320        329
      
       cyclic checks:                        9123       9190
      
       redundant checks:                                5046
       redundant links:                                 1828
      
       find-mask forwards checks:            2564       2599
       find-mask backwards checks:          39521      39789
      
      So it saves nearly 2k links and a fair chunk of stack-trace entries, but
      as expected, makes no real difference on the indirect dependencies.
      
      At the same time, you see the max BFS depth increase, which is also
      expected, although it could easily be boot variance -- these numbers are
      not entirely stable between boots.
      
      The down side is that the cycles in the graph become larger and thus
      the reports harder to read.
      
      XXX: do we want this as a CONFIG variable, implied by LOCKDEP_SMALL?
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: iamjoonsoo.kim@lge.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Link: http://lkml.kernel.org/r/20170303091338.GH6536@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • P
      locking/lockdep: Rework FS_RECLAIM annotation · d92a8cfc
      Peter Zijlstra committed
      A while ago someone, and I cannot find the email just now, asked if we
      could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
      like we use for other things like workqueues etc. I think this should
      be possible which allows reducing the 'irq' states and will reduce the
      amount of __bfs() lookups we do.
      
      Removing the 1 IRQ state results in 4 less __bfs() walks per
      dependency, improving lockdep performance. And by moving this
      annotation out of the lockdep code it becomes easier for the mm people
      to extend.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: iamjoonsoo.kim@lge.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • P
      locking: Remove smp_mb__before_spinlock() · a9668cd6
      Peter Zijlstra committed
      Now that there are no users of smp_mb__before_spinlock() left, remove
      it entirely.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • P
      locking: Introduce smp_mb__after_spinlock() · d89e588c
      Peter Zijlstra committed
      Since its inception, our understanding of ACQUIRE, esp. as applied to
      spinlocks, has changed somewhat. Also, I wonder if, with a simple
      change, we cannot make it provide more.
      
      The problem with the comment is that the STORE done by spin_lock isn't
      itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
      it and cross with any prior STORE, rendering the default WMB
      insufficient (pointed out by Alan).
      
      Now, this is only really a problem on PowerPC and ARM64, both of
      which already defined smp_mb__before_spinlock() as a smp_mb().
      
      At the same time, we can get a much stronger construct if we place
      that same barrier _inside_ the spin_lock(). In that case we upgrade
      the RCpc spinlock to an RCsc.  That would make all schedule() calls
      fully transitive against one another.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • P
      overlayfs, locking: Remove smp_mb__before_spinlock() usage · ff7a5fb0
      Peter Zijlstra committed
      While we could replace the smp_mb__before_spinlock() with the new
      smp_mb__after_spinlock(), the normal pattern is to use
      smp_store_release() to publish an object that is used for
      lockless_dereference() -- and mirrors the regular rcu_assign_pointer()
      / rcu_dereference() patterns.
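
      The publish pattern mentioned above can be approximated in portable
      C11. This is a user-space analogy only: atomic_store_explicit() with
      memory_order_release stands in for smp_store_release(), and an acquire
      load is used as a conservative stand-in for the dependency-ordered
      lockless_dereference(). The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct obj {
    int value;
};

/* Shared pointer through which a fully initialised object is published. */
static _Atomic(struct obj *) published;

/* Writer: initialise the object first, then publish it with RELEASE. */
static void publish(struct obj *o, int value)
{
    o->value = value;
    atomic_store_explicit(&published, o, memory_order_release);
}

/*
 * Reader: a non-NULL pointer observed here guarantees that the
 * initialisation done before the release store is visible too.
 */
static int read_published(int fallback)
{
    struct obj *o = atomic_load_explicit(&published, memory_order_acquire);

    return o ? o->value : fallback;
}
```

      The ordering lives in the store and the load themselves, so no
      separate barrier around a lock acquisition is needed for this case.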
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • P
      mm, locking: Rework {set,clear,mm}_tlb_flush_pending() · 8b1b436d
      Peter Zijlstra committed
      Commit:
      
        af2c1401 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")
      
      added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
      can solve the same problem without this barrier.
      
      If instead we mandate that mm_tlb_flush_pending() is used while
      holding the PTL we're guaranteed to observe prior
      set_tlb_flush_pending() instances.
      
      For this to work we need to rework migrate_misplaced_transhuge_page()
      a little and move the test up into do_huge_pmd_numa_page().
      
      NOTE: this relies on flush_tlb_range() to guarantee:
      
         (1) it ensures that prior page table updates are visible to the
             page table walker and
         (2) it ensures that subsequent memory accesses are only made
             visible after the invalidation has completed
      
      This is required for architectures that implement TRANSPARENT_HUGEPAGE
      (arc, arm, arm64, mips, powerpc, s390, sparc, x86) or otherwise use
      mm_tlb_flush_pending() in their page-table operations (arm, arm64,
      x86).
      
      This appears true for:
      
       - arm (DSB ISB before and after),
       - arm64 (DSB ISHST before, and DSB ISH after),
       - powerpc (PTESYNC before and after),
       - s390 and x86 TLB invalidate are serializing instructions
      
      But I failed to understand the situation for:
      
       - arc, mips, sparc
      
      Now SPARC64 is a wee bit special in that flush_tlb_range() is a no-op
      and it flushes the TLBs using arch_{enter,leave}_lazy_mmu_mode()
      inside the PTL. It still needs to guarantee the PTL unlock happens
      _after_ the invalidate completes.
      
      Vineet, Ralf and Dave could you guys please have a look?
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • P
      Documentation/locking/atomic: Add documents for new atomic_t APIs · 706eeb3e
      Peter Zijlstra committed
      Since we've vastly expanded the atomic_t interface in recent years the
      existing documentation is woefully out of date and people seem to get
      confused a bit.
      
      Start a new document to hopefully better explain the current state of
      affairs.
      
      The old atomic_ops.txt also covers bitmaps and a few more details so
      this is not a full replacement and we'll therefore keep that document
      around until such a time that we've managed to write more text to cover
      its entirety.
      
      Also please, ReST people, go away.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • M
      clocksource/arm_arch_timer: Use static_branch_enable_cpuslocked() · 450f9689
      Marc Zyngier committed
      Use the new static_branch_enable_cpuslocked() function to switch
      the workaround static key on the CPU hotplug path.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Leo Yan <leo.yan@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/20170801080257.5056-5-marc.zyngier@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • M
      jump_label: Provide hotplug context variants · 5a40527f
      Marc Zyngier committed
      As using the normal static key API under the hotplug lock is
      pretty much impossible, let's provide variants of some of the
      functions that require the hotplug lock to have already been taken.
      
      These functions are only meant to be used in CPU hotplug callbacks.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Leo Yan <leo.yan@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/20170801080257.5056-4-marc.zyngier@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • M
      jump_label: Split out code under the hotplug lock · 8b7b4128
      Marc Zyngier committed
      In order to later introduce an "already locked" version of some
      of the static key functions, let's split the code into the core stuff
      (the *_cpuslocked functions) and the usual helpers, which now
      take/release the hotplug lock and call into the _cpuslocked
      versions.
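
      The split described above follows a common pattern: a core function
      that assumes the lock is held, and a thin wrapper that takes the lock
      and calls the core. A user-space sketch with a pthread mutex standing
      in for the CPU hotplug lock is shown here; all names ending in _model
      are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t hotplug_lock = PTHREAD_MUTEX_INITIALIZER;
static int hotplug_lock_held;  /* debugging aid for this sketch only */
static int key_count;

static void cpus_lock_model(void)
{
    pthread_mutex_lock(&hotplug_lock);
    hotplug_lock_held = 1;
}

static void cpus_unlock_model(void)
{
    hotplug_lock_held = 0;
    pthread_mutex_unlock(&hotplug_lock);
}

/* Core: the caller must already hold the hotplug lock. */
static void key_inc_cpuslocked_model(void)
{
    assert(hotplug_lock_held);
    key_count++;
}

/* Usual helper: takes the lock, calls the core, releases the lock. */
static void key_inc_model(void)
{
    cpus_lock_model();
    key_inc_cpuslocked_model();
    cpus_unlock_model();
}
```

      Code that already runs with the lock held, such as a hotplug
      callback, calls the _cpuslocked variant directly; everything else
      keeps using the wrapper.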
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Leo Yan <leo.yan@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/20170801080257.5056-3-marc.zyngier@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • M
      jump_label: Move CPU hotplug locking · b70cecf4
      Marc Zyngier committed
      As we're about to rework the locking, let's move the taking and
      release of the CPU hotplug lock to locations that will make its
      reworking completely obvious.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Leo Yan <leo.yan@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/20170801080257.5056-2-marc.zyngier@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • P
      jump_label: Add RELEASE barrier after text changes · d0646a6f
      Peter Zijlstra committed
      In the unlikely case text modification does not fully order things,
      add some extra ordering of our own to ensure we only enable the fast
      path after all text is visible.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • P
      cpuset: Make nr_cpusets private · be040bea
      Paolo Bonzini committed
      Any use of key->enabled (that is static_key_enabled and static_key_count)
      outside jump_label_lock should handle its own serialization.  In the case
      of cpusets_enabled_key, the key is always incremented/decremented under
      cpuset_mutex, and hence the same rule applies to nr_cpusets.  The rule
      *is* respected currently, but the mutex is static so nr_cpusets should
      be static too.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1501601046-35683-4-git-send-email-pbonzini@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • P
      jump_label: Do not use unserialized static_key_enabled() · 7a34bcb8
      Paolo Bonzini committed
      Any use of key->enabled (that is static_key_enabled and static_key_count)
      outside jump_label_lock should handle its own serialization.  The only
      two that are not doing so are the UDP encapsulation static keys.  Change
      them to use static_key_enable, which now correctly tests key->enabled under
      the jump label lock.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1501601046-35683-3-git-send-email-pbonzini@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • P
      jump_label: Fix concurrent static_key_enable/disable() · 1dbb6704
      Paolo Bonzini committed
      static_key_enable/disable are trying to cap the static key count to
      0/1.  However, their use of key->enabled is outside jump_label_lock
      so they do not really ensure that.
      
      Rewrite them to do a quick check for an already enabled (respectively,
      already disabled) key, and then recheck under the jump label lock.  Unlike
      static_key_slow_inc/dec, a failed check under the jump label lock does
      not modify key->enabled.
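
      The "quick check, then recheck under the lock" scheme can be sketched
      in user space as follows. A pthread mutex stands in for the jump label
      lock and a counter stands in for patching the call sites; key_model
      and the *_model functions are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t label_lock = PTHREAD_MUTEX_INITIALIZER;

struct key_model {
    atomic_int enabled;    /* capped to 0 or 1 by the helpers below */
    int patches;           /* how many times the "code" was patched */
};

static void key_enable_model(struct key_model *k)
{
    /* Quick unlocked check: nothing to do if the key is already enabled. */
    if (atomic_load(&k->enabled) > 0)
        return;

    pthread_mutex_lock(&label_lock);
    /* Recheck under the lock; a failed check leaves enabled untouched. */
    if (atomic_load(&k->enabled) == 0) {
        k->patches++;                  /* stands in for patching call sites */
        atomic_store(&k->enabled, 1);
    }
    pthread_mutex_unlock(&label_lock);
}

static void key_disable_model(struct key_model *k)
{
    /* Quick unlocked check: nothing to do if the key is already disabled. */
    if (atomic_load(&k->enabled) == 0)
        return;

    pthread_mutex_lock(&label_lock);
    if (atomic_load(&k->enabled) == 1) {
        atomic_store(&k->enabled, 0);
        k->patches++;
    }
    pthread_mutex_unlock(&label_lock);
}
```

      Repeated enable or disable calls in this model never push the count
      outside 0/1, and only an actual state change triggers patching.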
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1501601046-35683-2-git-send-email-pbonzini@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • K
      locking/rwsem-xadd: Add killable versions of rwsem_down_read_failed() · 83ced169
      Kirill Tkhai committed
      Rename rwsem_down_read_failed() to __rwsem_down_read_failed_common()
      and teach it to abort waiting when a signal is pending and a killable
      state argument is passed.
      
      Note that we shouldn't wake anybody up in the EINTR path, because:
      
      We check for (waiter.task) under the spinlock before we go to the
      out_nolock path. The current task could not be woken up, so there is
      either a writer owning the sem or a writer that is the first waiter.
      In both cases we shouldn't wake anybody. If there is a writer owning
      the sem and we were the only waiter, remove RWSEM_WAITING_BIAS, as
      there are no waiters anymore.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arnd@arndb.de
      Cc: avagin@virtuozzo.com
      Cc: davem@davemloft.net
      Cc: fenghua.yu@intel.com
      Cc: gorcunov@virtuozzo.com
      Cc: heiko.carstens@de.ibm.com
      Cc: hpa@zytor.com
      Cc: ink@jurassic.park.msu.ru
      Cc: mattst88@gmail.com
      Cc: rth@twiddle.net
      Cc: schwidefsky@de.ibm.com
      Cc: tony.luck@intel.com
      Link: http://lkml.kernel.org/r/149789534632.9059.2901382369609922565.stgit@localhost.localdomain
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • K
      locking/rwsem-spinlock: Add killable versions of __down_read() · 0aa1125f
      Kirill Tkhai committed
      Rename __down_read() to __down_read_common() and teach it
      to abort waiting when a signal is pending and a killable
      state argument is passed.
      
      Note that we shouldn't wake anybody up in the EINTR path, because:
      
      We check signal_pending_state() after the (!waiter.task)
      test and under the spinlock. So the current task could not
      have been woken up. This can happen in two cases: a writer owns
      the sem, or a writer is the first waiter of the sem.
      
      If a writer owns the sem, no one else may work
      with it in parallel. The writer will wake somebody when it
      calls up_write() or downgrade_write().
      
      If a writer is the first waiter, it will be woken up
      when the last active reader releases the sem and
      sem->count becomes 0.
      
      Also note that set_current_state() may be moved down
      to schedule() (after the !waiter.task check), as all
      assignments in this type of semaphore (including wake_up)
      occur under the spinlock, so we can't miss anything.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arnd@arndb.de
      Cc: avagin@virtuozzo.com
      Cc: davem@davemloft.net
      Cc: fenghua.yu@intel.com
      Cc: gorcunov@virtuozzo.com
      Cc: heiko.carstens@de.ibm.com
      Cc: hpa@zytor.com
      Cc: ink@jurassic.park.msu.ru
      Cc: mattst88@gmail.com
      Cc: rth@twiddle.net
      Cc: schwidefsky@de.ibm.com
      Cc: tony.luck@intel.com
      Link: http://lkml.kernel.org/r/149789533283.9059.9829416940494747182.stgit@localhost.localdomain
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • P
      locking/osq_lock: Fix osq_lock queue corruption · 50972fe7
      Prateek Sood committed
      Fix the ordering of link creation between node->prev and prev->next in
      osq_lock(). Consider a case in which the optimistic spin queue is
      CPU6->CPU2 and CPU6 has acquired the lock.
      
              tail
                v
        ,-. <- ,-.
        |6|    |2|
        `-' -> `-'
      
      At this point if CPU0 comes in to acquire osq_lock, it will update the
      tail count.
      
        CPU2			CPU0
        ----------------------------------
      
      				       tail
      				         v
      			  ,-. <- ,-.    ,-.
      			  |6|    |2|    |0|
      			  `-' -> `-'    `-'
      
      After the tail count update, if CPU2 starts to unqueue itself from the
      optimistic spin queue, it will find the tail count updated to CPU0 and
      set CPU2's node->next to NULL in osq_wait_next().
      
        unqueue-A
      
      	       tail
      	         v
        ,-. <- ,-.    ,-.
        |6|    |2|    |0|
        `-'    `-'    `-'
      
        unqueue-B
      
        ->tail != curr && !node->next
      
      If the following stores are reordered, then prev->next, where prev
      is CPU2, would be updated to point to the CPU0 node:
      
      				       tail
      				         v
      			  ,-. <- ,-.    ,-.
      			  |6|    |2|    |0|
      			  `-'    `-' -> `-'
      
        osq_wait_next()
          node->next <- 0
          xchg(node->next, NULL)
      
      	       tail
      	         v
        ,-. <- ,-.    ,-.
        |6|    |2|    |0|
        `-'    `-'    `-'
      
        unqueue-C
      
      At this point, if the next instruction
      	WRITE_ONCE(next->prev, prev);
      in the CPU2 path is committed before the update of CPU0 node->prev = prev,
      then CPU0 node->prev will point to the CPU6 node.
      
      	       tail
          v----------. v
        ,-. <- ,-.    ,-.
        |6|    |2|    |0|
        `-'    `-'    `-'
           `----------^
      
      At this point, if the CPU0 path's node->prev = prev is committed, CPU0's
      prev changes back to the CPU2 node. CPU2 node->next is currently
      NULL,
      
      				       tail
      			                 v
      			  ,-. <- ,-. <- ,-.
      			  |6|    |2|    |0|
      			  `-'    `-'    `-'
      			     `----------^
      
      so if CPU0 gets into the unqueue path of osq_lock() it will keep spinning
      in an infinite loop, as the condition prev->next == node will never be true.
      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      [ Added pictures, rewrote comments. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: sramana@codeaurora.org
      Link: http://lkml.kernel.org/r/1500040076-27626-1-git-send-email-prsood@codeaurora.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      50972fe7
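The ordering requirement described in this commit can be illustrated with a small user-space sketch. It uses C11 atomics rather than the kernel's own barrier primitives, and the names `spin_node`, `enqueue_behind` and `successor` are hypothetical; the actual kernel patch is not shown in this log. The point is only that the store to `node->prev` must become visible before the store that publishes the node through `prev->next`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a per-CPU spin queue node. */
struct spin_node {
    struct spin_node *_Atomic next;
    struct spin_node *_Atomic prev;
};

/*
 * Link `node` behind `prev`. The release store to prev->next ensures that
 * any thread observing prev->next == node through an acquire load also
 * observes node->prev == prev, so the two link stores cannot appear
 * reordered to that thread.
 */
static void enqueue_behind(struct spin_node *prev, struct spin_node *node)
{
    atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&node->prev, prev, memory_order_relaxed);
    /* Ordering point: node->prev is published before prev->next. */
    atomic_store_explicit(&prev->next, node, memory_order_release);
}

/* Reader side: pairs with the release store in enqueue_behind(). */
static struct spin_node *successor(struct spin_node *node)
{
    return atomic_load_explicit(&node->next, memory_order_acquire);
}
```

A predecessor that is unqueueing itself would call `successor()` and only then touch the successor's `prev` field, which is the interleaving the diagrams above walk through.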
    • P
      locking/atomic: Fix atomic_set_release() for 'funny' architectures · 9d664c0a
      Peter Zijlstra committed
      Those architectures that have a special atomic_set implementation also
      need a special atomic_set_release(), because for the very same reason
      WRITE_ONCE() is broken for them, smp_store_release() is too.
      
      The vast majority are architectures with a spinlock-hash-based atomic
      implementation; the exception is hexagon, which seems to have a hardware
      'feature'.
      
      The spinlock based atomics should be SC, that is, none of them appear to
      place extra barriers in atomic_cmpxchg() or any of the other SC atomic
      primitives and therefore seem to rely on their spinlock implementation
      being SC (I did not fully validate all that).
      
      Therefore, the normal atomic_set() is SC and can be used as
      atomic_set_release().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: davem@davemloft.net
      Cc: james.hogan@imgtec.com
      Cc: jejb@parisc-linux.org
      Cc: rkuo@codeaurora.org
      Cc: vgupta@synopsys.com
      Link: http://lkml.kernel.org/r/20170609110506.yod47flaav3wgoj5@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9d664c0a
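A minimal user-space sketch of why a lock-hashed atomic needs a locked set, assuming a hypothetical `my_atomic_t` type and a hypothetical 16-entry lock table (none of these names come from the kernel). If a set were a plain store, it could land between the load and the store of a concurrent locked read-modify-write and be overwritten. Taking the same hashed lock in the setter avoids that, and since the lock operations here are sequentially consistent, the release variant can simply reuse the locked setter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_LOCKS 16                      /* hypothetical hash table size */

typedef struct { int counter; } my_atomic_t;   /* plain int guarded by a hashed lock */

static atomic_bool lock_table[NR_LOCKS]; /* static storage: starts out false */

/* Pick the lock that guards a given atomic variable. */
static atomic_bool *lock_for(const void *addr)
{
    return &lock_table[((uintptr_t)addr >> 4) % NR_LOCKS];
}

static void hash_lock(atomic_bool *l)
{
    /* Sequentially consistent exchange; spin until we flip false to true. */
    while (atomic_exchange(l, true))
        ;
}

static void hash_unlock(atomic_bool *l)
{
    atomic_store(l, false);
}

/* The set takes the same lock as every read-modify-write operation. */
static void my_atomic_set(my_atomic_t *v, int i)
{
    atomic_bool *l = lock_for(v);

    hash_lock(l);
    v->counter = i;
    hash_unlock(l);
}

static int my_atomic_add_return(my_atomic_t *v, int i)
{
    atomic_bool *l = lock_for(v);
    int ret;

    hash_lock(l);
    ret = (v->counter += i);
    hash_unlock(l);
    return ret;
}

/* The locked set is already fully ordered, so it doubles as set_release. */
#define my_atomic_set_release(v, i) my_atomic_set((v), (i))
```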
    • B
      sched/wait: Remove the lockless swait_active() check in swake_up*() · 35a2897c
      Boqun Feng committed
      Steven Rostedt reported a potential race in RCU core because of
      swake_up():
      
              CPU0                            CPU1
              ----                            ----
                                      __call_rcu_core() {
      
                                       spin_lock(rnp_root)
                                       need_wake = __rcu_start_gp() {
                                        rcu_start_gp_advanced() {
                                         gp_flags = FLAG_INIT
                                        }
                                       }
      
       rcu_gp_kthread() {
         swait_event_interruptible(wq,
              gp_flags & FLAG_INIT) {
         spin_lock(q->lock)
      
                                      *fetch wq->task_list here! *
      
         list_add(wq->task_list, q->task_list)
         spin_unlock(q->lock);
      
         *fetch old value of gp_flags here *
      
                                       spin_unlock(rnp_root)
      
                                       rcu_gp_kthread_wake() {
                                        swake_up(wq) {
                                         swait_active(wq) {
                                          list_empty(wq->task_list)
      
                                         } * return false *
      
        if (condition) * false *
          schedule();
      
      In this case, a wakeup is missed, which could cause the rcu_gp_kthread
      to wait for a long time.
      
      The reason for this is that we do a lockless swait_active() check in
      swake_up(). To fix this, we can either 1) add an smp_mb() in swake_up()
      before swait_active() to provide the proper order or 2) simply remove
      the swait_active() in swake_up().
      
      Solution 2 not only fixes this problem but also keeps the swait and
      wait API as close as possible, as wake_up() doesn't provide a full
      barrier and doesn't do a lockless check of the wait queue either.
      Moreover, there are users already using swait_active() to do their quick
      checks for the wait queues, so it makes less sense for swake_up() and
      swake_up_all() to do this on their own.
      
      This patch then removes the lockless swait_active() check in swake_up()
      and swake_up_all().
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Krister Johansen <kjlx@templeofstupid.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170615041828.zk3a3sfyudm5p6nl@tardis
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      35a2897c
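The shape of solution 2 can be sketched in user space as follows. This is an illustrative model only: `wait_queue`, `waiter`, `wq_add_waiter` and `wq_wake_one` are hypothetical names, the lock is a simple C11 spin flag rather than a kernel spinlock, and waiter ordering is not modeled. What it shows is that the waker inspects the list only after taking the queue lock, so it cannot act on a stale "empty" observation made before a waiter enqueued itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct waiter {
    struct waiter *next;
    bool woken;
};

struct wait_queue {
    atomic_bool lock;       /* stands in for the queue's spinlock */
    struct waiter *head;    /* only touched with the lock held */
};

static void wq_lock(struct wait_queue *q)
{
    while (atomic_exchange(&q->lock, true))
        ;
}

static void wq_unlock(struct wait_queue *q)
{
    atomic_store(&q->lock, false);
}

/* Waiter side: enqueue under the lock before re-checking the condition. */
static void wq_add_waiter(struct wait_queue *q, struct waiter *w)
{
    wq_lock(q);
    w->woken = false;
    w->next = q->head;
    q->head = w;
    wq_unlock(q);
}

/*
 * Waker side: no lockless emptiness check before the lock is taken.
 * Returns true if a waiter was woken.
 */
static bool wq_wake_one(struct wait_queue *q)
{
    struct waiter *w;

    wq_lock(q);
    w = q->head;
    if (w) {
        q->head = w->next;
        w->next = NULL;
        w->woken = true;
    }
    wq_unlock(q);
    return w != NULL;
}
```

Callers that want a cheap early exit can still do their own emptiness check, but they then own the responsibility for the memory ordering around it.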
    • I
      388f8e12
    • L
      Merge tag 'pinctrl-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 8d31f80e
      Linus Torvalds committed
      Pull pin control fixes from Linus Walleij:
       "These are the pin control fixes I have gathered since the return from
        my vacation. They boiled in -next a while so let's get them in.
      
        Apart from the documentation build it is purely driver fixes. Which is
        nice. The Intel fixes seem kind of important.
      
         - Fix the documentation build as the docs were moved
      
         - Correct the UART pin list on the Intel Merrifield
      
         - Fix pin assignment and number of pins on the Marvell Armada 37xx
           pin controller
      
         - Cover the Setzer models in the Chromebook DMI quirk in the Intel
           cherryview driver so they start working
      
         - Add the missing "sim" function to the sunxi driver
      
         - Fix USB pin definitions on Uniphier Pro4
      
         - Smatch fix for invalid reference in the zx pin control driver"
      
      * tag 'pinctrl-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: generic: update references to Documentation/pinctrl.txt
        pinctrl: intel: merrifield: Correct UART pin lists
        pinctrl: armada-37xx: Fix number of pin in south bridge
        pinctrl: armada-37xx: Fix the pin 23 on south bridge
        pinctrl: cherryview: Add Setzer models to the Chromebook DMI quirk
        pinctrl: sunxi: add a missing function of A10/A20 pinctrl driver
        pinctrl: uniphier: fix USB3 pin assignment for Pro4
        pinctrl: zte: fix dereference of 'data' in zx_set_mux()
      8d31f80e
    • M
      futex: Remove unnecessary warning from get_futex_key · 48fb6f4d
      Mel Gorman committed
      Commit 65d8fc77 ("futex: Remove requirement for lock_page() in
      get_futex_key()") removed an unnecessary lock_page() with the
      side-effect that page->mapping needed to be treated very carefully.
      
      Two defensive warnings were added in case any assumption was missed and
      the first warning assumed a correct application would not alter a
      mapping backing a futex key.  Since merging, it has not triggered for
      any unexpected case but Mark Rutland reported the following bug
      triggering due to the first warning.
      
        kernel BUG at kernel/futex.c:679!
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3
        Hardware name: linux,dummy-virt (DT)
        task: ffff80001e271780 task.stack: ffff000010908000
        PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        pc : [<ffff00000821ac14>] lr : [<ffff00000821ac14>] pstate: 80000145
      
      The fact that it's a bug instead of a warning was due to an unrelated
      arm64 problem, but the warning itself triggered because the underlying
      mapping changed.
      
      This is an application issue but from a kernel perspective it's a
      recoverable situation and the warning is unnecessary so this patch
      removes the warning.  The warning may potentially be triggered with the
      following test program from Mark although it may be necessary to adjust
      NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the
      system.
      
          #include <linux/futex.h>
          #include <pthread.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <sys/time.h>
          #include <unistd.h>
      
          #define NR_FUTEX_THREADS 16
          pthread_t threads[NR_FUTEX_THREADS];
      
          void *mem;
      
          #define MEM_PROT  (PROT_READ | PROT_WRITE)
          #define MEM_SIZE  65536
      
          static int futex_wrapper(int *uaddr, int op, int val,
                                   const struct timespec *timeout,
                                   int *uaddr2, int val3)
          {
              /* Raw futex(2) call; the result is passed back to the caller. */
              return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
          }
      
          void *poll_futex(void *unused)
          {
              for (;;) {
                  futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
              }
          }
      
          int main(int argc, char *argv[])
          {
              int i;
      
              mem = mmap(NULL, MEM_SIZE, MEM_PROT,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      
              printf("Mapping @ %p\n", mem);
      
              printf("Creating futex threads...\n");
      
              for (i = 0; i < NR_FUTEX_THREADS; i++)
                  pthread_create(&threads[i], NULL, poll_futex, NULL);
      
              printf("Flipping mapping...\n");
              for (;;) {
                  mmap(mem, MEM_SIZE, MEM_PROT,
                       MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
              }
      
              return 0;
          }
      Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org # 4.7+
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48fb6f4d
    • L
      Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 358f8c26
      Linus Torvalds committed
      Pull i2c fixes from Wolfram Sang:
       "The main thing is to allow empty id_tables for ACPI to make some
        drivers get probed again. It looks a bit bigger than usual because it
        needs some internal renaming, too.
      
        Other than that, there is a fix for broken DSTDs, a super simple
        enablement for ARM MPS, and two documentation fixes which I'd like to
        see in v4.13 already"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: rephrase explanation of I2C_CLASS_DEPRECATED
        i2c: allow i2c-versatile for ARM MPS platforms
        i2c: designware: Some broken DSTDs use 1MiHz instead of 1MHz
        i2c: designware: Print clock freq on invalid clock freq error
        i2c: core: Allow empty id_table in ACPI case as well
        i2c: mux: pinctrl: mention correct module name in Kconfig help text
      358f8c26
    • L
      Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 31cf92f3
      Linus Torvalds committed
      Pull block fixes from Jens Axboe:
       "Three patches that should go into this release.
      
        Two of them are from Paolo and fix up some corner cases with BFQ, and
        the last patch is from Ming and fixes up a potential usage count
        imbalance regression due to the recent NOWAIT work"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed
        block, bfq: consider also in_service_entity to state whether an entity is active
        block, bfq: reset in_service_entity if it becomes idle
      31cf92f3
    • L
      Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · d555eb6b
      Linus Torvalds committed
      Pull crypto fixes from Herbert Xu:
       "Fix two regressions in the inside-secure driver with respect to
        hmac(sha1)"
      
      * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: inside-secure - fix the sha state length in hmac_sha1_setkey
        crypto: inside-secure - fix invalidation check in hmac_sha1_setkey
      d555eb6b
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 4530cca1
      Linus Torvalds committed
      Pull networking fixes from David Miller:
       "The pull requests are getting smaller, that's progress I suppose :-)
      
         1) Fix infinite loop in CIPSO option parsing, from Yujuan Qi.
      
         2) Fix remote checksum handling in VXLAN and GUE tunneling drivers,
            from Koichiro Den.
      
         3) Missing u64_stats_init() calls in several drivers, from Florian
            Fainelli.
      
         4) TCP can set the congestion window to an invalid ssthresh value
            after congestion window reductions, from Yuchung Cheng.
      
         5) Fix BPF jit branch generation on s390, from Daniel Borkmann.
      
         6) Correct MIPS ebpf JIT merge, from David Daney.
      
         7) Correct byte order test in BPF test_verifier.c, from Daniel
            Borkmann.
      
         8) Fix various crashes and leaks in ASIX driver, from Dean Jenkins.
      
         9) Handle SCTP checksums properly in mlx4 driver, from Davide
            Caratti.
      
        10) We can potentially enter tcp_connect() with a cached route
            already, due to fastopen, so we have to explicitly invalidate it.
      
        11) skb_warn_bad_offload() can bark in legitimate situations, fix from
            Willem de Bruijn"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
        net: avoid skb_warn_bad_offload false positives on UFO
        qmi_wwan: fix NULL deref on disconnect
        ppp: fix xmit recursion detection on ppp channels
        rds: Reintroduce statistics counting
        tcp: fastopen: tcp_connect() must refresh the route
        net: sched: set xt_tgchk_param par.net properly in ipt_init_target
        net: dsa: mediatek: add adjust link support for user ports
        net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets
        qed: Fix a memory allocation failure test in 'qed_mcp_cmd_init()'
        hysdn: fix to a race condition in put_log_buffer
        s390/qeth: fix L3 next-hop in xmit qeth hdr
        asix: Fix small memory leak in ax88772_unbind()
        asix: Ensure asix_rx_fixup_info members are all reset
        asix: Add rx->ax_skb = NULL after usbnet_skb_return()
        bpf: fix selftest/bpf/test_pkt_md_access on s390x
        netvsc: fix race on sub channel creation
        bpf: fix byte order test in test_verifier
        xgene: Always get clk source, but ignore if it's missing for SGMII ports
        MIPS: Add missing file for eBPF JIT.
        bpf, s390: fix build for libbpf and selftest suite
        ...
      4530cca1
  2. 09 Aug, 2017 6 commits
    • W
      net: avoid skb_warn_bad_offload false positives on UFO · 8d63bee6
      Willem de Bruijn committed
      skb_warn_bad_offload triggers a warning when an skb enters the GSO
      stack at __skb_gso_segment that does not have CHECKSUM_PARTIAL
      checksum offload set.
      
      Commit b2504a5d ("net: reduce skb_warn_bad_offload() noise")
      observed that SKB_GSO_DODGY producers can trigger the check and
      that passing those packets through the GSO handlers will fix it
      up. But, the software UFO handler will set ip_summed to
      CHECKSUM_NONE.
      
      When __skb_gso_segment is called from the receive path, this
      triggers the warning again.
      
      Make UFO set CHECKSUM_UNNECESSARY instead of CHECKSUM_NONE. On
      Tx these two are equivalent. On Rx, this better matches the
      skb state (checksum computed), as CHECKSUM_NONE here means no
      checksum computed.
      
      See also this thread for context:
      http://patchwork.ozlabs.org/patch/799015/
      
      Fixes: b2504a5d ("net: reduce skb_warn_bad_offload() noise")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8d63bee6
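The effect of switching the software UFO result from CHECKSUM_NONE to CHECKSUM_UNNECESSARY can be illustrated with a toy predicate. The enum names mirror the kernel's ip_summed states, but `gso_would_warn` is a hypothetical function and its condition is an assumption reconstructed from the description above, not a copy of the kernel's check.

```c
#include <assert.h>
#include <stdbool.h>

/* Names mirror the kernel's ip_summed states; the values are not significant here. */
enum ip_summed {
    CHECKSUM_NONE,
    CHECKSUM_UNNECESSARY,
    CHECKSUM_COMPLETE,
    CHECKSUM_PARTIAL,
};

/*
 * Assumed, simplified form of the test that precedes the warning:
 * on the transmit path a packet is acceptable if its checksum is either
 * offloaded (PARTIAL) or already computed (UNNECESSARY); on the receive
 * path only "no checksum computed" (NONE) is suspicious.
 */
static bool gso_would_warn(enum ip_summed s, bool tx_path)
{
    if (tx_path)
        return s != CHECKSUM_PARTIAL && s != CHECKSUM_UNNECESSARY;
    return s == CHECKSUM_NONE;
}
```

Under this assumed predicate, a packet that left the UFO handler marked CHECKSUM_NONE would trip the receive-path test, while one marked CHECKSUM_UNNECESSARY passes on both paths.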
    • B
      qmi_wwan: fix NULL deref on disconnect · bbae08e5
      Bjørn Mork committed
      qmi_wwan_disconnect is called twice when disconnecting devices with
      separate control and data interfaces.  The first invocation will set
      the interface data to NULL for both interfaces to flag that the
      disconnect has been handled.  But the matching NULL check was left
      out when qmi_wwan_disconnect was added, resulting in this oops:
      
        usb 2-1.4: USB disconnect, device number 4
        qmi_wwan 2-1.4:1.6 wwp0s29u1u4i6: unregister 'qmi_wwan' usb-0000:00:1d.0-1.4, WWAN/QMI device
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000e0
        IP: qmi_wwan_disconnect+0x25/0xc0 [qmi_wwan]
        PGD 0
        P4D 0
        Oops: 0000 [#1] SMP
        Modules linked in: <stripped irrelevant module list>
        CPU: 2 PID: 33 Comm: kworker/2:1 Tainted: G            E   4.12.3-nr44-normandy-r1500619820+ #1
        Hardware name: LENOVO 4291LR7/4291LR7, BIOS CBET4000 4.6-810-g50522254fb 07/21/2017
        Workqueue: usb_hub_wq hub_event [usbcore]
        task: ffff8c882b716040 task.stack: ffffb8e800d84000
        RIP: 0010:qmi_wwan_disconnect+0x25/0xc0 [qmi_wwan]
        RSP: 0018:ffffb8e800d87b38 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffff8c8824f3f1d0 RDI: ffff8c8824ef6400
        RBP: ffff8c8824ef6400 R08: 0000000000000000 R09: 0000000000000000
        R10: ffffb8e800d87780 R11: 0000000000000011 R12: ffffffffc07ea0e8
        R13: ffff8c8824e2e000 R14: ffff8c8824e2e098 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff8c8835300000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000e0 CR3: 0000000229ca5000 CR4: 00000000000406e0
        Call Trace:
         ? usb_unbind_interface+0x71/0x270 [usbcore]
         ? device_release_driver_internal+0x154/0x210
         ? qmi_wwan_unbind+0x6d/0xc0 [qmi_wwan]
         ? usbnet_disconnect+0x6c/0xf0 [usbnet]
         ? qmi_wwan_disconnect+0x87/0xc0 [qmi_wwan]
         ? usb_unbind_interface+0x71/0x270 [usbcore]
         ? device_release_driver_internal+0x154/0x210
      Reported-and-tested-by: Nathaniel Roach <nroach44@gmail.com>
      Fixes: c6adf779 ("net: usb: qmi_wwan: add qmap mux protocol support")
      Cc: Daniele Palmas <dnlplm@gmail.com>
      Signed-off-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bbae08e5
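The double-disconnect pattern described above can be sketched as follows. The types `dev_state` and `iface` and the function `iface_disconnect` are hypothetical stand-ins, not the driver's real structures. The first call tears down the shared state and clears the pointer on both interfaces; the second call sees NULL and returns early.

```c
#include <assert.h>
#include <stddef.h>

struct dev_state { int teardown_count; };   /* hypothetical shared driver state */
struct iface { struct dev_state *data; };   /* hypothetical stand-in for a USB interface */

/*
 * Called once per interface. With separate control and data interfaces
 * it therefore runs twice for the same device.
 */
static void iface_disconnect(struct iface *intf, struct iface *sibling)
{
    struct dev_state *dev = intf->data;

    /* Second invocation: the first one already cleared the pointer. */
    if (!dev)
        return;

    dev->teardown_count++;

    /* Flag both interfaces as handled. */
    intf->data = NULL;
    if (sibling)
        sibling->data = NULL;
}
```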
    • G
      ppp: fix xmit recursion detection on ppp channels · 0a0e1a85
      Guillaume Nault committed
      Commit e5dadc65 ("ppp: Fix false xmit recursion detect with two ppp
      devices") dropped the xmit_recursion counter incrementation in
      ppp_channel_push() and relied on ppp_xmit_process() for this task.
      But __ppp_channel_push() can also send packets directly (using the
      .start_xmit() channel callback), in which case the xmit_recursion
      counter isn't incremented anymore. If such packets get routed back to
      the parent ppp unit, ppp_xmit_process() won't notice the recursion and
      will call ppp_channel_push() on the same channel, effectively creating
      the deadlock situation that the xmit_recursion mechanism was supposed
      to prevent.
      
      This patch re-introduces the xmit_recursion counter incrementation in
      ppp_channel_push(). Since the xmit_recursion variable is now part of
      the parent ppp unit, incrementation is skipped if the channel doesn't
      have any. This is fine because only packets routed through the parent
      unit may enter the channel recursively.
      
      Finally, we have to ensure that pch->ppp is not going to be modified
      while executing ppp_channel_push(). Instead of taking this lock only
      while calling ppp_xmit_process(), we now have to hold it for the full
      ppp_channel_push() execution. This respects the ppp locks ordering
      which requires locking ->upl before ->downl.
      
      Fixes: e5dadc65 ("ppp: Fix false xmit recursion detect with two ppp devices")
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a0e1a85
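A toy model of the recursion guard, under stated assumptions: `unit`, `channel`, `unit_xmit`, `channel_push` and `channel_start_xmit` are hypothetical names, the counter is a plain int rather than per-CPU data, and locking is not modeled. It shows why the direct push path must bump the parent's counter too: otherwise a packet routed back to the parent unit would not be recognized as a re-entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct unit { int xmit_recursion; int dropped; };
struct channel { struct unit *parent; bool loops_back; int sent; };

static void channel_start_xmit(struct channel *ch);

/* Entry point for packets handed to the unit. */
static void unit_xmit(struct unit *u, struct channel *ch)
{
    if (u->xmit_recursion) {
        /* Re-entered from a lower layer: drop instead of recursing. */
        u->dropped++;
        return;
    }
    u->xmit_recursion++;
    channel_start_xmit(ch);
    u->xmit_recursion--;
}

/* Direct push on a channel, bypassing unit_xmit(). */
static void channel_push(struct channel *ch)
{
    struct unit *u = ch->parent;

    /* Guard the direct send path; skipped when there is no parent unit. */
    if (u)
        u->xmit_recursion++;
    channel_start_xmit(ch);
    if (u)
        u->xmit_recursion--;
}

/* Channel transmit; optionally routes the packet back to the parent unit. */
static void channel_start_xmit(struct channel *ch)
{
    ch->sent++;
    if (ch->loops_back && ch->parent)
        unit_xmit(ch->parent, ch);
}
```

Removing the two increments in `channel_push()` lets the looped-back packet enter `unit_xmit()` with a zero counter and transmit a second time on the same channel, which in the kernel corresponds to re-taking a lock that is already held.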
    • H
      rds: Reintroduce statistics counting · 05bfd7db
      Håkon Bugge committed
      In commit 7e3f2952 ("rds: don't let RDS shutdown a connection
      while senders are present"), refilling the receive queue was removed
      from rds_ib_recv(), along with the increment of
      s_ib_rx_refill_from_thread.
      
      Commit 73ce4317 ("RDS: make sure we post recv buffers")
      re-introduces filling the receive queue from rds_ib_recv(), but does
      not add the statistics counter. rds_ib_recv() was later renamed to
      rds_ib_recv_path().
      
      This commit reintroduces the statistics counting of
      s_ib_rx_refill_from_thread and s_ib_rx_refill_from_cq.
      Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
      Reviewed-by: Knut Omang <knut.omang@oracle.com>
      Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
      Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      05bfd7db
    • E
      tcp: fastopen: tcp_connect() must refresh the route · 8ba60924
      Eric Dumazet committed
      With the new TCP_FASTOPEN_CONNECT socket option, it is possible
      to call tcp_connect() while the socket's sk_dst_cache is either NULL
      or invalid.
      
       +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
       +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
       +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
       +0 connect(4, ..., ...) = 0
      
      << sk->sk_dst_cache becomes obsolete, or even set to NULL >>
      
       +1 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
      
      We need to refresh the route otherwise bad things can happen,
      especially when syzkaller is running on the host :/
      
      Fixes: 19f6d3f3 ("net/tcp-fastopen: Add new API support")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Wei Wang <weiwan@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8ba60924
    • X
      net: sched: set xt_tgchk_param par.net properly in ipt_init_target · ec0acb09
      Xin Long committed
      The xt_tgchk_param par in ipt_init_target is now a local variable,
      and par.net is not initialized there. Later, when xt_check_target
      calls the target's checkentry, which may access par.net, this
      can cause a kernel panic.
      
      Jaroslav found this panic when running:
      
        # ip link add TestIface type dummy
        # tc qd add dev TestIface ingress handle ffff:
        # tc filter add dev TestIface parent ffff: u32 match u32 0 0 \
          action xt -j CONNMARK --set-mark 4
      
      This patch passes the net param into ipt_init_target and sets
      par.net from it there.
      
      v1->v2:
        As Wang Cong pointed out, I missed that ipt_net_id != xt_net_id, so
        fix it by also passing net_id to __tcf_ipt_init.
      v2->v3:
        Missed the fixes tag, so add it.
      
      Fixes: ecb2421b ("netfilter: add and use nf_ct_netns_get/put")
      Reported-by: Jaroslav Aster <jaster@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec0acb09
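The bug class here is a stack-allocated parameter block with a field left unset before it is handed to a callback. A minimal sketch, assuming hypothetical stand-in types (`struct net`, `struct check_param`) and functions (`checkentry`, `init_target`) that only resemble the kernel ones in spirit:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct net { int id; };   /* hypothetical stand-in for a network namespace */
struct check_param { struct net *net; const char *table; };

/* Stand-in for a target's checkentry hook, which reads par->net. */
static int checkentry(const struct check_param *par)
{
    if (!par->net)        /* a real hook might dereference here and crash */
        return -1;
    return par->net->id >= 0 ? 0 : -1;
}

/* The caller passes the namespace down so the local block can be filled in. */
static int init_target(struct net *net, const char *table)
{
    struct check_param par;

    memset(&par, 0, sizeof(par));   /* no indeterminate fields in the local */
    par.net = net;
    par.table = table;
    return checkentry(&par);
}
```

Zeroing the local first means a forgotten field shows up as a NULL pointer rather than as stack garbage, which makes this kind of omission easier to catch.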