1. 08 6月, 2017 23 次提交
    • P
      srcu: Add DEBUG_OBJECTS_RCU_HEAD functionality · a602538e
      Paul E. McKenney 提交于
      This commit adds DEBUG_OBJECTS_RCU_HEAD checking to detect call_srcu()
      counterparts to double-free bugs.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      a602538e
    • P
      srcu: Shrink Tiny SRCU a bit · d4efe6c5
      Paul E. McKenney 提交于
      In Tiny SRCU, __srcu_read_lock() is a trivial function, outweighed by
      its EXPORT_SYMBOL_GPL(), and on many architectures, its call sequence.
      This commit therefore moves it to srcutiny.h so that it can be inlined.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      d4efe6c5
    • P
      rcu: Add lockdep_assert_held() teeth to tree_plugin.h · ea9b0c8a
      Paul E. McKenney 提交于
      Comments can be helpful, but assertions carry more force.  This commit
      therefore adds lockdep_assert_held() and RCU_LOCKDEP_WARN() calls to
      enforce lock-held and interrupt-disabled preconditions.
      Reported-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      ea9b0c8a
    • P
      rcu: Add lockdep_assert_held() teeth to tree.c · c0b334c5
      Paul E. McKenney 提交于
      Comments can be helpful, but assertions carry more force.  This
      commit therefore adds lockdep_assert_held() and RCU_LOCKDEP_WARN()
      calls to enforce lock-held and interrupt-disabled preconditions.
      Reported-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      c0b334c5
    • P
      srcu: Print non-default exp_holdoff values at boot time · 0c8e0e3c
      Paul E. McKenney 提交于
      This commit makes srcu_bootup_announce() check for non-default values
      of the auto-expedite holdoff time exp_holdoff and print a message if so.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      0c8e0e3c
    • P
      srcu: Make exp_holdoff module parameter be static · b5815e6c
      Paul E. McKenney 提交于
      Because exp_holdoff is not used outside of srcutree.c, it can be static.
      This commit therefore makes this change.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      b5815e6c
    • P
      rcu: Update rcu_bootup_announce_oddness() · 17c7798b
      Paul E. McKenney 提交于
      This commit updates rcu_bootup_announce_oddness() to check additional
      Kconfig options and module/boot parameters.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      17c7798b
    • P
      rcu: Print out rcupdate.c non-default boot-time settings · 59d80fd8
      Paul E. McKenney 提交于
      This commit adds a rcupdate_announce_bootup_oddness() function to
      print out non-default values of significant kernel boot parameter
      settings to aid in debugging.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      59d80fd8
    • P
      rcu: Add preemptibility checks in rcu_sched_qs() and rcu_bh_qs() · f4687d26
      Paul E. McKenney 提交于
      This commit adds WARN_ON_ONCE() calls that trigger if either
      rcu_sched_qs() or rcu_bh_qs() are invoked with preemption enabled.
      In the immortal words of Peter Zijlstra: "these are much harder to ignore
      than comments".
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      f4687d26
    • P
      rcuperf: Add writer_holdoff boot parameter · 820687a7
      Paul E. McKenney 提交于
      This commit adds a writer_holdoff boot parameter to rcuperf, which is
      intended to be used to test Tree SRCU's auto-expediting.  This
      boot parameter is in microseconds, and defaults to zero (that is,
      disabled).  Set it to a bit larger than srcutree.exp_holdoff,
      keeping the nanosecond/microsecond conversion, to force Tree SRCU
      to auto-expedite more aggressively.
      
      This commit also adds documentation for this parameter, and fixes some
      alphabetization while in the neighborhood.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      820687a7
    • P
      rcuperf: Set more user-friendly defaults · 492b95e5
      Paul E. McKenney 提交于
      Common-case use of rcuperf must set rcuperf.nreaders=0 and if not built
      as a module, rcuperf.shutdown.  This commit therefore sets the default
      for rcuperf.nreaders to zero and sets the default for rcuperf.shutdown
      to zero if rcuperf is built as a module and to one otherwise.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      492b95e5
    • P
      srcu: Shrink Tiny SRCU a bit more · 3ddf20c9
      Paul E. McKenney 提交于
      This commit rearranges Tiny SRCU's srcu_struct structure, substitutes
      u8 for bool, and shrinks counters down to short.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      3ddf20c9
    • P
      srcu: Make Classic and Tree SRCU announce themselves at bootup · 1f4f6da1
      Paul E. McKenney 提交于
      Currently, the only way to tell whether a given kernel is running
      Classic, Tiny, or Tree SRCU is to look at the .config file, which
      can easily be lost or associated with the wrong kernel.  This commit
      therefore has Classic and Tree SRCU identify themselves at boot time.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      1f4f6da1
    • P
      rcuperf: Add test for dynamically initialized srcu_struct · f60cb4d4
      Paul E. McKenney 提交于
      This commit adds a perf_type of "srcud", which species that rcuperf
      test SRCU on a dynamically initialized srcu_struct.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      f60cb4d4
    • P
      rcu: Make sync_rcu_preempt_exp_done() return bool · dcfc315b
      Paul E. McKenney 提交于
      The sync_rcu_preempt_exp_done() function returns a logical expression,
      but its return type is nevertheless int.  This commit therefore changes
      the return type to bool.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      dcfc315b
    • P
      rcuperf: Add ability to performance-test call_rcu() and friends · 881ed593
      Paul E. McKenney 提交于
      This commit upgrades rcuperf so that it can do performance testing on
      asynchronous grace-period primitives such as call_srcu().  There is
      a new rcuperf.gp_async module parameter that specifies this new behavior,
      with the pre-existing rcuperf.gp_exp testing expedited grace periods such as
      synchronize_rcu_expedited, and with the default being to test synchronous
      non-expedited grace periods such as synchronize_rcu().
      
      There is also a new rcuperf.gp_async_max module parameter that specifies
      the maximum number of outstanding callbacks per writer kthread, defaulting
      to 1,000.  When this limit is exceeded, the writer thread invokes the
      appropriate flavor of rcu_barrier() to wait for callbacks to drain.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Removed the redundant initialization noted by Arnd Bergmann. ]
      881ed593
    • P
      rcu: Remove obsolete reference to synchronize_kernel() · e28371c8
      Paul E. McKenney 提交于
      The synchronize_kernel() primitive was removed in favor of
      synchronize_sched() more than a decade ago, and it seems likely that
      rather few kernel hackers are familiar with it.  Its continued presence
      is therefore providing more confusion than enlightenment.  This commit
      therefore removes the reference from the synchronize_sched() header
      comment, and adds the corresponding information to the synchronize_rcu(0
      header comment.
      Reported-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      e28371c8
    • P
      rcuperf: Defer expedited/normal check to end of test · 9683937d
      Paul E. McKenney 提交于
      Current rcuperf startup checks to see if the user asked to measure
      only expedited grace periods, yet constrained all grace periods to be
      normal, or if the user asked to measure only normal grace periods, yet
      constrained all grace periods to be expedited.  Useless tests of this
      sort are aborted.
      
      Unfortunately, making RCU work through the mid-boot dead zone [1] puts
      RCU into expedited-only mode during that zone.  Which happens to also
      be the exact time that rcuperf carries out the aforementioned check.
      So if the user asks rcuperf to measure only normal grace periods (the
      default), rcuperf will now always complain and terminate the test.
      
      This commit therefore moves the checks to rcu_perf_cleanup().  This has
      the disadvantage of failing to abort useless tests, but avoids the need to
      create yet another kthread and the need to do fiddly checks involving the
      holdoff time.  (Yes, another approach is to do the checks in a late-stage
      init function, but that would require some way to communicate badness
      to rcuperf's kthreads, and seems not worth the bother.)
      
      [1] https://lwn.net/Articles/716148/Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      9683937d
    • P
      rcu: Complain if blocking in preemptible RCU read-side critical section · 5b72f964
      Paul E. McKenney 提交于
      Although preemptible RCU allows its read-side critical sections to be
      preempted, general blocking is forbidden.  The reason for this is that
      excessive preemption times can be handled by CONFIG_RCU_BOOST=y, but a
      voluntarily blocked task doesn't care how high you boost its priority.
      Because preemptible RCU is a global mechanism, one ill-behaved reader
      hurts everyone.  Hence the prohibition against general blocking in
      RCU-preempt read-side critical sections.  Preemption yes, blocking no.
      
      This commit enforces this prohibition.
      
      There is a special exception for the -rt patchset (which they kindly
      volunteered to implement):  It is OK to block (as opposed to merely being
      preempted) within an RCU-preempt read-side critical section, but only if
      the blocking is subject to priority inheritance.  This exception permits
      CONFIG_RCU_BOOST=y to get -rt RCU readers out of trouble.
      
      Why doesn't this exception also apply to mainline's rt_mutex?  Because
      of the possibility that someone does general blocking while holding
      an rt_mutex.  Yes, the priority boosting will affect the rt_mutex,
      but it won't help with the task doing general blocking while holding
      that rt_mutex.
      Reported-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      5b72f964
    • P
      srcu: Eliminate possibility of destructive counter overflow · 881ec9d2
      Paul E. McKenney 提交于
      Earlier versions of Tree SRCU were subject to a counter overflow bug that
      could theoretically result in too-short grace periods.  This commit
      eliminates this problem by adding an update-side memory barrier.
      The short explanation is that if the updater sums the unlock counts
      too late to see a given __srcu_read_unlock() increment, that CPU's
      next __srcu_read_lock() must see the new value of ->srcu_idx, thus
      incrementing the other bank of counters.  This eliminates the possibility
      of destructive counter overflow as long as the srcu_read_lock() nesting
      level does not exceed floor(ULONG_MAX/NR_CPUS/2), which should be an
      eminently reasonable nesting limit, especially on 64-bit systems.
      Reported-by: NLance Roy <ldr709@gmail.com>
      Suggested-by: NLance Roy <ldr709@gmail.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      881ec9d2
    • P
      rcu: Prevent rcu_barrier() from starting needless grace periods · f92c734f
      Paul E. McKenney 提交于
      Currently rcu_barrier() uses call_rcu() to enqueue new callbacks
      on each CPU with a non-empty callback list.  This works, but means
      that rcu_barrier() forces grace periods that are not otherwise needed.
      The key point is that rcu_barrier() never needs to wait for a grace
      period, but instead only for all pre-existing callbacks to be invoked.
      This means that rcu_barrier()'s new callbacks should be placed in
      the callback-list segment containing the last pre-existing callback.
      
      This commit makes this change using the new rcu_segcblist_entrain()
      function.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      f92c734f
    • P
      srcu: Allow use of Classic SRCU from both process and interrupt context · 1123a604
      Paolo Bonzini 提交于
      Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
      down a guest running iperf on a VFIO assigned device.  This happens
      because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
      context, while a worker thread does the same inside kvm_set_irq().  If the
      interrupt happens while the worker thread is executing __srcu_read_lock(),
      updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
      ->srcu_lock_count[] field can be lost.
      
      The docs say you are not supposed to call srcu_read_lock() and
      srcu_read_unlock() from irq context, but KVM interrupt injection happens
      from (host) interrupt context and it would be nice if SRCU supported the
      use case.  KVM is using SRCU here not really for the "sleepable" part,
      but rather due to its IPI-free fast detection of grace periods.  It is
      therefore not desirable to switch back to RCU, which would effectively
      revert commit 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
      2014-01-16).
      
      However, the docs are overly conservative.  You can have an SRCU instance
      only has users in irq context, and you can mix process and irq context
      as long as process context users disable interrupts.  In addition,
      __srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
      Classic SRCU.  For those two implementations, only srcu_read_lock()
      is unsafe.
      
      When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
      in commit 5a41344a ("srcu: Simplify __srcu_read_unlock() via
      this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
      Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
      the caller.  Tree SRCU however only does one increment, so on most
      architectures it is more efficient for __srcu_read_lock() to use
      this_cpu_inc(), and any performance differences appear to be down in
      the noise.
      
      Cc: stable@vger.kernel.org
      Fixes: 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
      Reported-by: NLinu Cherian <linuc.decode@gmail.com>
      Suggested-by: NLinu Cherian <linuc.decode@gmail.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      1123a604
    • P
      srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context · cdf7abc4
      Paolo Bonzini 提交于
      Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
      down a guest running iperf on a VFIO assigned device.  This happens
      because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
      context, while a worker thread does the same inside kvm_set_irq().  If the
      interrupt happens while the worker thread is executing __srcu_read_lock(),
      updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
      ->srcu_lock_count[] field can be lost.
      
      The docs say you are not supposed to call srcu_read_lock() and
      srcu_read_unlock() from irq context, but KVM interrupt injection happens
      from (host) interrupt context and it would be nice if SRCU supported the
      use case.  KVM is using SRCU here not really for the "sleepable" part,
      but rather due to its IPI-free fast detection of grace periods.  It is
      therefore not desirable to switch back to RCU, which would effectively
      revert commit 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
      2014-01-16).
      
      However, the docs are overly conservative.  You can have an SRCU instance
      only has users in irq context, and you can mix process and irq context
      as long as process context users disable interrupts.  In addition,
      __srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
      Classic SRCU.  For those two implementations, only srcu_read_lock()
      is unsafe.
      
      When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
      in commit 5a41344a ("srcu: Simplify __srcu_read_unlock() via
      this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
      Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
      the caller.  Tree SRCU however only does one increment, so on most
      architectures it is more efficient for __srcu_read_lock() to use
      this_cpu_inc(), and any performance differences appear to be down in
      the noise.
      
      Unlike Classic and Tree SRCU, Tiny SRCU does increments and decrements on
      a single variable.  Therefore, as Peter Zijlstra pointed out, Tiny SRCU's
      implementation already supports mixed-context use of srcu_read_lock()
      and srcu_read_unlock(), at least as long as uses of srcu_read_lock()
      and srcu_read_unlock() in each handler are nested and paired properly.
      In other words, it is still illegal to (say) invoke srcu_read_lock()
      in an interrupt handler and to invoke the matching srcu_read_unlock()
      in a softirq handler.  Therefore, the only change required for Tiny SRCU
      is to its comments.
      
      Fixes: 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
      Reported-by: NLinu Cherian <linuc.decode@gmail.com>
      Suggested-by: NLinu Cherian <linuc.decode@gmail.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: NPaolo Bonzini <pbonzini@redhat.com>
      cdf7abc4
  2. 27 5月, 2017 3 次提交
  3. 26 5月, 2017 3 次提交
    • D
      bpf: fix wrong exposure of map_flags into fdinfo for lpm · a316338c
      Daniel Borkmann 提交于
      trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
      attr->map_flags, since it does not support preallocation yet. We
      check the flag, but we never copy the flag into trie->map.map_flags,
      which is later on exposed into fdinfo and used by loaders such as
      iproute2. Latter uses this in bpf_map_selfcheck_pinned() to test
      whether a pinned map has the same spec as the one from the BPF obj
      file and if not, bails out, which is currently the case for lpm
      since it exposes always 0 as flags.
      
      Also copy over flags in array_map_alloc() and stack_map_alloc().
      They always have to be 0 right now, but we should make sure to not
      miss to copy them over at a later point in time when we add actual
      flags for them to use.
      
      Fixes: b95a5c4d ("bpf: add a longest prefix match trie map implementation")
      Reported-by: NJarno Rajahalme <jarno@covalent.io>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a316338c
    • D
      bpf: properly reset caller saved regs after helper call and ld_abs/ind · a9789ef9
      Daniel Borkmann 提交于
      Currently, after performing helper calls, we clear all caller saved
      registers, that is r0 - r5 and fill r0 depending on struct bpf_func_proto
      specification. The way we reset these regs can affect pruning decisions
      in later paths, since we only reset register's imm to 0 and type to
      NOT_INIT. However, we leave out clearing of other variables such as id,
      min_value, max_value, etc, which can later on lead to pruning mismatches
      due to stale data.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9789ef9
    • D
      bpf: fix incorrect pruning decision when alignment must be tracked · 1ad2f583
      Daniel Borkmann 提交于
      Currently, when we enforce alignment tracking on direct packet access,
      the verifier lets the following program pass despite doing a packet
      write with unaligned access:
      
        0: (61) r2 = *(u32 *)(r1 +76)
        1: (61) r3 = *(u32 *)(r1 +80)
        2: (61) r7 = *(u32 *)(r1 +8)
        3: (bf) r0 = r2
        4: (07) r0 += 14
        5: (25) if r7 > 0x1 goto pc+4
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
        6: (2d) if r0 > r3 goto pc+1
         R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14)
         R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
        7: (63) *(u32 *)(r0 -4) = r0
        8: (b7) r0 = 0
        9: (95) exit
      
        from 6 to 8:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
        8: (b7) r0 = 0
        9: (95) exit
      
        from 5 to 10:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=2 R10=fp
        10: (07) r0 += 1
        11: (05) goto pc-6
        6: safe                           <----- here, wrongly found safe
        processed 15 insns
      
      However, if we enforce a pruning mismatch by adding state into r8
      which is then being mismatched in states_equal(), we find that for
      the otherwise same program, the verifier detects a misaligned packet
      access when actually walking that path:
      
        0: (61) r2 = *(u32 *)(r1 +76)
        1: (61) r3 = *(u32 *)(r1 +80)
        2: (61) r7 = *(u32 *)(r1 +8)
        3: (b7) r8 = 1
        4: (bf) r0 = r2
        5: (07) r0 += 14
        6: (25) if r7 > 0x1 goto pc+4
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        7: (2d) if r0 > r3 goto pc+1
         R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14)
         R3=pkt_end R7=inv,min_value=0,max_value=1
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        8: (63) *(u32 *)(r0 -4) = r0
        9: (b7) r0 = 0
        10: (95) exit
      
        from 7 to 9:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        9: (b7) r0 = 0
        10: (95) exit
      
        from 6 to 11:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=2
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        11: (07) r0 += 1
        12: (b7) r8 = 0
        13: (05) goto pc-7                <----- mismatch due to r8
        7: (2d) if r0 > r3 goto pc+1
         R0=pkt(id=0,off=15,r=15) R1=ctx R2=pkt(id=0,off=0,r=15)
         R3=pkt_end R7=inv,min_value=2
         R8=imm0,min_value=0,max_value=0,min_align=2147483648 R10=fp
        8: (63) *(u32 *)(r0 -4) = r0
        misaligned packet access off 2+15+-4 size 4
      
      The reason why we fail to see it in states_equal() is that the
      third test in compare_ptrs_to_packet() ...
      
        if (old->off <= cur->off &&
            old->off >= old->range && cur->off >= cur->range)
                return true;
      
      ... will let the above pass. The situation we run into is that
      old->off <= cur->off (14 <= 15), meaning that prior walked paths
      went with smaller offset, which was later used in the packet
      access after successful packet range check and found to be safe
      already.
      
      For example: Given is R0=pkt(id=0,off=0,r=0). Adding offset 14
      as in above program to it, results in R0=pkt(id=0,off=14,r=0)
      before the packet range test. Now, testing this against R3=pkt_end
      with 'if r0 > r3 goto out' will transform R0 into R0=pkt(id=0,off=14,r=14)
      for the case when we're within bounds. A write into the packet
      at offset *(u32 *)(r0 -4), that is, 2 + 14 -4, is valid and
      aligned (2 is for NET_IP_ALIGN). After processing this with
      all fall-through paths, we later on check paths from branches.
      When the above skb->mark test is true, then we jump near the
      end of the program, perform r0 += 1, and jump back to the
      'if r0 > r3 goto out' test we've visited earlier already. This
      time, R0 is of type R0=pkt(id=0,off=15,r=0), and we'll prune
      that part because this time we'll have a larger safe packet
      range, and we already found that with off=14 all further insn
      were already safe, so it's safe as well with a larger off.
      However, the problem is that the subsequent write into the packet
      with 2 + 15 -4 is then unaligned, and not caught by the alignment
      tracking. Note that min_align, aux_off, and aux_off_align were
      all 0 in this example.
      
      Since we cannot tell at this time what kind of packet access was
      performed in the prior walk and what minimal requirements it has
      (we might do so in the future, but that requires more complexity),
      fix it to disable this pruning case for strict alignment for now,
      and let the verifier do check such paths instead. With that applied,
      the test cases pass and reject the program due to misalignment.
      
      Fixes: d1174416 ("bpf: Track alignment of register values in the verifier.")
      Reference: http://patchwork.ozlabs.org/patch/761909/Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ad2f583
  4. 24 5月, 2017 1 次提交
  5. 23 5月, 2017 4 次提交
    • E
      ptrace: Properly initialize ptracer_cred on fork · c70d9d80
      Eric W. Biederman 提交于
      When I introduced ptracer_cred I failed to consider the weirdness of
      fork where the task_struct copies the old value by default.  This
      winds up leaving ptracer_cred set even when a process forks and
      the child process does not wind up being ptraced.
      
      Because ptracer_cred is not set on non-ptraced processes whose
      parents were ptraced this has broken the ability of the enlightenment
      window manager to start setuid children.
      
      Fix this by properly initializing ptracer_cred in ptrace_init_task
      
      This must be done with a little bit of care to preserve the current value
      of ptracer_cred when ptrace carries through fork.  Re-reading the
      ptracer_cred from the ptracing process at this point is inconsistent
      with how PT_PTRACE_CAP has been maintained all of these years.
      Tested-by: NTakashi Iwai <tiwai@suse.de>
      Fixes: 64b875f7 ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      c70d9d80
    • V
      kthread: Fix use-after-free if kthread fork fails · 4d6501dc
      Vegard Nossum 提交于
      If a kthread forks (e.g. usermodehelper since commit 1da5c46f) but
      fails in copy_process() between calling dup_task_struct() and setting
      p->set_child_tid, then the value of p->set_child_tid will be inherited
      from the parent and get prematurely freed by free_kthread_struct().
      
          kthread()
           - worker_thread()
              - process_one_work()
              |  - call_usermodehelper_exec_work()
              |     - kernel_thread()
              |        - _do_fork()
              |           - copy_process()
              |              - dup_task_struct()
              |                 - arch_dup_task_struct()
              |                    - tsk->set_child_tid = current->set_child_tid // implied
              |              - ...
              |              - goto bad_fork_*
              |              - ...
              |              - free_task(tsk)
              |                 - free_kthread_struct(tsk)
              |                    - kfree(tsk->set_child_tid)
              - ...
              - schedule()
                 - __schedule()
                    - wq_worker_sleeping()
                       - kthread_data(task)->flags // UAF
      
      The problem started showing up with commit 1da5c46f since it reused
      ->set_child_tid for the kthread worker data.
      
      A better long-term solution might be to get rid of the ->set_child_tid
      abuse. The comment in set_kthread_struct() also looks slightly wrong.
      Debugged-by: NJamie Iles <jamie.iles@oracle.com>
      Fixes: 1da5c46f ("kthread: Make struct kthread kmalloc'ed")
      Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jamie Iles <jamie.iles@oracle.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170509073959.17858-1-vegard.nossum@oracle.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      4d6501dc
    • P
      futex,rt_mutex: Fix rt_mutex_cleanup_proxy_lock() · 04dc1b2f
      Peter Zijlstra 提交于
      Markus reported that the glibc/nptl/tst-robustpi8 test was failing after
      commit:
      
        cfafcd11 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")
      
      The following trace shows the problem:
      
       ld-linux-x86-64-2161  [019] ....   410.760971: SyS_futex: 00007ffbeb76b028: 80000875  op=FUTEX_LOCK_PI
       ld-linux-x86-64-2161  [019] ...1   410.760972: lock_pi_update_atomic: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000875 ret=0
       ld-linux-x86-64-2165  [011] ....   410.760978: SyS_futex: 00007ffbeb76b028: 80000875  op=FUTEX_UNLOCK_PI
       ld-linux-x86-64-2165  [011] d..1   410.760979: do_futex: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000871 ret=0
       ld-linux-x86-64-2165  [011] ....   410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=0000
       ld-linux-x86-64-2161  [019] ....   410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=ETIMEDOUT
      
      Task 2165 does an UNLOCK_PI, assigning the lock to the waiter task 2161
      which then returns with -ETIMEDOUT. That wrecks the lock state, because now
      the owner isn't aware it acquired the lock and removes the pending robust
      list entry.
      
      If 2161 is killed, the robust list will not clear out this futex and the
      subsequent acquire on this futex will then (correctly) result in -ESRCH
      which is unexpected by glibc, triggers an internal assertion and dies.
      
      Task 2161			Task 2165
      
      rt_mutex_wait_proxy_lock()
         timeout();
         /* T2161 is still queued in  the waiter list */
         return -ETIMEDOUT;
      
      				futex_unlock_pi()
      				spin_lock(hb->lock);
      				rtmutex_unlock()
      				  remove_rtmutex_waiter(T2161);
      				   mark_lock_available();
      				/* Make the next waiter owner of the user space side */
      				futex_uval = 2161;
      				spin_unlock(hb->lock);
      spin_lock(hb->lock);
      rt_mutex_cleanup_proxy_lock()
        if (rtmutex_owner() !== current)
           ...
           return FAIL;
      ....
      return -ETIMEOUT;
      
      This means that rt_mutex_cleanup_proxy_lock() needs to call
      try_to_take_rt_mutex() so it can take over the rtmutex correctly which was
      assigned by the waker. If the rtmutex is owned by some other task then this
      call is harmless and just confirmes that the waiter is not able to acquire
      it.
      
      While there, fix what looks like a merge error which resulted in
      rt_mutex_cleanup_proxy_lock() having two calls to
      fixup_rt_mutex_waiters() and rt_mutex_wait_proxy_lock() not having any.
      Both should have one, since both potentially touch the waiter list.
      
      Fixes: 38d589f2 ("futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()")
      Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de>
      Bug-Spotted-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Link: http://lkml.kernel.org/r/20170519154850.mlomgdsd26drq5j6@hirez.programming.kicks-ass.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      04dc1b2f
    • D
      net: Make IP alignment calulations clearer. · e4eda884
      David S. Miller 提交于
      The assignmnet:
      
      	ip_align = strict ? 2 : NET_IP_ALIGN;
      
      in compare_pkt_ptr_alignment() trips up Coverity because we can only
      get to this code when strict is true, therefore ip_align will always
      be 2 regardless of NET_IP_ALIGN's value.
      
      So just assign directly to '2' and explain the situation in the
      comment above.
      Reported-by: N"Gustavo A. R. Silva" <garsilva@embeddedor.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e4eda884
  6. 19 5月, 2017 2 次提交
  7. 18 5月, 2017 4 次提交
    • D
      bpf: adjust verifier heuristics · 3c2ce60b
      Daniel Borkmann 提交于
      Current limits with regards to processing program paths do not
      really reflect today's needs anymore due to programs becoming
      more complex and verifier smarter, keeping track of more data
      such as const ALU operations, alignment tracking, spilling of
      PTR_TO_MAP_VALUE_ADJ registers, and other features allowing for
      smarter matching of what LLVM generates.
      
      This also comes with the side-effect that we result in fewer
      opportunities to prune search states and thus often need to do
      more work to prove safety than in the past due to different
      register states and stack layout where we mismatch. Generally,
      it's quite hard to determine what caused a sudden increase in
      complexity, it could be caused by something as trivial as a
      single branch somewhere at the beginning of the program where
      LLVM assigned a stack slot that is marked differently throughout
      other branches and thus causing a mismatch, where verifier
      then needs to prove safety for the whole rest of the program.
      Subsequently, programs with even less than half the insn size
      limit can get rejected. We noticed that while some programs
      load fine under pre 4.11, they get rejected due to hitting
      limits on more recent kernels. We saw that in the vast majority
      of cases (90+%) pruning failed due to register mismatches. In
      case of stack mismatches, majority of cases failed due to
      different stack slot types (invalid, spill, misc) rather than
      differences in spilled registers.
      
      This patch makes pruning more aggressive by also adding markers
      that sit at conditional jumps as well. Currently, we only mark
      jump targets for pruning. For example in direct packet access,
      these are usually error paths where we bail out. We found that
      adding these markers, it can reduce number of processed insns
      by up to 30%. Another option is to ignore reg->id in probing
      PTR_TO_MAP_VALUE_OR_NULL registers, which can help pruning
      slightly as well by up to 7% observed complexity reduction as
      stand-alone. Meaning, if a previous path with register type
      PTR_TO_MAP_VALUE_OR_NULL for map X was found to be safe, then
      in the current state a PTR_TO_MAP_VALUE_OR_NULL register for
      the same map X must be safe as well. Last but not least the
      patch also adds a scheduling point and bumps the current limit
      for instructions to be processed to a more adequate value.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c2ce60b
    • S
      kprobes: Document how optimized kprobes are removed from module unload · 545a0281
      Steven Rostedt (VMware) 提交于
      Thomas discovered a bug where the kprobe trace tests had a race
      condition where the kprobe_optimizer called from a delayed work queue
      that does the optimizing and "unoptimizing" of a kprobe, can try to
      modify the text after it has been freed by the init code.
      
      The kprobe trace selftest is a special case, and Thomas and myself
      investigated to see if there's a chance that this could also be a bug
      with module unloading, as the code is not obvious to how it handles
      this. After adding lots of printks, I figured it out. Thomas suggested
      that this should be commented so that others will not have to go
      through this exercise again.
      
      Link: http://lkml.kernel.org/r/20170516145835.3827d3aa@gandalf.local.homeAcked-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Suggested-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      545a0281
    • S
      ftrace: Remove #ifdef from code and add clear_ftrace_function_probes() stub · 8a49f3e0
      Steven Rostedt (VMware) 提交于
      No need to add ugly #ifdefs in the code. Having a standard stub file is much
      prettier.
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      8a49f3e0
    • N
      ftrace/instances: Clear function triggers when removing instances · a0e6369e
      Naveen N. Rao 提交于
      If instance directories are deleted while there are registered function
      triggers:
      
        # cd /sys/kernel/debug/tracing/instances
        # mkdir test
        # echo "schedule:enable_event:sched:sched_switch" > test/set_ftrace_filter
        # rmdir test
        Unable to handle kernel paging request for data at address 0x00000008
        Unable to handle kernel paging request for data at address 0x00000008
        Faulting instruction address: 0xc0000000021edde8
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048
        NUMA
        pSeries
        Modules linked in: iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun bridge stp llc kvm iptable_filter fuse binfmt_misc pseries_rng rng_core vmx_crypto ib_iser rdma_cm iw_cm ib_cm ib_core libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c multipath virtio_net virtio_blk virtio_pci crc32c_vpmsum virtio_ring virtio
        CPU: 8 PID: 8694 Comm: rmdir Not tainted 4.11.0-nnr+ #113
        task: c0000000bab52800 task.stack: c0000000baba0000
        NIP: c0000000021edde8 LR: c0000000021f0590 CTR: c000000002119620
        REGS: c0000000baba3870 TRAP: 0300   Not tainted  (4.11.0-nnr+)
        MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE>
          CR: 22002422  XER: 20000000
        CFAR: 00007fffabb725a8 DAR: 0000000000000008 DSISR: 40000000 SOFTE: 0
        GPR00: c00000000220f750 c0000000baba3af0 c000000003157e00 0000000000000000
        GPR04: 0000000000000040 00000000000000eb 0000000000000040 0000000000000000
        GPR08: 0000000000000000 0000000000000113 0000000000000000 c00000000305db98
        GPR12: c000000002119620 c00000000fd42c00 0000000000000000 0000000000000000
        GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR20: 0000000000000000 0000000000000000 c0000000bab52e90 0000000000000000
        GPR24: 0000000000000000 00000000000000eb 0000000000000040 c0000000baba3bb0
        GPR28: c00000009cb06eb0 c0000000bab52800 c00000009cb06eb0 c0000000baba3bb0
        NIP [c0000000021edde8] ring_buffer_lock_reserve+0x8/0x4e0
        LR [c0000000021f0590] trace_event_buffer_lock_reserve+0xe0/0x1a0
        Call Trace:
        [c0000000baba3af0] [c0000000021f96c8] trace_event_buffer_commit+0x1b8/0x280 (unreliable)
        [c0000000baba3b60] [c00000000220f750] trace_event_buffer_reserve+0x80/0xd0
        [c0000000baba3b90] [c0000000021196b8] trace_event_raw_event_sched_switch+0x98/0x180
        [c0000000baba3c10] [c0000000029d9980] __schedule+0x6e0/0xab0
        [c0000000baba3ce0] [c000000002122230] do_task_dead+0x70/0xc0
        [c0000000baba3d10] [c0000000020ea9c8] do_exit+0x828/0xd00
        [c0000000baba3dd0] [c0000000020eaf70] do_group_exit+0x60/0x100
        [c0000000baba3e10] [c0000000020eb034] SyS_exit_group+0x24/0x30
        [c0000000baba3e30] [c00000000200bcec] system_call+0x38/0x54
        Instruction dump:
        60000000 60420000 7d244b78 7f63db78 4bffaa09 393efff8 793e0020 39200000
        4bfffecc 60420000 3c4c00f7 3842a020 <81230008> 2f890000 409e02f0 a14d0008
        ---[ end trace b917b8985d0e650b ]---
        Unable to handle kernel paging request for data at address 0x00000008
        Faulting instruction address: 0xc0000000021edde8
        Unable to handle kernel paging request for data at address 0x00000008
        Faulting instruction address: 0xc0000000021edde8
        Faulting instruction address: 0xc0000000021edde8
      
      To address this, let's clear all registered function probes before
      deleting the ftrace instance.
      
      Link: http://lkml.kernel.org/r/c5f1ca624043690bd94642bb6bffd3f2fc504035.1494956770.git.naveen.n.rao@linux.vnet.ibm.comReported-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      a0e6369e