1. 18 Nov, 2017 7 commits
  2. 16 Nov, 2017 8 commits
  3. 15 Nov, 2017 1 commit
    • E
      bpf: fix lockdep splat · 89ad2fa3
      Eric Dumazet committed
      pcpu_freelist_pop() needs the same lockdep awareness as
      pcpu_freelist_populate() to avoid a false positive.
      
       [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
      
       switchto-defaul/12508 [HC0[0]:SC0[6]:HE0:SE0] is trying to acquire:
        (&htab->buckets[i].lock){......}, at: [<ffffffff9dc099cb>] __htab_percpu_map_update_elem+0x1cb/0x300
      
       and this task is already holding:
        (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}, at: [<ffffffff9e135848>] __dev_queue_xmit+0x868/0x1240
       which would create a new lock dependency:
        (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...} -> (&htab->buckets[i].lock){......}
      
       but this new dependency connects a SOFTIRQ-irq-safe lock:
        (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}
       ... which became SOFTIRQ-irq-safe at:
         [<ffffffff9db5931b>] __lock_acquire+0x42b/0x1f10
         [<ffffffff9db5b32c>] lock_acquire+0xbc/0x1b0
         [<ffffffff9da05e38>] _raw_spin_lock+0x38/0x50
         [<ffffffff9e135848>] __dev_queue_xmit+0x868/0x1240
         [<ffffffff9e136240>] dev_queue_xmit+0x10/0x20
         [<ffffffff9e1965d9>] ip_finish_output2+0x439/0x590
         [<ffffffff9e197410>] ip_finish_output+0x150/0x2f0
         [<ffffffff9e19886d>] ip_output+0x7d/0x260
         [<ffffffff9e19789e>] ip_local_out+0x5e/0xe0
         [<ffffffff9e197b25>] ip_queue_xmit+0x205/0x620
         [<ffffffff9e1b8398>] tcp_transmit_skb+0x5a8/0xcb0
         [<ffffffff9e1ba152>] tcp_write_xmit+0x242/0x1070
         [<ffffffff9e1baffc>] __tcp_push_pending_frames+0x3c/0xf0
         [<ffffffff9e1b3472>] tcp_rcv_established+0x312/0x700
         [<ffffffff9e1c1acc>] tcp_v4_do_rcv+0x11c/0x200
         [<ffffffff9e1c3dc2>] tcp_v4_rcv+0xaa2/0xc30
         [<ffffffff9e191107>] ip_local_deliver_finish+0xa7/0x240
         [<ffffffff9e191a36>] ip_local_deliver+0x66/0x200
         [<ffffffff9e19137d>] ip_rcv_finish+0xdd/0x560
         [<ffffffff9e191e65>] ip_rcv+0x295/0x510
         [<ffffffff9e12ff88>] __netif_receive_skb_core+0x988/0x1020
         [<ffffffff9e130641>] __netif_receive_skb+0x21/0x70
         [<ffffffff9e1306ff>] process_backlog+0x6f/0x230
         [<ffffffff9e132129>] net_rx_action+0x229/0x420
         [<ffffffff9da07ee8>] __do_softirq+0xd8/0x43d
         [<ffffffff9e282bcc>] do_softirq_own_stack+0x1c/0x30
         [<ffffffff9dafc2f5>] do_softirq+0x55/0x60
         [<ffffffff9dafc3a8>] __local_bh_enable_ip+0xa8/0xb0
         [<ffffffff9db4c727>] cpu_startup_entry+0x1c7/0x500
         [<ffffffff9daab333>] start_secondary+0x113/0x140
      
       to a SOFTIRQ-irq-unsafe lock:
        (&head->lock){+.+...}
       ... which became SOFTIRQ-irq-unsafe at:
       ...  [<ffffffff9db5971f>] __lock_acquire+0x82f/0x1f10
         [<ffffffff9db5b32c>] lock_acquire+0xbc/0x1b0
         [<ffffffff9da05e38>] _raw_spin_lock+0x38/0x50
         [<ffffffff9dc0b7fa>] pcpu_freelist_pop+0x7a/0xb0
         [<ffffffff9dc08b2c>] htab_map_alloc+0x50c/0x5f0
         [<ffffffff9dc00dc5>] SyS_bpf+0x265/0x1200
         [<ffffffff9e28195f>] entry_SYSCALL_64_fastpath+0x12/0x17
      
       other info that might help us debug this:
      
       Chain exists of:
         dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2 --> &htab->buckets[i].lock --> &head->lock
      
        Possible interrupt unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(&head->lock);
                                      local_irq_disable();
                                      lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2);
                                      lock(&htab->buckets[i].lock);
         <Interrupt>
           lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2);
      
        *** DEADLOCK ***
      
      Fixes: e19494ed ("bpf: introduce percpu_freelist")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 14 Nov, 2017 2 commits
    • Y
      bpf: change helper bpf_probe_read arg2 type to ARG_CONST_SIZE_OR_ZERO · 9c019e2b
      Yonghong Song committed
      The helper bpf_probe_read arg2 type is changed
      from ARG_CONST_SIZE to ARG_CONST_SIZE_OR_ZERO to permit
      a size-0 buffer. Together with the newer ARG_CONST_SIZE_OR_ZERO
      semantics, which allow a non-NULL buffer with size 0,
      this lets simpler bpf programs gain verifier acceptance.
      The previous commit, which changes the ARG_CONST_SIZE_OR_ZERO semantics,
      gives detailed examples.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Y
      bpf: improve verifier ARG_CONST_SIZE_OR_ZERO semantics · 9fd29c08
      Yonghong Song committed
      For helpers, the argument type ARG_CONST_SIZE_OR_ZERO permits the
      access size to be 0 when accessing the previous argument (arg).
      Right now, it requires the arg to be NULL when the size passed
      is 0 or could be 0. It also requires a non-NULL arg when the size
      is proved to be non-0.
      
      This patch changes the verifier's ARG_CONST_SIZE_OR_ZERO behavior
      such that, for a size that is 0 or possibly 0, the arg is no longer
      required to be NULL.
      
      There are a couple of reasons for this semantics change, and
      all of them intend to simplify user bpf programs, which
      may improve user experience and/or increase the chances of
      verifier acceptance. Together with the next patch, which
      changes the bpf_probe_read arg2 type from ARG_CONST_SIZE to
      ARG_CONST_SIZE_OR_ZERO, the following two examples, which
      currently fail the verifier, are able to get verifier acceptance.
      
      Example 1:
         unsigned long len = pend - pstart;
         len = len > MAX_PAYLOAD_LEN ? MAX_PAYLOAD_LEN : len;
         len &= MAX_PAYLOAD_LEN;
         bpf_probe_read(data->payload, len, pstart);
      
      It has no test for "len > 0", so it fails the verifier.
      Users may not be aware that they have to add this test.
      Converting the bpf_probe_read helper to take
      ARG_CONST_SIZE_OR_ZERO lets the above code pass
      the verifier.
      
      Example 2:
        Here is one example where llvm "messed up" the code and
        the verifier fails.
      
      ......
         unsigned long len = pend - pstart;
         if (len > 0 && len <= MAX_PAYLOAD_LEN)
           bpf_probe_read(data->payload, len, pstart);
      ......
      
      The compiler generates the following code and verifier fails:
      ......
      39: (79) r2 = *(u64 *)(r10 -16)
      40: (1f) r2 -= r8
      41: (bf) r1 = r2
      42: (07) r1 += -1
      43: (25) if r1 > 0xffe goto pc+3
        R0=inv(id=0) R1=inv(id=0,umax_value=4094,var_off=(0x0; 0xfff))
        R2=inv(id=0) R6=map_value(id=0,off=0,ks=4,vs=4095,imm=0) R7=inv(id=0)
        R8=inv(id=0) R9=inv0 R10=fp0
      44: (bf) r1 = r6
      45: (bf) r3 = r8
      46: (85) call bpf_probe_read#45
      R2 min value is negative, either use unsigned or 'var &= const'
      ......
      
      The compiler optimization is correct. If len = 0,
      len - 1 = 0xffffffffffffffff > 0xffe.  If len != 0, len - 1 will not wrap.
      So the single test "r1 > 0xffe" at insn #43, where r1 = len - 1, covers
      both "len > 0" and "len <= MAX_PAYLOAD_LEN".
      This, however, causes an issue in the verifier: the value range of arg2
      "r2" does not get refined properly, which leads to verification failure.
      
      Relaxing bpf_probe_read arg2 from ARG_CONST_SIZE to ARG_CONST_SIZE_OR_ZERO
      allows the following simplified code:
         unsigned long len = pend - pstart;
         if (len <= MAX_PAYLOAD_LEN)
           bpf_probe_read(data->payload, len, pstart);
      
      The llvm compiler will generate less complex code and the
      verifier is able to verify that the program is okay.
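
The folding described for Example 2, where one unsigned comparison stands in for the two-sided range test, can be checked in plain user-space C. This is only a sketch of the arithmetic, not verifier or compiler code. The function names are made up for illustration, and MAX_PAYLOAD_LEN = 0xfff is an assumption inferred from the 0xffe bound in the register dump.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: inferred from the "r1 > 0xffe" test in the dump. */
#define MAX_PAYLOAD_LEN 0xfffULL

/* The two-sided test as written in the source program. */
static bool range_check_plain(uint64_t len)
{
	return len > 0 && len <= MAX_PAYLOAD_LEN;
}

/*
 * The single comparison the compiler folds it into. When len == 0,
 * len - 1 wraps to 0xffffffffffffffff, which fails the bound, so the
 * "len > 0" half of the test comes for free.
 */
static bool range_check_folded(uint64_t len)
{
	return (len - 1) <= (MAX_PAYLOAD_LEN - 1);
}
```

Both functions accept exactly the lengths 1 through 0xfff. The catch the commit message points at is that the bound is proven on the temporary holding len - 1, not on the register later passed as the size argument.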
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 13 Nov, 2017 6 commits
  6. 12 Nov, 2017 3 commits
    • D
      timers: Add a function to start/reduce a timer · b24591e2
      David Howells committed
      Add a function, similar to mod_timer(), that will start a timer if it isn't
      running and will modify it if it is running and has an expiry time longer
      than the new time.  If the timer is running with an expiry time that's the
      same or sooner, no change is made.
      
      The function looks like:
      
      	int timer_reduce(struct timer_list *timer, unsigned long expires);
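
The start-or-reduce rule described above can be modeled in user space as follows. This is a minimal sketch, not the kernel implementation: `struct model_timer`, `tick_before()` and `model_timer_reduce()` are hypothetical stand-ins, there is no locking, and the return convention (1 if the timer was already pending, 0 if it had to be started) is an assumption borrowed from mod_timer().

```c
#include <stdbool.h>

/* Hypothetical user-space stand-in for struct timer_list. */
struct model_timer {
	bool pending;           /* is the timer currently queued? */
	unsigned long expires;  /* expiry time in ticks */
};

/* Wrap-safe "a is earlier than b", the same idea as time_before(). */
static bool tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Start the timer if it is idle; pull the expiry forward if the new
 * time is earlier; otherwise leave it untouched.
 * Returns 1 if the timer was already pending, 0 if it was started
 * (assumed convention, see the text above).
 */
static int model_timer_reduce(struct model_timer *t, unsigned long expires)
{
	if (!t->pending) {
		t->pending = true;
		t->expires = expires;
		return 0;
	}
	if (tick_before(expires, t->expires))
		t->expires = expires;
	return 1;
}
```

A later expiry never pushes the timer back, which is what lets several independent timeouts share one timer without comparing them against each other.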
      
      This can be used by code such as networking code to make it easier to share
      a timer for multiple timeouts.  For instance, in upcoming AF_RXRPC code,
      the rxrpc_call struct will maintain a number of timeouts:
      
      	unsigned long	ack_at;
      	unsigned long	resend_at;
      	unsigned long	ping_at;
      	unsigned long	expect_rx_by;
      	unsigned long	expect_req_by;
      	unsigned long	expect_term_by;
      
      each of which is set independently of the others.  With timer reduction
      available, when the code needs to set one of the timeouts, it only needs to
      look at that timeout and then call timer_reduce() to modify the timer,
      starting it or bringing it forward if necessary.  There is no need to refer
      to the other timeouts to see which is earliest and no need to take any lock
      other than, potentially, the timer lock inside timer_reduce().
      
      Note that this does not protect against concurrent invocations of any of
      the timer functions.
      
      As an example, the expect_rx_by timeout above, which terminates a call if
      we don't get a packet from the server within a certain time window, would
      be set something like this:
      
      	unsigned long now = jiffies;
      	unsigned long expect_rx_by = now + packet_receive_timeout;
      	WRITE_ONCE(call->expect_rx_by, expect_rx_by);
      	timer_reduce(&call->timer, expect_rx_by);
      
      The timer service code (which might, say, be in a work function) would then
      check all the timeouts to see which, if any, had triggered, and deal with
      those:
      
      	t = READ_ONCE(call->ack_at);
      	if (time_after_eq(now, t)) {
      		cmpxchg(&call->ack_at, t, now + MAX_JIFFY_OFFSET);
      		set_bit(RXRPC_CALL_EV_ACK, &call->events);
      	}
      
      and then restart the timer if necessary by finding the soonest timeout that
      hasn't yet passed and then calling timer_reduce().
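
The service step as a whole (flag every timeout that has passed, then pick the soonest one still in the future as the next expiry) could look roughly like this in a user-space model. The helper names are hypothetical, the timeouts are passed as a plain array rather than fields of rxrpc_call, and the cmpxchg() step that pushes an expired timeout far into the future is left out for brevity.

```c
#include <stdbool.h>
#include <stddef.h>

/* Wrap-safe "a is at or after b", the same idea as time_after_eq(). */
static bool tick_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/*
 * Scan n timeouts at time `now` (n must not exceed the number of bits
 * in unsigned int). Sets bit i in *events for each timeout that has
 * passed and stores the soonest timeout still in the future in *next.
 * Returns false if no timeout is left to wait for; *next is then
 * left unmodified.
 */
static bool model_service_timeouts(const unsigned long *timeouts, size_t n,
				   unsigned long now, unsigned int *events,
				   unsigned long *next)
{
	bool have_next = false;

	*events = 0;
	for (size_t i = 0; i < n; i++) {
		if (tick_after_eq(now, timeouts[i])) {
			*events |= 1u << i;
		} else if (!have_next || !tick_after_eq(timeouts[i], *next)) {
			*next = timeouts[i];
			have_next = true;
		}
	}
	return have_next;
}
```

When the function returns true, the caller would pass `*next` to timer_reduce() to re-arm the shared timer.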
      
      The disadvantage of doing things this way rather than comparing the timers
      each time and calling mod_timer() is that you *will* take timer events
      unless you can finish what you're doing and delete the timer in time.
      
      The advantage of doing things this way is that you don't need to use a lock
      to work out when the next timer should be set, other than the timer's own
      lock - which you might not have to take.
      
      [ tglx: Fixed weird formatting and adapted it to pending changes ]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: keyrings@vger.kernel.org
      Cc: linux-afs@lists.infradead.org
      Link: https://lkml.kernel.org/r/151023090769.23050.1801643667223880753.stgit@warthog.procyon.org.uk
    • A
      pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday() · df27067e
      Arnd Bergmann committed
      __getnstimeofday() is a rather odd interface, with a number of quirks:
      
      - The caller may come from NMI context, but the implementation is not NMI safe.
        One way to get there from NMI is:
      
            NMI handler:
              something bad
                panic()
                  kmsg_dump()
                    pstore_dump()
                       pstore_record_init()
                         __getnstimeofday()
      
      - The calling conventions are different from any other timekeeping functions,
        to deal with returning an error code during suspended timekeeping.
      
      Address the above issues by using a completely different method to get the
      time: ktime_get_real_fast_ns() is NMI safe and has a reasonable behavior
      when timekeeping is suspended: it returns the time at which it got
      suspended. As Thomas Gleixner explained, this is safe, as
      ktime_get_real_fast_ns() does not call into the clocksource driver that
      might be suspended.
      
      The result can easily be transformed into a timespec structure. Since
      ktime_get_real_fast_ns() was not exported to modules, add the export.
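
As a rough illustration of that transformation, a nanosecond count splits into whole seconds and leftover nanoseconds by division and remainder. The helper name `ns_to_ts` and the constant `MODEL_NSEC_PER_SEC` are made up for this user-space sketch; kernel code would use its own conversion helpers.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical constant for this sketch: nanoseconds in one second. */
#define MODEL_NSEC_PER_SEC 1000000000ULL

/* Split a nanosecond timestamp into seconds and the remainder. */
static struct timespec ns_to_ts(uint64_t ns)
{
	struct timespec ts;

	ts.tv_sec = (time_t)(ns / MODEL_NSEC_PER_SEC);
	ts.tv_nsec = (long)(ns % MODEL_NSEC_PER_SEC);
	return ts;
}
```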
      
      The pstore behavior for the suspended case changes slightly, as it now
      stores the timestamp at which timekeeping was suspended instead of storing
      a zero timestamp.
      
      This change does not address y2038-safety; that is the subject of a more
      complex follow-up patch.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Stephen Boyd <sboyd@codeaurora.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Colin Cross <ccross@android.com>
      Link: https://lkml.kernel.org/r/20171110152530.1926955-1-arnd@arndb.de
    • T
      irq/work: Use llist_for_each_entry_safe · d00a08cf
      Thomas Gleixner committed
      The llist_for_each_entry() loop in irq_work_run_list() is unsafe because
      once the work's PENDING bit is cleared, it can be requeued on another CPU.
      
      Use llist_for_each_entry_safe() instead.
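
The reason the _safe variant matters can be shown with an ordinary singly linked list in user space. This is a sketch only: `struct work_node`, `run_and_release()` and `run_list_safe()` are hypothetical, and clobbering the next pointer stands in for the node being requeued elsewhere once it has been released.

```c
#include <stddef.h>

/* Hypothetical stand-in for a queued work item. */
struct work_node {
	struct work_node *next;
	int runs;
};

/*
 * Stand-in for running a work item. After this returns, the node may
 * be reused by someone else, modeled here by clobbering its next
 * pointer.
 */
static void run_and_release(struct work_node *w)
{
	w->runs++;
	w->next = NULL;
}

/*
 * Safe walk: read the successor before the node is handed off, so the
 * loop never dereferences a node it no longer owns.
 */
static int run_list_safe(struct work_node *head)
{
	struct work_node *w, *next;
	int count = 0;

	for (w = head; w; w = next) {
		next = w->next;
		run_and_release(w);
		count++;
	}
	return count;
}
```

A loop that read `w->next` only after `run_and_release(w)` would stop after the first node in this model; in the kernel the equivalent mistake could follow a pointer into a different list.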
      
      Fixes: 16c0890d ("irq/work: Don't reinvent the wheel but use existing llist API")
      Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petri Latvala <petri.latvala@intel.com>
      Link: http://lkml.kernel.org/r/151027307351.14762.4611888896020658384@mail.alporthouse.com
  7. 11 Nov, 2017 13 commits