1. 18 November 2017, 14 commits
2. 16 November 2017, 8 commits
3. 15 November 2017, 1 commit
    • bpf: fix lockdep splat · 89ad2fa3
      Committed by Eric Dumazet
pcpu_freelist_pop() needs the same lockdep awareness as
pcpu_freelist_populate() to avoid a false positive.
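
A minimal sketch of the kind of change this implies, assuming the structure
layout from kernel/bpf/percpu_freelist.h (not necessarily the literal upstream
diff): taking the per-CPU head lock with interrupts disabled keeps &head->lock
softirq-consistent with the populate path. The lockdep report being addressed
follows below.

   struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *s)
   {
   	struct pcpu_freelist_head *head;
   	struct pcpu_freelist_node *node;
   	unsigned long flags;
   	int orig_cpu, cpu;

   	/* Take the per-CPU head lock with IRQs off, matching the context
   	 * in which pcpu_freelist_populate() takes it. */
   	local_irq_save(flags);
   	orig_cpu = cpu = raw_smp_processor_id();
   	while (1) {
   		head = per_cpu_ptr(s->freelist, cpu);
   		raw_spin_lock(&head->lock);
   		node = head->first;
   		if (node) {
   			head->first = node->next;
   			raw_spin_unlock(&head->lock);
   			local_irq_restore(flags);
   			return node;
   		}
   		raw_spin_unlock(&head->lock);
   		/* Try the next possible CPU's list before giving up. */
   		cpu = cpumask_next(cpu, cpu_possible_mask);
   		if (cpu >= nr_cpu_ids)
   			cpu = 0;
   		if (cpu == orig_cpu) {
   			local_irq_restore(flags);
   			return NULL;
   		}
   	}
   }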
      
       [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
      
       switchto-defaul/12508 [HC0[0]:SC0[6]:HE0:SE0] is trying to acquire:
        (&htab->buckets[i].lock){......}, at: [<ffffffff9dc099cb>] __htab_percpu_map_update_elem+0x1cb/0x300
      
       and this task is already holding:
 (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}, at: [<ffffffff9e135848>] __dev_queue_xmit+0x868/0x1240
       which would create a new lock dependency:
        (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...} -> (&htab->buckets[i].lock){......}
      
       but this new dependency connects a SOFTIRQ-irq-safe lock:
        (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}
       ... which became SOFTIRQ-irq-safe at:
         [<ffffffff9db5931b>] __lock_acquire+0x42b/0x1f10
         [<ffffffff9db5b32c>] lock_acquire+0xbc/0x1b0
         [<ffffffff9da05e38>] _raw_spin_lock+0x38/0x50
         [<ffffffff9e135848>] __dev_queue_xmit+0x868/0x1240
         [<ffffffff9e136240>] dev_queue_xmit+0x10/0x20
         [<ffffffff9e1965d9>] ip_finish_output2+0x439/0x590
         [<ffffffff9e197410>] ip_finish_output+0x150/0x2f0
         [<ffffffff9e19886d>] ip_output+0x7d/0x260
         [<ffffffff9e19789e>] ip_local_out+0x5e/0xe0
         [<ffffffff9e197b25>] ip_queue_xmit+0x205/0x620
         [<ffffffff9e1b8398>] tcp_transmit_skb+0x5a8/0xcb0
         [<ffffffff9e1ba152>] tcp_write_xmit+0x242/0x1070
         [<ffffffff9e1baffc>] __tcp_push_pending_frames+0x3c/0xf0
         [<ffffffff9e1b3472>] tcp_rcv_established+0x312/0x700
         [<ffffffff9e1c1acc>] tcp_v4_do_rcv+0x11c/0x200
         [<ffffffff9e1c3dc2>] tcp_v4_rcv+0xaa2/0xc30
         [<ffffffff9e191107>] ip_local_deliver_finish+0xa7/0x240
         [<ffffffff9e191a36>] ip_local_deliver+0x66/0x200
         [<ffffffff9e19137d>] ip_rcv_finish+0xdd/0x560
         [<ffffffff9e191e65>] ip_rcv+0x295/0x510
         [<ffffffff9e12ff88>] __netif_receive_skb_core+0x988/0x1020
         [<ffffffff9e130641>] __netif_receive_skb+0x21/0x70
         [<ffffffff9e1306ff>] process_backlog+0x6f/0x230
         [<ffffffff9e132129>] net_rx_action+0x229/0x420
         [<ffffffff9da07ee8>] __do_softirq+0xd8/0x43d
         [<ffffffff9e282bcc>] do_softirq_own_stack+0x1c/0x30
         [<ffffffff9dafc2f5>] do_softirq+0x55/0x60
         [<ffffffff9dafc3a8>] __local_bh_enable_ip+0xa8/0xb0
         [<ffffffff9db4c727>] cpu_startup_entry+0x1c7/0x500
         [<ffffffff9daab333>] start_secondary+0x113/0x140
      
       to a SOFTIRQ-irq-unsafe lock:
        (&head->lock){+.+...}
       ... which became SOFTIRQ-irq-unsafe at:
       ...  [<ffffffff9db5971f>] __lock_acquire+0x82f/0x1f10
         [<ffffffff9db5b32c>] lock_acquire+0xbc/0x1b0
         [<ffffffff9da05e38>] _raw_spin_lock+0x38/0x50
         [<ffffffff9dc0b7fa>] pcpu_freelist_pop+0x7a/0xb0
         [<ffffffff9dc08b2c>] htab_map_alloc+0x50c/0x5f0
         [<ffffffff9dc00dc5>] SyS_bpf+0x265/0x1200
         [<ffffffff9e28195f>] entry_SYSCALL_64_fastpath+0x12/0x17
      
       other info that might help us debug this:
      
       Chain exists of:
         dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2 --> &htab->buckets[i].lock --> &head->lock
      
        Possible interrupt unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(&head->lock);
                                      local_irq_disable();
                                      lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2);
                                      lock(&htab->buckets[i].lock);
         <Interrupt>
           lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2);
      
        *** DEADLOCK ***
      
      Fixes: e19494ed ("bpf: introduce percpu_freelist")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4. 14 November 2017, 2 commits
    • bpf: change helper bpf_probe_read arg2 type to ARG_CONST_SIZE_OR_ZERO · 9c019e2b
      Committed by Yonghong Song
The helper bpf_probe_read arg2 type is changed
from ARG_CONST_SIZE to ARG_CONST_SIZE_OR_ZERO to permit
a size-0 buffer. Together with the newer ARG_CONST_SIZE_OR_ZERO
semantics, which allow a non-NULL buffer with size 0,
this allows simpler bpf programs that still get verifier acceptance.
The previous commit, which changes the ARG_CONST_SIZE_OR_ZERO semantics,
has detailed examples.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: improve verifier ARG_CONST_SIZE_OR_ZERO semantics · 9fd29c08
      Committed by Yonghong Song
For helpers, the argument type ARG_CONST_SIZE_OR_ZERO permits the
access size to be 0 when accessing the previous argument (arg).
Right now, it requires the arg to be NULL when the size passed
is 0 or could be 0, and a non-NULL arg when the size
is proved to be non-zero.
      
This patch changes the verifier's ARG_CONST_SIZE_OR_ZERO behavior
so that when the size is 0 or could be 0, the arg is no longer
required to be NULL.
      
There are a couple of reasons for this semantics change, all of
which are intended to simplify user bpf programs, which may improve
the user experience and/or increase the chances of
verifier acceptance. Together with the next patch, which
changes the bpf_probe_read arg2 type from ARG_CONST_SIZE to
ARG_CONST_SIZE_OR_ZERO, the following two examples, which
currently fail the verifier, are able to get verifier acceptance.
      
      Example 1:
         unsigned long len = pend - pstart;
         len = len > MAX_PAYLOAD_LEN ? MAX_PAYLOAD_LEN : len;
         len &= MAX_PAYLOAD_LEN;
         bpf_probe_read(data->payload, len, pstart);
      
It does not have a test for "len > 0", so it fails the verifier.
Users may not be aware that they have to add this test.
Converting the bpf_probe_read helper to
ARG_CONST_SIZE_OR_ZERO helps the above code get
verifier acceptance.
      
      Example 2:
        Here is one example where llvm "messed up" the code and
        the verifier fails.
      
      ......
         unsigned long len = pend - pstart;
         if (len > 0 && len <= MAX_PAYLOAD_LEN)
           bpf_probe_read(data->payload, len, pstart);
      ......
      
The compiler generates the following code and the verifier fails:
      ......
      39: (79) r2 = *(u64 *)(r10 -16)
      40: (1f) r2 -= r8
      41: (bf) r1 = r2
      42: (07) r1 += -1
      43: (25) if r1 > 0xffe goto pc+3
        R0=inv(id=0) R1=inv(id=0,umax_value=4094,var_off=(0x0; 0xfff))
        R2=inv(id=0) R6=map_value(id=0,off=0,ks=4,vs=4095,imm=0) R7=inv(id=0)
        R8=inv(id=0) R9=inv0 R10=fp0
      44: (bf) r1 = r6
      45: (bf) r3 = r8
      46: (85) call bpf_probe_read#45
      R2 min value is negative, either use unsigned or 'var &= const'
      ......
      
The compiler optimization is correct. If r1 = 0,
r1 - 1 = 0xffffffffffffffff > 0xffe.  If r1 != 0, r1 - 1 will not wrap.
The check r1 > 0xffe at insn #43 can therefore capture
both "r1 > 0" and "len <= MAX_PAYLOAD_LEN".
This, however, causes an issue in the verifier, as the value range of arg2
"r2" does not get properly refined, which leads to verification failure.
      
Relaxing bpf_probe_read arg2 from ARG_CONST_SIZE to ARG_CONST_SIZE_OR_ZERO
allows the following simplified code:
         unsigned long len = pend - pstart;
         if (len <= MAX_PAYLOAD_LEN)
           bpf_probe_read(data->payload, len, pstart);
      
      The llvm compiler will generate less complex code and the
      verifier is able to verify that the program is okay.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
5. 13 November 2017, 6 commits
6. 12 November 2017, 3 commits
    • timers: Add a function to start/reduce a timer · b24591e2
      Committed by David Howells
      Add a function, similar to mod_timer(), that will start a timer if it isn't
      running and will modify it if it is running and has an expiry time longer
      than the new time.  If the timer is running with an expiry time that's the
      same or sooner, no change is made.
      
      The function looks like:
      
      	int timer_reduce(struct timer_list *timer, unsigned long expires);
      
      This can be used by code such as networking code to make it easier to share
      a timer for multiple timeouts.  For instance, in upcoming AF_RXRPC code,
      the rxrpc_call struct will maintain a number of timeouts:
      
      	unsigned long	ack_at;
      	unsigned long	resend_at;
      	unsigned long	ping_at;
      	unsigned long	expect_rx_by;
      	unsigned long	expect_req_by;
      	unsigned long	expect_term_by;
      
      each of which is set independently of the others.  With timer reduction
      available, when the code needs to set one of the timeouts, it only needs to
      look at that timeout and then call timer_reduce() to modify the timer,
      starting it or bringing it forward if necessary.  There is no need to refer
      to the other timeouts to see which is earliest and no need to take any lock
      other than, potentially, the timer lock inside timer_reduce().
      
Note that this does not protect against concurrent invocations of any of
the timer functions.
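
Conceptually, the new function behaves like the following sketch (a
simplification for illustration; the real implementation folds the check into
the timer core so it happens atomically rather than as two racy steps):

   /* Illustration only: start the timer, or pull its expiry earlier,
    * but never push it later. */
   int timer_reduce_sketch(struct timer_list *timer, unsigned long expires)
   {
   	if (timer_pending(timer) && time_before_eq(timer->expires, expires))
   		return 1;	/* already expires at or before the new time */
   	return mod_timer(timer, expires);
   }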
      
      As an example, the expect_rx_by timeout above, which terminates a call if
      we don't get a packet from the server within a certain time window, would
      be set something like this:
      
      	unsigned long now = jiffies;
      	unsigned long expect_rx_by = now + packet_receive_timeout;
      	WRITE_ONCE(call->expect_rx_by, expect_rx_by);
      	timer_reduce(&call->timer, expect_rx_by);
      
The timer service code (which might, say, be in a work function) would then
check all the timeouts to see which, if any, had triggered, and deal with
those:
      
      	t = READ_ONCE(call->ack_at);
      	if (time_after_eq(now, t)) {
      		cmpxchg(&call->ack_at, t, now + MAX_JIFFY_OFFSET);
      		set_bit(RXRPC_CALL_EV_ACK, &call->events);
      	}
      
      and then restart the timer if necessary by finding the soonest timeout that
      hasn't yet passed and then calling timer_reduce().
      
      The disadvantage of doing things this way rather than comparing the timers
      each time and calling mod_timer() is that you *will* take timer events
      unless you can finish what you're doing and delete the timer in time.
      
      The advantage of doing things this way is that you don't need to use a lock
      to work out when the next timer should be set, other than the timer's own
      lock - which you might not have to take.
      
[ tglx: Fixed weird formatting and adapted it to pending changes ]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: keyrings@vger.kernel.org
Cc: linux-afs@lists.infradead.org
Link: https://lkml.kernel.org/r/151023090769.23050.1801643667223880753.stgit@warthog.procyon.org.uk
    • pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday() · df27067e
      Committed by Arnd Bergmann
      __getnstimeofday() is a rather odd interface, with a number of quirks:
      
      - The caller may come from NMI context, but the implementation is not NMI safe,
        one way to get there from NMI is
      
            NMI handler:
              something bad
                panic()
                  kmsg_dump()
                    pstore_dump()
                       pstore_record_init()
                         __getnstimeofday()
      
      - The calling conventions are different from any other timekeeping functions,
        to deal with returning an error code during suspended timekeeping.
      
      Address the above issues by using a completely different method to get the
      time: ktime_get_real_fast_ns() is NMI safe and has a reasonable behavior
      when timekeeping is suspended: it returns the time at which it got
      suspended. As Thomas Gleixner explained, this is safe, as
      ktime_get_real_fast_ns() does not call into the clocksource driver that
      might be suspended.
      
      The result can easily be transformed into a timespec structure. Since
      ktime_get_real_fast_ns() was not exported to modules, add the export.
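
A hedged sketch of the resulting pattern (assuming the pstore_record keeps a
struct timespec 'time' field; not the literal upstream diff):

   /* NMI-safe; when timekeeping is suspended this returns the time at
    * which it was suspended. */
   u64 now_ns = ktime_get_real_fast_ns();

   record->time = ns_to_timespec(now_ns);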
      
      The pstore behavior for the suspended case changes slightly, as it now
      stores the timestamp at which timekeeping was suspended instead of storing
      a zero timestamp.
      
      This change is not addressing y2038-safety, that's subject to a more
      complex follow up patch.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Stephen Boyd <sboyd@codeaurora.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Colin Cross <ccross@android.com>
      Link: https://lkml.kernel.org/r/20171110152530.1926955-1-arnd@arndb.de
    • irq/work: Use llist_for_each_entry_safe · d00a08cf
      Committed by Thomas Gleixner
The llist_for_each_entry() loop in irq_work_run_list() is unsafe because
once the work's PENDING bit is cleared, the work can be requeued on another CPU.
      
      Use llist_for_each_entry_safe() instead.
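
A sketch of the safe iteration pattern (simplified from irq_work_run_list();
the flag handling is elided):

   struct irq_work *work, *tmp;
   struct llist_node *llnode = llist_del_all(list);

   /* 'tmp' caches the next node up front, so it stays valid even if the
    * current work is requeued on another CPU once its PENDING bit is
    * cleared inside work->func(). */
   llist_for_each_entry_safe(work, tmp, llnode, llnode) {
   	/* clear the PENDING bit, then call work->func(work) */
   }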
      
      Fixes: 16c0890d ("irq/work: Don't reinvent the wheel but use existing llist API")
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petri Latvala <petri.latvala@intel.com>
      Link: http://lkml.kernel.org/r/151027307351.14762.4611888896020658384@mail.alporthouse.com
7. 11 November 2017, 6 commits
    • bpf: Revert bpf_override_function() helper changes. · f3edacbd
      Committed by David S. Miller
      NACK'd by x86 maintainer.
Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add a bpf_override_function helper · dd0bb688
      Committed by Josef Bacik
Error injection is sloppy and very ad-hoc.  BPF could fill this niche
perfectly with its kprobe functionality.  We could make sure errors are
only triggered in specific call chains that we care about, in very
specific situations.  Accomplish this with the bpf_override_function
helper.  This will modify the probed function's return value to the
specified value and set the PC to an override function that simply
returns, bypassing the originally probed function.  This gives us a nice
clean way to implement systematic error injection for all of our code
paths.
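
As a purely illustrative sketch of the mechanism described above (the helper
and stub names here are hypothetical, not taken from the patch), the override
amounts to rewriting the saved registers at the kprobe site:

   /* Hypothetical illustration: force a return value and skip the body
    * of the probed function by pointing the saved PC at a stub that
    * just returns. */
   static void override_sketch(struct pt_regs *regs, unsigned long rc)
   {
   	regs_set_return_value(regs, rc);
   	instruction_pointer_set(regs, (unsigned long)just_return_stub);
   }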
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
    • kthread: zero the kthread data structure · e10237cc
      Committed by Shaohua Li
kthread() could bail out early before we initialize blkcg_css (if the
kthread is killed very early; see the xchg() statement in kthread()),
which confuses free_kthread_struct(). Instead of moving the blkcg_css
initialization earlier, we simply zero the whole 'self' data structure,
which doesn't add much overhead.
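
A minimal sketch of the idea (assuming the dynamically allocated 'self' in
kthread(); not necessarily the literal diff):

   /* Allocate the per-thread struct zeroed, so fields such as blkcg_css
    * are NULL even if kthread() bails out before initializing them. */
   self = kzalloc(sizeof(*self), GFP_KERNEL);
   if (!self) {
   	/* allocation-failure handling unchanged */
   }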
Reported-by: syzbot <syzkaller@googlegroups.com>
      Fixes: 05e3db95 ("kthread: add a mechanism to store cgroup info")
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blktrace: fix unlocked registration of tracepoints · a6da0024
      Committed by Jens Axboe
      We need to ensure that tracepoints are registered and unregistered
      with the users of them. The existing atomic count isn't enough for
      that. Add a lock around the tracepoints, so we serialize access
      to them.
      
      This fixes cases where we have multiple users setting up and
      tearing down tracepoints, like this:
      
CPU: 0 PID: 2995 Comm: syzkaller857118 Not tainted 4.14.0-rc5-next-20171018+ #36
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:16 [inline]
        dump_stack+0x194/0x257 lib/dump_stack.c:52
        panic+0x1e4/0x41c kernel/panic.c:183
        __warn+0x1c4/0x1e0 kernel/panic.c:546
        report_bug+0x211/0x2d0 lib/bug.c:183
        fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:177
        do_trap_no_signal arch/x86/kernel/traps.c:211 [inline]
        do_trap+0x260/0x390 arch/x86/kernel/traps.c:260
        do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:297
        do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:310
        invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
      RIP: 0010:tracepoint_add_func kernel/tracepoint.c:210 [inline]
      RIP: 0010:tracepoint_probe_register_prio+0x397/0x9a0 kernel/tracepoint.c:283
      RSP: 0018:ffff8801d1d1f6c0 EFLAGS: 00010293
      RAX: ffff8801d22e8540 RBX: 00000000ffffffef RCX: ffffffff81710f07
      RDX: 0000000000000000 RSI: ffffffff85b679c0 RDI: ffff8801d5f19818
      RBP: ffff8801d1d1f7c8 R08: ffffffff81710c10 R09: 0000000000000004
      R10: ffff8801d1d1f6b0 R11: 0000000000000003 R12: ffffffff817597f0
      R13: 0000000000000000 R14: 00000000ffffffff R15: ffff8801d1d1f7a0
        tracepoint_probe_register+0x2a/0x40 kernel/tracepoint.c:304
        register_trace_block_rq_insert include/trace/events/block.h:191 [inline]
        blk_register_tracepoints+0x1e/0x2f0 kernel/trace/blktrace.c:1043
        do_blk_trace_setup+0xa10/0xcf0 kernel/trace/blktrace.c:542
        blk_trace_setup+0xbd/0x180 kernel/trace/blktrace.c:564
        sg_ioctl+0xc71/0x2d90 drivers/scsi/sg.c:1089
        vfs_ioctl fs/ioctl.c:45 [inline]
        do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
        SYSC_ioctl fs/ioctl.c:700 [inline]
        SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
        entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x444339
      RSP: 002b:00007ffe05bb5b18 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 00000000006d66c0 RCX: 0000000000444339
      RDX: 000000002084cf90 RSI: 00000000c0481273 RDI: 0000000000000009
      RBP: 0000000000000082 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000206 R12: ffffffffffffffff
      R13: 00000000c0481273 R14: 0000000000000000 R15: 0000000000000000
      
since we can now run these in parallel. Ensure that the exported helpers
for doing this grab the queue trace mutex.
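
A hedged sketch of the serialization pattern this describes (symbol names are
illustrative, not necessarily the exact upstream ones):

   static DEFINE_MUTEX(blk_probe_mutex);
   static int blk_probes_ref;

   /* Register the block tracepoints for the first user, under the mutex,
    * so concurrent setup/teardown paths cannot race. */
   static void get_probe_ref(void)
   {
   	mutex_lock(&blk_probe_mutex);
   	if (++blk_probes_ref == 1)
   		blk_register_tracepoints();
   	mutex_unlock(&blk_probe_mutex);
   }

   static void put_probe_ref(void)
   {
   	mutex_lock(&blk_probe_mutex);
   	if (!--blk_probes_ref)
   		blk_unregister_tracepoints();
   	mutex_unlock(&blk_probe_mutex);
   }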
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blktrace: fix unlocked access to init/start-stop/teardown · 1f2cac10
      Committed by Jens Axboe
      sg.c calls into the blktrace functions without holding the proper queue
      mutex for doing setup, start/stop, or teardown.
      
      Add internal unlocked variants, and export the ones that do the proper
      locking.
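
A minimal sketch of the locked/unlocked split described here (assuming a
per-queue mutex such as q->blk_trace_mutex; not the literal diff):

   /* Internal variant: the caller is expected to hold the queue's
    * blktrace mutex already. */
   static int __blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
   			     struct block_device *bdev, char __user *arg);

   /* Exported variant: takes the mutex around the unlocked helper. */
   int blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
   		    struct block_device *bdev, char __user *arg)
   {
   	int ret;

   	mutex_lock(&q->blk_trace_mutex);
   	ret = __blk_trace_setup(q, name, dev, bdev, arg);
   	mutex_unlock(&q->blk_trace_mutex);

   	return ret;
   }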
      
      Fixes: 6da127ad ("blktrace: Add blktrace ioctls to SCSI generic devices")
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • audit: filter PATH records keyed on filesystem magic · 42d5e376
      Committed by Richard Guy Briggs
      Tracefs or debugfs were causing hundreds to thousands of PATH records to
      be associated with the init_module and finit_module SYSCALL records on a
      few modules when the following rule was in place for startup:
      	-a always,exit -F arch=x86_64 -S init_module -F key=mod-load
      
Provide a method to keep this large number of PATH records from
overwhelming the logs if they are not of interest.  Introduce a new
filter list "AUDIT_FILTER_FS", with a new field type AUDIT_FSTYPE,
which keys off the filesystem's 4-octet hexadecimal magic identifier to
filter the PATH records of specific filesystems.
      
      An example rule would look like:
      	-a never,filesystem -F fstype=0x74726163 -F key=ignore_tracefs
      	-a never,filesystem -F fstype=0x64626720 -F key=ignore_debugfs
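
For reference, those magic values correspond to the filesystem magic
constants defined in include/uapi/linux/magic.h:

   #define TRACEFS_MAGIC	0x74726163	/* "trac" in ASCII */
   #define DEBUGFS_MAGIC	0x64626720	/* "dbg " in ASCII */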
      
      Arguably the better way to address this issue is to disable tracefs and
      debugfs on boot from production systems.
      
      See: https://github.com/linux-audit/audit-kernel/issues/16
      See: https://github.com/linux-audit/audit-userspace/issues/8
Test case: https://github.com/linux-audit/audit-testsuite/issues/42
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
      [PM: fixed the whitespace damage in kernel/auditsc.c]
Signed-off-by: Paul Moore <paul@paul-moore.com>