1. 12 Aug 2015, 1 commit
  2. 21 Jul 2015, 12 commits
  3. 08 Jul 2015, 1 commit
    • tracing: Have branch tracer use recursive field of task struct · 6224beb1
      Committed by Steven Rostedt (Red Hat)
      Fengguang Wu's tests triggered a bug in the branch tracer's start-up
      test when CONFIG_DEBUG_PREEMPT is set. This is because that config
      adds some debug logic to the per-cpu field accesses, which call back
      into the branch tracer.
      
      The branch tracer has its own recursive checks, but uses a per cpu
      variable to implement it. If retrieving the per cpu variable calls
      back into the branch tracer, you can see how things will break.
      
      Instead of using a per cpu variable, use the trace_recursion field
      of the current task struct. Simply set a bit when entering the
      branch tracing and clear it when leaving. If the bit is set on
      entry, just don't do the tracing.
      
      There's also the lockdep case: the local_irq_save() called before the
      recursion check can itself trigger code that calls back into the
      tracer. Changing it to raw_local_irq_save() protects that path as
      well.
      
      This prevents the recursion and the inevitable crash that follows.
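
      As a rough sketch of the pattern described above (the function name and
      the way the bit is manipulated are illustrative, not the exact kernel
      code):

       /* Sketch: guard the branch-tracer body with a bit in the task's
        * trace_recursion field instead of a per-cpu variable.
        */
       static void probe_branch_condition(struct ftrace_branch_data *f,
                                          int val, int expect)
       {
               unsigned long flags;

               if (current->trace_recursion & (1UL << TRACE_BRANCH_BIT))
                       return;         /* already inside the branch tracer */
               current->trace_recursion |= (1UL << TRACE_BRANCH_BIT);

               /* raw_ variant so lockdep cannot pull us back in here */
               raw_local_irq_save(flags);
               /* ... record the branch hit/miss event ... */
               raw_local_irq_restore(flags);

               current->trace_recursion &= ~(1UL << TRACE_BRANCH_BIT);
       }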
      
      Link: http://lkml.kernel.org/r/20150630141803.GA28071@wfg-t540p.sh.intel.com
      
      Cc: stable@vger.kernel.org # 3.10+
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Tested-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  4. 26 Jun 2015, 5 commits
  5. 17 Jun 2015, 1 commit
    • tracing: Have filter check for balanced ops · 2cf30dc1
      Committed by Steven Rostedt
      When the following filter is used it causes a warning to trigger:
      
       # cd /sys/kernel/debug/tracing
       # echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
      -bash: echo: write error: Invalid argument
       # cat events/ext4/ext4_truncate_exit/filter
      ((dev==1)blocks==2)
      ^
      parse_error: No error
      
       ------------[ cut here ]------------
       WARNING: CPU: 2 PID: 1223 at kernel/trace/trace_events_filter.c:1640 replace_preds+0x3c5/0x990()
       Modules linked in: bnep lockd grace bluetooth  ...
       CPU: 3 PID: 1223 Comm: bash Tainted: G        W       4.1.0-rc3-test+ #450
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
        0000000000000668 ffff8800c106bc98 ffffffff816ed4f9 ffff88011ead0cf0
        0000000000000000 ffff8800c106bcd8 ffffffff8107fb07 ffffffff8136b46c
        ffff8800c7d81d48 ffff8800d4c2bc00 ffff8800d4d4f920 00000000ffffffea
       Call Trace:
        [<ffffffff816ed4f9>] dump_stack+0x4c/0x6e
        [<ffffffff8107fb07>] warn_slowpath_common+0x97/0xe0
        [<ffffffff8136b46c>] ? _kstrtoull+0x2c/0x80
        [<ffffffff8107fb6a>] warn_slowpath_null+0x1a/0x20
        [<ffffffff81159065>] replace_preds+0x3c5/0x990
        [<ffffffff811596b2>] create_filter+0x82/0xb0
        [<ffffffff81159944>] apply_event_filter+0xd4/0x180
        [<ffffffff81152bbf>] event_filter_write+0x8f/0x120
        [<ffffffff811db2a8>] __vfs_write+0x28/0xe0
        [<ffffffff811dda43>] ? __sb_start_write+0x53/0xf0
        [<ffffffff812e51e0>] ? security_file_permission+0x30/0xc0
        [<ffffffff811dc408>] vfs_write+0xb8/0x1b0
        [<ffffffff811dc72f>] SyS_write+0x4f/0xb0
        [<ffffffff816f5217>] system_call_fastpath+0x12/0x6a
       ---[ end trace e11028bd95818dcd ]---
      
      Worse yet, reading the error message back (the filter again), it says
      there was no error, when there clearly was. The issue is that the code
      that checks the input does not check for balanced ops, that is, that
      an operator sits between a closing parenthesis and the next token.

      This only causes a warning and fails out before doing any real harm,
      but it still should not cause a warning, and the reported error should
      be meaningful:
      
       # cd /sys/kernel/debug/tracing
       # echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
      -bash: echo: write error: Invalid argument
       # cat events/ext4/ext4_truncate_exit/filter
      ((dev==1)blocks==2)
      ^
      parse_error: Meaningless filter expression
      
      And give no kernel warning.
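
      The kind of check that catches this can be sketched as follows (purely
      illustrative; the real fix lives in the filter parser in
      kernel/trace/trace_events_filter.c and does not look like this):

       /* Illustrative sketch only: reject an operand that directly follows
        * a closing parenthesis, as in "((dev==1)blocks==2)".
        */
       static int check_op_after_close_paren(const char *filter)
       {
               const char *p = filter;
               const char *next;

               for (; *p; p++) {
                       if (*p != ')')
                               continue;
                       next = p + 1;
                       while (*next == ' ')
                               next++;
                       /* after ')' only an op, another ')' or end is valid */
                       if (*next && *next != ')' &&
                           *next != '&' && *next != '|')
                               return -EINVAL; /* "Meaningless filter expression" */
               }
               return 0;
       }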
      
      Link: http://lkml.kernel.org/r/20150615175025.7e809215@gandalf.local.home
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: stable@vger.kernel.org # 2.6.31+
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Tested-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  6. 16 Jun 2015, 4 commits
  7. 11 Jun 2015, 3 commits
  8. 01 Jun 2015, 1 commit
  9. 29 May 2015, 3 commits
    • ring-buffer: Add enum names for the context levels · a497adb4
      Committed by Steven Rostedt (Red Hat)
      Instead of using hard-coded numbers for the context levels, use an
      enum to describe them more clearly.
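
      The result is something along these lines (a sketch; treat the exact
      names and set of levels as illustrative of the ring buffer code, not a
      verbatim copy of it):

       /* Named context levels instead of bare numbers (illustrative). */
       enum {
               RB_CTX_NMI,
               RB_CTX_IRQ,
               RB_CTX_SOFTIRQ,
               RB_CTX_NORMAL,
               RB_CTX_MAX
       };
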
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • ring-buffer: Remove useless unused tracing_off_permanent() · 3c6296f7
      Committed by Steven Rostedt (Red Hat)
      The tracing_off_permanent() call is a way to disable all ring buffers.
      Nothing uses it and nothing should use it, as tracing_off() and
      friends are better: they disable only the ring buffers related to
      tracing. tracing_off_permanent() disabled even non-tracing ring
      buffers, which is a bit drastic. It was added to handle NMIs doing
      output that could corrupt the ring buffer back when only tracing used
      ring buffers. It is now obsolete and adds a little overhead, so it
      should be removed.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • ring-buffer: Give NMIs a chance to lock the reader_lock · 289a5a25
      Committed by Steven Rostedt (Red Hat)
      Currently, if an NMI does a dump of a ring buffer, it disables all
      ring buffers from ever doing any writes again. This is because it
      won't take the locks for the cpu_buffer, which can cause corruption
      if it preempted a read, or if a read happens on another CPU for the
      current cpu buffer. This is a bit of overkill.

      First, it should at least try to take the lock, and only if that
      fails disable the buffer. Also, there's no need to disable all ring
      buffers, even those that are unrelated to what is being read. Only
      disable the per-cpu ring buffer that is being read, and only if the
      lock for it cannot be taken.
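
      A sketch of the intended behaviour (the helper's name and exact shape
      are illustrative):

       /* Sketch: from NMI context never spin on the reader_lock; if the
        * trylock fails, disable writes to this one per-cpu buffer only.
        */
       static bool rb_reader_lock(struct ring_buffer_per_cpu *cpu_buffer)
       {
               if (likely(!in_nmi())) {
                       raw_spin_lock(&cpu_buffer->reader_lock);
                       return true;
               }

               if (raw_spin_trylock(&cpu_buffer->reader_lock))
                       return true;

               /* Could not get the lock: stop only this buffer's writers */
               atomic_inc(&cpu_buffer->record_disabled);
               return false;
       }
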
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  10. 27 May 2015, 3 commits
    • ring-buffer: Add trace_recursive checks to ring_buffer_write() · 985e871b
      Committed by Steven Rostedt (Red Hat)
      The ring_buffer_write() function isn't protected by the trace
      recursive-write checks. Luckily, this function is not used much and is
      unlikely to ever recurse. But it should still have the protection,
      because even a call to ring_buffer_lock_reserve() could cause ring
      buffer corruption if it happens while ring_buffer_write() is in use.
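
      Conceptually the write path simply gains the same guard the reserve
      path already has. A sketch, with trace_recursive_lock()/unlock()
      standing in for whatever form the internal recursion check takes:

       /* Sketch only, not the exact diff. */
       int ring_buffer_write(struct ring_buffer *buffer,
                             unsigned long length, void *data)
       {
               struct ring_buffer_per_cpu *cpu_buffer;
               int ret = -EBUSY;

               preempt_disable_notrace();
               cpu_buffer = buffer->buffers[raw_smp_processor_id()];

               if (trace_recursive_lock(cpu_buffer))
                       goto out;       /* already inside the ring buffer */

               /* ... reserve space, copy the data, commit the event ... */
               ret = 0;

               trace_recursive_unlock(cpu_buffer);
        out:
               preempt_enable_notrace();
               return ret;
       }
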
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • ring-buffer: Allways do the trace_recursive checks · 6776221b
      Committed by Steven Rostedt (Red Hat)
      Currently the trace_recursive checks are only done if CONFIG_TRACING
      is enabled. That was because there used to be a dependency on tracing
      for the recursive checks (they used the task_struct trace recursion
      variable). But now the ring buffer uses its own variable and there is
      no dependency.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • ring-buffer: Move recursive check to per_cpu descriptor · 58a09ec6
      Committed by Steven Rostedt (Red Hat)
      Instead of using a global per_cpu variable to perform the recursive
      checks into the ring buffer, use the already existing per_cpu descriptor
      that is part of the ring buffer itself.
      
      Not only does this simplify the code, it also allows one ring buffer
      to be used from within the guts of another ring buffer. For example,
      trace_printk() can now be used within a ring buffer to record changes
      done by an instance into the main ring buffer. The recursion checks
      will prevent trace_printk() itself from causing recursive issues with
      the main ring buffer (it is just ignored), but they won't prevent
      trace_printk() from recording to other ring buffers.
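
      Roughly, the recursion state moves from a global per_cpu variable into
      the descriptor of the buffer actually being written. A sketch, reusing
      the RB_CTX_* context-level names shown earlier in this log (field and
      helper shapes are illustrative):

       /* Sketch: each ring buffer's per-cpu descriptor carries its own
        * recursion bits, so nesting one ring buffer inside another is fine.
        */
       struct ring_buffer_per_cpu {
               /* ... existing fields ... */
               unsigned int            current_context;
       };

       static int trace_recursive_lock(struct ring_buffer_per_cpu *cpu_buffer)
       {
               unsigned int val = cpu_buffer->current_context;
               int bit = RB_CTX_NORMAL;

               if (in_nmi())
                       bit = RB_CTX_NMI;
               else if (in_irq())
                       bit = RB_CTX_IRQ;
               else if (in_softirq())
                       bit = RB_CTX_SOFTIRQ;

               if (val & (1 << bit))
                       return 1;       /* recursing on this buffer: drop it */

               cpu_buffer->current_context = val | (1 << bit);
               return 0;
       }
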
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  11. 22 May 2015, 2 commits
    • ring-buffer: Add unlikelys to make fast path the default · 3205f806
      Committed by Steven Rostedt (Red Hat)
      I was running the trace_event benchmark and noticed that the times to
      record a trace_event were all over the place. I looked at the assembly
      of ring_buffer_lock_reserve() and saw this:
      
       <ring_buffer_lock_reserve>:
             31 c0                   xor    %eax,%eax
             48 83 3d 76 47 bd 00    cmpq   $0x1,0xbd4776(%rip)        # ffffffff81d10d60 <ring_buffer_flags>
             01
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             75 1d                   jne    ffffffff8113c60d <ring_buffer_lock_reserve+0x2d>
             65 ff 05 69 e3 ec 7e    incl   %gs:0x7eece369(%rip)        # a960 <__preempt_count>
             8b 47 08                mov    0x8(%rdi),%eax
             85 c0                   test   %eax,%eax
       +---- 74 12                   je     ffffffff8113c610 <ring_buffer_lock_reserve+0x30>
       |     65 ff 0d 5b e3 ec 7e    decl   %gs:0x7eece35b(%rip)        # a960 <__preempt_count>
       |     0f 84 85 00 00 00       je     ffffffff8113c690 <ring_buffer_lock_reserve+0xb0>
       |     31 c0                   xor    %eax,%eax
       |     5d                      pop    %rbp
       |     c3                      retq
       |     90                      nop
       +---> 65 44 8b 05 48 e3 ec    mov    %gs:0x7eece348(%rip),%r8d        # a960 <__preempt_count>
             7e
             41 81 e0 ff ff ff 7f    and    $0x7fffffff,%r8d
             b0 08                   mov    $0x8,%al
             65 8b 0d 58 36 ed 7e    mov    %gs:0x7eed3658(%rip),%ecx        # fc80 <current_context>
             41 f7 c0 00 ff 1f 00    test   $0x1fff00,%r8d
             74 1e                   je     ffffffff8113c64f <ring_buffer_lock_reserve+0x6f>
             41 f7 c0 00 00 10 00    test   $0x100000,%r8d
             b0 01                   mov    $0x1,%al
             75 13                   jne    ffffffff8113c64f <ring_buffer_lock_reserve+0x6f>
             41 81 e0 00 00 0f 00    and    $0xf0000,%r8d
             49 83 f8 01             cmp    $0x1,%r8
             19 c0                   sbb    %eax,%eax
             83 e0 02                and    $0x2,%eax
             83 c0 02                add    $0x2,%eax
             85 c8                   test   %ecx,%eax
             75 ab                   jne    ffffffff8113c5fe <ring_buffer_lock_reserve+0x1e>
             09 c8                   or     %ecx,%eax
             65 89 05 24 36 ed 7e    mov    %eax,%gs:0x7eed3624(%rip)        # fc80 <current_context>
      
      The arrow is the fast path.
      
      After adding the unlikely()s, the fast path looks a bit better:
      
       <ring_buffer_lock_reserve>:
             31 c0                   xor    %eax,%eax
             48 83 3d 76 47 bd 00    cmpq   $0x1,0xbd4776(%rip)        # ffffffff81d10d60 <ring_buffer_flags>
             01
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             75 7b                   jne    ffffffff8113c66b <ring_buffer_lock_reserve+0x8b>
             65 ff 05 69 e3 ec 7e    incl   %gs:0x7eece369(%rip)        # a960 <__preempt_count>
             8b 47 08                mov    0x8(%rdi),%eax
             85 c0                   test   %eax,%eax
             0f 85 9f 00 00 00       jne    ffffffff8113c6a1 <ring_buffer_lock_reserve+0xc1>
             65 8b 0d 57 e3 ec 7e    mov    %gs:0x7eece357(%rip),%ecx        # a960 <__preempt_count>
             81 e1 ff ff ff 7f       and    $0x7fffffff,%ecx
             b0 08                   mov    $0x8,%al
             65 8b 15 68 36 ed 7e    mov    %gs:0x7eed3668(%rip),%edx        # fc80 <current_context>
             f7 c1 00 ff 1f 00       test   $0x1fff00,%ecx
             75 50                   jne    ffffffff8113c670 <ring_buffer_lock_reserve+0x90>
             85 d0                   test   %edx,%eax
             75 7d                   jne    ffffffff8113c6a1 <ring_buffer_lock_reserve+0xc1>
             09 d0                   or     %edx,%eax
             65 89 05 53 36 ed 7e    mov    %eax,%gs:0x7eed3653(%rip)        # fc80 <current_context>
             65 8b 05 fc da ec 7e    mov    %gs:0x7eecdafc(%rip),%eax        # a130 <cpu_number>
             89 c2                   mov    %eax,%edx
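
      The source change itself is just annotating the slow-path conditions,
      roughly like this (a sketch of the kind of checks at the top of
      ring_buffer_lock_reserve(), not the exact diff):

       /* Sketch: mark the bail-out branches as unlikely so the compiler
        * lays out the common case as straight-line code.
        */
       if (unlikely(atomic_read(&buffer->record_disabled)))
               goto out;

       cpu = raw_smp_processor_id();

       if (unlikely(!cpumask_test_cpu(cpu, buffer->cpumask)))
               goto out;

       cpu_buffer = buffer->buffers[cpu];

       if (unlikely(atomic_read(&cpu_buffer->record_disabled)))
               goto out;

       if (unlikely(length > BUF_MAX_DATA_SIZE))
               goto out;
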
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • bpf: allow bpf programs to tail-call other bpf programs · 04fd61ab
      Committed by Alexei Starovoitov
      Introduce the bpf_tail_call(ctx, &jmp_table, index) helper function,
      which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
      The important detail is that this is not a normal call, but a tail
      call. The kernel stack is precious, so this helper reuses the current
      stack frame and jumps into another BPF program without adding an
      extra call frame.
      It's trivially done in the interpreter and a bit trickier in JITs.
      In the case of the x64 JIT, the bigger part of the generated assembler
      prologue is common to all programs, so it is simply skipped while
      jumping. Other JITs can do a similar prologue-skipping optimization
      or do a stack unwind before jumping into the next program.
      
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
      Since all BPF programs are identified by file descriptor, user space
      needs to populate the jmp_table with FDs of other BPF programs.
      If jmp_table[index] is empty, bpf_tail_call() doesn't jump anywhere
      and program execution continues as normal.
      
      The new BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user
      space can populate this jmp_table array with FDs of other bpf
      programs. Programs can share the same jmp_table array or use multiple
      jmp_tables.
      
      The chain of tail calls can form unpredictable dynamic loops,
      therefore tail_call_cnt is used to limit the number of calls; it is
      currently set to 32.
      
      Use cases:
      ==========
      - simplify complex programs by splitting them into a sequence of small programs
      
      - dispatch routine
        For tracing and future seccomp the program may be triggered on all system
        calls, but processing of syscall arguments will be different. It's more
        efficient to implement them as:
        int syscall_entry(struct seccomp_data *ctx)
        {
           bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
           ... default: process unknown syscall ...
        }
        int sys_write_event(struct seccomp_data *ctx) {...}
        int sys_read_event(struct seccomp_data *ctx) {...}
        syscall_jmp_table[__NR_write] = sys_write_event;
        syscall_jmp_table[__NR_read] = sys_read_event;
      
        For networking the program may call into different parsers depending on
        packet format, like:
        int packet_parser(struct __sk_buff *skb)
        {
           ... parse L2, L3 here ...
           __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
           bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
           ... default: process unknown protocol ...
        }
        int parse_tcp(struct __sk_buff *skb) {...}
        int parse_udp(struct __sk_buff *skb) {...}
        ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
        ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
      
      - for the TC use case, bpf_tail_call() allows implementing reclassify-like logic
      
      - bpf_map_update_elem/delete calls into a BPF_MAP_TYPE_PROG_ARRAY jump
        table are atomic, so user space can build chains of BPF programs on
        the fly
      
      Implementation details:
      =======================
      - high performance of bpf_tail_call() is the goal.
        It could have been implemented without JIT changes as a wrapper on
        top of the BPF_PROG_RUN() macro, but with two downsides:
        . all programs would have to pay a performance penalty for this
          feature, and the tail call itself would be slower, since a
          mandatory stack unwind, return and stack allocation would be done
          for every tail call.
        . the tail call would be limited to programs running with preemption
          disabled, since a generic 'void *ctx' doesn't have room for
          'tail_call_cnt', which would need to be either a global per_cpu
          variable accessed by the helper and by the wrapper, or a global
          variable protected by locks.
      
        In this implementation the x64 JIT bypasses the stack unwind and
        jumps into the callee program after its prologue.
      
      - bpf_prog_array_compatible() ensures that the prog_type of callee and
        caller is the same and that the JITed/non-JITed flag is the same,
        since calling a JITed program from a non-JITed one is invalid, as
        the stack frames are different. Similarly, calling a kprobe-type
        program from a socket-type program is invalid.
      
      - the jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse
        the 'map' abstraction, its user-space API and all of the verifier
        logic. It lives in the existing arraymap.c file, since several
        functions are shared with the regular array map.
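
      For illustration, a minimal user-space sketch of populating such a
      jump table through the raw bpf(2) syscall (error handling and the
      loading of the programs themselves are omitted; real code would
      normally go through library helpers):

       #include <linux/bpf.h>
       #include <string.h>
       #include <sys/syscall.h>
       #include <unistd.h>

       /* Create a BPF_MAP_TYPE_PROG_ARRAY with 'max_entries' slots. */
       static int prog_array_create(unsigned int max_entries)
       {
               union bpf_attr attr;

               memset(&attr, 0, sizeof(attr));
               attr.map_type    = BPF_MAP_TYPE_PROG_ARRAY;
               attr.key_size    = sizeof(__u32);
               attr.value_size  = sizeof(__u32);  /* values are program fds */
               attr.max_entries = max_entries;

               return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
       }

       /* Store the fd of an already-loaded program at jmp_table[index]. */
       static int prog_array_set(int map_fd, __u32 index, int prog_fd)
       {
               union bpf_attr attr;

               memset(&attr, 0, sizeof(attr));
               attr.map_fd = map_fd;
               attr.key    = (__u64)(unsigned long)&index;
               attr.value  = (__u64)(unsigned long)&prog_fd;
               attr.flags  = BPF_ANY;

               return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr,
                              sizeof(attr));
       }
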
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 14 May 2015, 4 commits