1. 11 Apr 2018, 1 commit
    • bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog · 3a38bb98
      Committed by Yonghong Song
      syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
      The error details:
        ======================================================
        WARNING: possible circular locking dependency detected
        4.16.0-rc7+ #3 Not tainted
        ------------------------------------------------------
        syz-executor7/24531 is trying to acquire lock:
         (bpf_event_mutex){+.+.}, at: [<000000008a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
      
        but task is already holding lock:
         (&mm->mmap_sem){++++}, at: [<0000000038768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (&mm->mmap_sem){++++}:
             __might_fault+0x13a/0x1d0 mm/memory.c:4571
             _copy_to_user+0x2c/0xc0 lib/usercopy.c:25
             copy_to_user include/linux/uaccess.h:155 [inline]
             bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694
             perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891
             _perf_ioctl kernel/events/core.c:4750 [inline]
             perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770
             vfs_ioctl fs/ioctl.c:46 [inline]
             do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
             SYSC_ioctl fs/ioctl.c:701 [inline]
             SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
             do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
             entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
        -> #0 (bpf_event_mutex){+.+.}:
             lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
             __mutex_lock_common kernel/locking/mutex.c:756 [inline]
             __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
             mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
             perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
             perf_event_free_bpf_prog kernel/events/core.c:8147 [inline]
             _free_event+0xbdb/0x10f0 kernel/events/core.c:4116
             put_event+0x24/0x30 kernel/events/core.c:4204
             perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172
             remove_vma+0xb4/0x1b0 mm/mmap.c:172
             remove_vma_list mm/mmap.c:2490 [inline]
             do_munmap+0x82a/0xdf0 mm/mmap.c:2731
             mmap_region+0x59e/0x15a0 mm/mmap.c:1646
             do_mmap+0x6c0/0xe00 mm/mmap.c:1483
             do_mmap_pgoff include/linux/mm.h:2223 [inline]
             vm_mmap_pgoff+0x1de/0x280 mm/util.c:355
             SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
             SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491
             SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
             SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91
             do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
             entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&mm->mmap_sem);
                                       lock(bpf_event_mutex);
                                       lock(&mm->mmap_sem);
          lock(bpf_event_mutex);
      
         *** DEADLOCK ***
        ======================================================
      
      The bug was introduced by commit f371b304 ("bpf/tracing: allow
      user space to query prog array on the same tp"), where copy_to_user,
      which requires mm->mmap_sem, is called while the bpf_event_mutex lock
      is held. At the same time, during perf_event file descriptor close,
      mm->mmap_sem is held first and the subsequent
      perf_event_detach_bpf_prog then needs the bpf_event_mutex lock.
      Such a scenario caused a deadlock.
      
      As suggested by Daniel, moving copy_to_user out of the
      bpf_event_mutex lock should fix the problem.
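
      A minimal sketch of the resulting pattern (the helper name and the
      variables here are illustrative, not the exact kernel code): gather the
      prog ids into a kernel-side buffer under bpf_event_mutex, drop the
      mutex, and only then call copy_to_user(), which may take mm->mmap_sem.

        u32 *ids;
        int ret;

        ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
        if (!ids)
                return -ENOMEM;

        mutex_lock(&bpf_event_mutex);
        /* copy prog ids into the kernel-side buffer only */
        ret = bpf_prog_array_copy_ids(event->tp_event->prog_array,
                                      ids, ids_len);   /* illustrative helper */
        mutex_unlock(&bpf_event_mutex);

        /* copy_to_user() may take mm->mmap_sem; bpf_event_mutex is no longer held */
        if (!ret && copy_to_user(uids, ids, ids_len * sizeof(u32)))
                ret = -EFAULT;
        kfree(ids);
        return ret;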
      
      Fixes: f371b304 ("bpf/tracing: allow user space to query prog array on the same tp")
      Reported-by: syzbot+dc5ca0e4c9bfafaf2bae@syzkaller.appspotmail.com
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      3a38bb98
  2. 31 Mar 2018, 1 commit
    • bpf: Check attach type at prog load time · 5e43f899
      Committed by Andrey Ignatov
      == The problem ==
      
      There are use-cases when a program of some type can be attached to
      multiple attach points and those attach points must have different
      permissions to access context or to call helpers.
      
      E.g., the context structure may have fields for both IPv4 and IPv6, but it
      doesn't make sense to read from / write to the IPv6 field when the attach
      point is somewhere in the IPv4 stack.

      The same applies to BPF helpers: it may make sense to call some helper from
      one attach point, but not from another, for the same prog type.
      
      == The solution ==
      
      Introduce an `expected_attach_type` field in `union bpf_attr` for the
      `BPF_PROG_LOAD` command. If the scenario described in "The problem" section
      is the case for some prog type, the field will be checked twice:
      
      1) At load time the prog type is checked to see if the attach type for it
         must be known to validate program permissions correctly. The prog will
         be rejected with EINVAL if that's the case and `expected_attach_type`
         is not specified or has an invalid value.

      2) At attach time `attach_type` is compared with `expected_attach_type`,
         if the prog type requires one, and, if they differ, the attach will
         be rejected with EINVAL.
      
      The `expected_attach_type` is now available as part of `struct bpf_prog`
      in both `bpf_verifier_ops->is_valid_access()` and
      `bpf_verifier_ops->get_func_proto()`, and can be used to check context
      accesses and calls to helpers, respectively.
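
      For illustration, a hedged user-space sketch of how the new field is
      supplied at load time (the surrounding variables, the raw-syscall style
      and the headers are assumptions for the example, not part of this commit):

        /* needs <linux/bpf.h>, <sys/syscall.h>, <unistd.h> */
        union bpf_attr attr = {};

        attr.prog_type            = prog_type;    /* a type that requires an attach type */
        attr.expected_attach_type = attach_type;  /* re-checked at attach time */
        attr.insns                = (__u64)(unsigned long)insns;
        attr.insn_cnt             = insn_cnt;
        attr.license              = (__u64)(unsigned long)"GPL";

        prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
        /* attaching prog_fd later with an attach_type different from the one
         * given above is then rejected with EINVAL */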
      
      Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and
      Daniel Borkmann <daniel@iogearbox.net> here:
      https://marc.info/?l=linux-netdev&m=152107378717201&w=2
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      5e43f899
  3. 29 Mar 2018, 1 commit
    • bpf: introduce BPF_RAW_TRACEPOINT · c4f6699d
      Committed by Alexei Starovoitov
      Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
      kernel internal arguments of the tracepoints in their raw form.
      
      From the bpf program's point of view, access to the arguments looks like:
      struct bpf_raw_tracepoint_args {
             __u64 args[0];
      };
      
      int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
      {
        // program can read args[N] where N depends on tracepoint
        // and statically verified at program load+attach time
      }
      
      The kprobe+bpf infrastructure allows programs to access function arguments.
      This feature allows programs to access raw tracepoint arguments.
      
      Similar to the proposed 'dynamic ftrace events', there are no ABI guarantees
      as to what the tracepoint arguments are and what their meaning is.
      The program needs to type-cast the args properly and use the bpf_probe_read()
      helper to access struct fields when an argument is a pointer.
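
      A hedged sketch of such a program for the xdp_exception tracepoint shown
      below (the SEC() annotation, headers and the argument layout are
      assumptions for the example; the layout carries no ABI guarantee):

        /* clang -target bpf; assumes <linux/bpf.h> and a bpf_helpers.h that
         * provides SEC() and bpf_probe_read() */
        SEC("raw_tracepoint/xdp_exception")
        int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
        {
                /* args[0] = dev, args[1] = xdp prog, args[2] = act (see TP_PROTO below) */
                struct net_device *dev = (struct net_device *)ctx->args[0];
                __u64 act = ctx->args[2];
                int ifindex = 0;

                /* dev is a kernel pointer: read its fields via bpf_probe_read() */
                bpf_probe_read(&ifindex, sizeof(ifindex), &dev->ifindex);
                /* ... use ifindex and act ... */
                return 0;
        }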
      
      For every tracepoint __bpf_trace_##call function is prepared.
      In assembler it looks like:
      (gdb) disassemble __bpf_trace_xdp_exception
      Dump of assembler code for function __bpf_trace_xdp_exception:
         0xffffffff81132080 <+0>:     mov    %ecx,%ecx
         0xffffffff81132082 <+2>:     jmpq   0xffffffff811231f0 <bpf_trace_run3>
      
      where
      
      TRACE_EVENT(xdp_exception,
              TP_PROTO(const struct net_device *dev,
                       const struct bpf_prog *xdp, u32 act),
      
      The above assembler snippet casts the 32-bit 'act' field into a 'u64'
      to pass into bpf_trace_run3(), while the 'dev' and 'xdp' args are passed as-is.
      All ~500 of the __bpf_trace_*() functions are only 5-10 bytes long,
      and in total this approach adds 7k bytes to .text.

      This approach gives the lowest possible overhead
      while calling trace_xdp_exception() from kernel C code and
      transitioning into bpf land.
      Since tracepoint+bpf is used at speeds of 1M+ events per second,
      this is a valuable optimization.
      
      The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
      that returns anon_inode FD of 'bpf-raw-tracepoint' object.
      
      The user space looks like:
      // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
      prog_fd = bpf_prog_load(...);
      // receive anon_inode fd for given bpf_raw_tracepoint with prog attached
      raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);
      
      Ctrl-C of tracing daemon or cmdline tool that uses this feature
      will automatically detach bpf program, unload it and
      unregister tracepoint probe.
      
      On the kernel side, the __bpf_raw_tp_map section of pointers to the
      tracepoint definition and to the __bpf_trace_*() probe function is used
      to find the tracepoint with the "xdp_exception" name and the
      corresponding __bpf_trace_xdp_exception() probe function,
      which are passed to tracepoint_probe_register() to connect the probe
      with the tracepoint.
      
      The addition of bpf_raw_tracepoint doesn't interfere with the ftrace and perf
      tracepoint mechanisms. perf_event_open() can be used in parallel
      on the same tracepoint.
      Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) calls are permitted,
      each with its own bpf program. The kernel will execute
      all tracepoint probes and all attached bpf programs.
      
      In the future bpf_raw_tracepoints can be extended with
      query/introspection logic.
      
      __bpf_raw_tp_map section logic was contributed by Steven Rostedt.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      c4f6699d
  4. 27 Mar 2018, 1 commit
  5. 26 Mar 2018, 1 commit
  6. 24 Mar 2018, 1 commit
  7. 21 Mar 2018, 1 commit
  8. 13 Mar 2018, 1 commit
  9. 08 Mar 2018, 1 commit
  10. 15 Feb 2018, 1 commit
    • bpf: fix bpf_prog_array_copy_to_user warning from perf event prog query · 9c481b90
      Committed by Daniel Borkmann
      syzkaller tried to perform a prog query in perf_event_query_prog_array()
      where struct perf_event_query_bpf had an ids_len of 1,073,741,353, thus
      causing a warning due to a failed kcalloc() allocation out of the
      bpf_prog_array_copy_to_user() helper. Given we cannot attach more than
      64 programs to a perf event, there's no point in allowing a huge ids_len.
      Therefore, allow a buffer that fits the maximum number of ids and
      also add __GFP_NOWARN to the temporary ids buffer.
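
      A hedged sketch of the resulting checks (the constant name and exact
      bound stand in for the per-event program limit; treat them as
      approximations of the actual code):

        if (query.ids_len > BPF_TRACE_MAX_PROGS)        /* e.g. 64 */
                return -E2BIG;

        /* a failed allocation is reported via -ENOMEM, so suppress the
         * kcalloc() warning for absurd, user-controlled sizes */
        ids = kcalloc(query.ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
        if (!ids)
                return -ENOMEM;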
      
      Fixes: f371b304 ("bpf/tracing: allow user space to query prog array on the same tp")
      Fixes: 0911287c ("bpf: fix bpf_prog_array_copy_to_user() issues")
      Reported-by: syzbot+cab5816b0edbabf598b3@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9c481b90
  11. 12 Feb 2018, 1 commit
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Committed by Linus Torvalds
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9a08845
  12. 08 Feb 2018, 2 commits
  13. 06 Feb 2018, 2 commits
  14. 24 Jan 2018, 7 commits
  15. 19 Jan 2018, 2 commits
    • tracing: Fix converting enum's from the map in trace_event_eval_update() · 1ebe1eaf
      Committed by Steven Rostedt (VMware)
      Since enums do not get converted by the TRACE_EVENT macro into their values,
      the event format displays the enum name and not the value. This breaks
      tools like perf and trace-cmd that need to interpret the raw binary data. To
      solve this, an enum map was created to convert these enums into their actual
      numbers on boot up. This is done by TRACE_EVENTS() adding a
      TRACE_DEFINE_ENUM() macro.
      
      Some enums were not being converted. This was caused by an optimization that
      had a bug in it.
      
      All calls get checked against this enum map to see if they should be converted
      or not, and it compares the call's system to the system that the enum map
      was created under. If they match, then the call is processed.
      
      To cut down on the number of iterations needed to find the maps with a
      matching system, since calls and maps are grouped by system, when a match is
      made, the index into the map array is saved, so that the next call, if it
      belongs to the same system as the previous call, could start right at that
      array index and not have to scan all the previous arrays.
      
      The problem was, the saved index was used as the variable to know if this is
      a call in a new system or not. If the index was zero, it was assumed that
      the call is in a new system and would keep incrementing the saved index
      until it found a matching system. The issue arises when the first matching
      system was at index zero. The next map, if it belonged to the same system,
      would then think it was the first match and increment the index to one. If
      the next call belongs to the same system, it would begin its search of the
      maps off by one, and miss the first enum that should be converted. This left
      a single enum not converted properly.
      
      Also add a comment to describe exactly what that index was for. It took me a
      bit too long to figure out what I was thinking when debugging this issue.
      
      Link: http://lkml.kernel.org/r/717BE572-2070-4C1E-9902-9F2E0FEDA4F8@oracle.com
      
      Cc: stable@vger.kernel.org
      Fixes: 0c564a53 ("tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values")
      Reported-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      1ebe1eaf
    • ring-buffer: Fix duplicate results in mapping context to bits in recursive lock · 0164e0d7
      Committed by Steven Rostedt (VMware)
      In bringing back the context checks, the code first checks if it's in normal
      (non-interrupt) context, and then for NMI, then IRQ, then softirq. The final
      check is redundant. Since the if branch is only hit if the context is one of
      NMI, IRQ, or SOFTIRQ, if it's not NMI or IRQ there's no reason to check if
      it is SOFTIRQ. The current code returns the same result even if it's not a
      SOFTIRQ, which is confusing.
      
        pc & SOFTIRQ_OFFSET ? 2 : RB_CTX_SOFTIRQ
      
      Is redundant as RB_CTX_SOFTIRQ *is* 2!
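
      A hedged sketch of the branch in question (simplified from the ring-buffer
      recursion check; this code only runs when pc says the context is NMI,
      hardirq or softirq, so once NMI and hardirq are ruled out the answer is
      softirq anyway):

        if (pc & NMI_MASK)
                bit = RB_CTX_NMI;
        else if (pc & HARDIRQ_MASK)
                bit = RB_CTX_IRQ;
        else
                bit = RB_CTX_SOFTIRQ;   /* was: pc & SOFTIRQ_OFFSET ? 2 : RB_CTX_SOFTIRQ */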
      
      Fixes: a0e3a18f ("ring-buffer: Bring back context level recursive checks")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      0164e0d7
  16. 18 Jan 2018, 1 commit
    • bpf: change fake_ip for bpf_trace_printk helper · eefa864a
      Committed by Yonghong Song
      Currently, for the bpf_trace_printk helper, the fake ip address 0x1
      is used, with comments saying that the fake ip will not be printed.
      This is indeed true for 4.12 and earlier versions, but for
      4.13 and later versions, the ip address will be printed if
      it cannot be resolved with kallsyms. Running the samples/bpf/tracex5
      program will give you the following in the debugfs
      trace_pipe output:
        ...
        <...>-1819  [003] ....   443.497877: 0x00000001: mmap
        <...>-1819  [003] ....   443.498289: 0x00000001: syscall=102 (one of get/set uid/pid/gid)
        ...
      
      The kernel commit that changed this behavior is:
        commit feaf1283
        Author: Steven Rostedt (VMware) <rostedt@goodmis.org>
        Date:   Thu Jun 22 17:04:55 2017 -0400
      
            tracing: Show address when function names are not found
        ...
      
      This patch changed the comment and also altered the fake ip
      address to 0x0 as users may think 0x1 has some special meaning
      while it doesn't. The new output:
        ...
        <...>-1799  [002] ....    25.953576: 0: mmap
        <...>-1799  [002] ....    25.953865: 0: read(fd=0, buf=00000000053936b5, size=512)
        ...
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      eefa864a
  17. 16 Jan 2018, 2 commits
    • tracing: Prevent PROFILE_ALL_BRANCHES when FORTIFY_SOURCE=y · 68e76e03
      Committed by Randy Dunlap
      I regularly get 50 MB - 60 MB files during kernel randconfig builds.
      These large files mostly contain many repeats (e.g., 124,594) of:
      
      In file included from ../include/linux/string.h:6:0,
                       from ../include/linux/uuid.h:20,
                       from ../include/linux/mod_devicetable.h:13,
                       from ../scripts/mod/devicetable-offsets.c:3:
      ../include/linux/compiler.h:64:4: warning: '______f' is static but declared in inline function 'strcpy' which is not static [enabled by default]
          ______f = {     \
          ^
      ../include/linux/compiler.h:56:23: note: in expansion of macro '__trace_if'
                             ^
      ../include/linux/string.h:425:2: note: in expansion of macro 'if'
        if (p_size == (size_t)-1 && q_size == (size_t)-1)
        ^
      
      This only happens when CONFIG_FORTIFY_SOURCE=y and
      CONFIG_PROFILE_ALL_BRANCHES=y, so prevent PROFILE_ALL_BRANCHES if
      FORTIFY_SOURCE=y.
      
      Link: http://lkml.kernel.org/r/9199446b-a141-c0c3-9678-a3f9107f2750@infradead.org
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      68e76e03
    • ring-buffer: Bring back context level recursive checks · a0e3a18f
      Committed by Steven Rostedt (VMware)
      Commit 1a149d7d ("ring-buffer: Rewrite trace_recursive_(un)lock() to be
      simpler") replaced the context level recursion checks with a simple counter.
      This would prevent the ring buffer code from recursively calling itself more
      than the max number of contexts that exist (Normal, softirq, irq, nmi). But
      this change caused a lockup in a specific case, which was during suspend and
      resume using a global clock. Adding a stack dump to see where this occurred,
      the issue was in the trace global clock itself:
      
        trace_buffer_lock_reserve+0x1c/0x50
        __trace_graph_entry+0x2d/0x90
        trace_graph_entry+0xe8/0x200
        prepare_ftrace_return+0x69/0xc0
        ftrace_graph_caller+0x78/0xa8
        queued_spin_lock_slowpath+0x5/0x1d0
        trace_clock_global+0xb0/0xc0
        ring_buffer_lock_reserve+0xf9/0x390
      
      The function graph tracer traced queued_spin_lock_slowpath, which was called
      by trace_clock_global. This pointed out that trace_clock_global() is not
      reentrant, as it takes a spin lock. It depended on the ring buffer recursive
      lock to keep that from happening.

      By removing the context detection and adding just a max number of allowable
      recursions, it allowed trace_clock_global() to be entered again and try
      to retake the spinlock it already held, causing a deadlock.
      
      Fixes: 1a149d7d ("ring-buffer: Rewrite trace_recursive_(un)lock() to be simpler")
      Reported-by: David Weinehall <david.weinehall@gmail.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      a0e3a18f
  18. 13 Jan 2018, 3 commits
  19. 28 Dec 2017, 5 commits
    • tracing: Fix possible double free on failure of allocating trace buffer · 4397f045
      Committed by Steven Rostedt (VMware)
      Jing Xia and Chunyan Zhang reported that on failing to allocate part of the
      tracing buffer, memory is freed, but the pointers that point to it are not
      set back to NULL, and later paths may try to free the freed memory
      again. Jing and Chunyan fixed one of the locations that does this, but
      missed a spot.
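
      Roughly the shape of the missed spot, as a hedged sketch (field names
      approximate the kernel structures):

        ring_buffer_free(tr->trace_buffer.buffer);
        tr->trace_buffer.buffer = NULL;   /* so the outer cleanup path sees there
                                           * is nothing left and does not free
                                           * the same buffer a second time */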
      
      Link: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com
      
      Cc: stable@vger.kernel.org
      Fixes: 737223fb ("tracing: Consolidate buffer allocation code")
      Reported-by: Jing Xia <jing.xia@spreadtrum.com>
      Reported-by: Chunyan Zhang <chunyan.zhang@spreadtrum.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      4397f045
    • tracing: Fix crash when it fails to alloc ring buffer · 24f2aaf9
      Committed by Jing Xia
      A double free of the ring buffer happens when it fails to allocate a new
      ring buffer instance for max_buffer if TRACER_MAX_TRACE is configured.
      The root cause is that the pointer is not set to NULL after the buffer
      is freed in allocate_trace_buffers(), and the freeing of the ring
      buffer is invoked again later if the pointer is not NULL,
      as:
      
      instance_mkdir()
        |-allocate_trace_buffers()
            |-allocate_trace_buffer(tr, &tr->trace_buffer...)
            |-allocate_trace_buffer(tr, &tr->max_buffer...)
            |     // allocation fails (-ENOMEM): first free,
            |     // and the buffer pointer is not set to NULL
            |-ring_buffer_free(tr->trace_buffer.buffer)
        // out_free_tr
        |-free_trace_buffers()
            |-free_trace_buffer(&tr->trace_buffer);
                // if trace_buffer is not NULL, free again
                |-ring_buffer_free(buf->buffer)
                    |-rb_free_cpu_buffer(buffer->buffers[cpu])
                        // ring_buffer_per_cpu is NULL, and
                        // crash in ring_buffer_per_cpu->pages
      
      Link: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com
      
      Cc: stable@vger.kernel.org
      Fixes: 737223fb ("tracing: Consolidate buffer allocation code")
      Signed-off-by: Jing Xia <jing.xia@spreadtrum.com>
      Signed-off-by: Chunyan Zhang <chunyan.zhang@spreadtrum.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      24f2aaf9
    • ring-buffer: Do no reuse reader page if still in use · ae415fa4
      Committed by Steven Rostedt (VMware)
      To free the reader page that is allocated with ring_buffer_alloc_read_page(),
      ring_buffer_free_read_page() must be called. For faster performance, this
      page can be reused by the ring buffer to avoid having to free and allocate
      new pages.
      
      The issue arises when the page is used with a splice pipe into the
      networking code. The networking code may up the page counter for the page
      and keep it active while it is queued to be sent to the network. The
      incrementing of the page ref does not prevent it from being reused in the
      ring buffer, and this can cause the page that is being sent out to the
      network to be modified, by new data being read into it, before it is sent.
      
      Add a check to the page ref counter, and only reuse the page if it is not
      being used anywhere else.
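
      A hedged sketch of the added check (locking omitted; names approximate
      the ring-buffer code):

        struct page *page = virt_to_page(bpage->data);

        /* e.g. a splice into the network stack may still hold a reference */
        if (page_ref_count(page) > 1)
                goto out;               /* don't recycle; just free it normally */

        if (!cpu_buffer->free_page) {
                cpu_buffer->free_page = bpage;  /* safe to reuse for a later read */
                bpage = NULL;
        }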
      
      Cc: stable@vger.kernel.org
      Fixes: 73a757e6 ("ring-buffer: Return reader page back into existing ring buffer")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      ae415fa4
    • tracing: Remove extra zeroing out of the ring buffer page · 6b7e633f
      Committed by Steven Rostedt (VMware)
      The ring_buffer_read_page() takes care of zeroing out any extra data in the
      page that it returns. There's no need to zero it out again from the
      consumer. It was removed from one consumer of this function, but
      read_buffers_splice_read() did not remove it, and worse, it contained a
      nasty bug because of it.
      
      Cc: stable@vger.kernel.org
      Fixes: 2711ca23 ("ring-buffer: Move zeroing out excess in page to ring buffer code")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      6b7e633f
    • ring-buffer: Mask out the info bits when returning buffer page length · 45d8b80c
      Committed by Steven Rostedt (VMware)
      Two info bits were added to the "commit" part of the ring buffer data page
      when returned to be consumed. This was to inform the user space readers that
      events have been missed, and that the count may be stored at the end of the
      page.
      
      What wasn't handled was the splice code that actually called a function to
      return the length of the data in order to zero out the rest of the page
      before sending it up to user space. These info bits were returned with the
      length, making the value negative, and that negative value was not checked.
      It was compared to PAGE_SIZE, and only used if the size was less than
      PAGE_SIZE. Luckily, PAGE_SIZE is an unsigned long, which made the compare an
      unsigned compare, meaning the negative size value did not end up causing a
      large portion of memory to be randomly zeroed out.
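
      A hedged sketch of the masking (the flag and field names are assumptions
      about the ring-buffer code, not quoted from the patch):

        /* the top bits of 'commit' carry missed-events info, not length */
        size_t len = local_read(&bpage->commit) &
                     ~(RB_MISSED_EVENTS | RB_MISSED_STORED);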
      
      Cc: stable@vger.kernel.org
      Fixes: 66a8cb95 ("ring-buffer: Add place holder recording of dropped events")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      45d8b80c
  20. 18 Dec 2017, 1 commit
  21. 15 Dec 2017, 1 commit
  22. 14 Dec 2017, 1 commit
    • bpf/tracing: fix kernel/events/core.c compilation error · f4e2298e
      Committed by Yonghong Song
      Commit f371b304 ("bpf/tracing: allow user space to
      query prog array on the same tp") introduced a perf
      ioctl command to query prog array attached to the
      same perf tracepoint. The commit introduced a
      compilation error under certain config conditions, e.g.,
        (1). CONFIG_BPF_SYSCALL is not defined, or
        (2). CONFIG_TRACING is defined but neither CONFIG_UPROBE_EVENTS
             nor CONFIG_KPROBE_EVENTS is defined.
      
      Error message:
        kernel/events/core.o: In function `perf_ioctl':
        core.c:(.text+0x98c4): undefined reference to `bpf_event_query_prog_array'
      
      This patch fixed this error by guarding the real definition under
      CONFIG_BPF_EVENTS and providing a static inline dummy function
      if CONFIG_BPF_EVENTS is not defined.
      It renamed the function from bpf_event_query_prog_array to
      perf_event_query_prog_array and moved the definition from linux/bpf.h
      to linux/trace_events.h so the definition is in proximity to
      other prog_array-related functions.
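
      The resulting pattern in linux/trace_events.h looks roughly like this
      (the stub's return value is an assumption for the sketch):

        #ifdef CONFIG_BPF_EVENTS
        int perf_event_query_prog_array(struct perf_event *event, void __user *info);
        #else
        static inline int
        perf_event_query_prog_array(struct perf_event *event, void __user *info)
        {
                return -EOPNOTSUPP;
        }
        #endif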
      
      Fixes: f371b304 ("bpf/tracing: allow user space to query prog array on the same tp")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      f4e2298e
  23. 13 Dec 2017, 2 commits
    • bpf: fix corruption on concurrent perf_event_output calls · 283ca526
      Committed by Daniel Borkmann
      When tracing and networking programs are both attached in the
      system and both use event-output helpers that eventually call
      into perf_event_output(), then we could end up in a situation
      where the tracing attached program runs in user context while
      a cls_bpf program is triggered on that same CPU out of softirq
      context.
      
      Since both rely on the same per-cpu perf_sample_data, we could
      potentially corrupt it. This can only ever happen in a combination
      of the two types; all tracing programs use a bpf_prog_active
      counter to bail out in case a program is already running on
      that CPU out of a different context. XDP and cls_bpf programs
      by themselves don't have this issue as they run in the same
      context only. Therefore, split both perf_sample_data so they
      cannot be accessed from each other.
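
      A hedged sketch of the split (variable names are approximations):

        /* one scratch perf_sample_data per path, so a tracing program in task
         * context and a cls_bpf program in softirq context on the same CPU no
         * longer share state */
        static DEFINE_PER_CPU(struct perf_sample_data, bpf_trace_sd);  /* bpf_perf_event_output() */
        static DEFINE_PER_CPU(struct perf_sample_data, bpf_misc_sd);   /* bpf_event_output()      */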
      
      Fixes: 20b9d7ac ("bpf: avoid excessive stack usage for perf_sample_data")
      Reported-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Song Liu <songliubraving@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      283ca526
    • bpf: add a bpf_override_function helper · 9802d865
      Committed by Josef Bacik
      Error injection is sloppy and very ad-hoc.  BPF could fill this niche
      perfectly with its kprobe functionality.  We could make sure errors are
      only triggered in specific call chains that we care about in very
      specific situations.  Accomplish this with the bpf_override_function
      helper.  This will modify the probed caller's return value to the
      specified value and set the PC to an override function that simply
      returns, bypassing the originally probed function.  This gives us a nice
      clean way to implement systematic error injection for all of our code
      paths.
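
      As a hedged illustration only (the helper is exposed to programs as
      bpf_override_return(); the SEC() annotation, the probed function and the
      headers are assumptions for the example, not part of this commit):

        /* clang -target bpf; assumes <linux/bpf.h> and a bpf_helpers.h */
        SEC("kprobe/open_ctree")
        int override_open_ctree(struct pt_regs *ctx)
        {
                unsigned long rc = -12;         /* -ENOMEM */

                /* make the probed function return -ENOMEM without executing it */
                bpf_override_return(ctx, rc);
                return 0;
        }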
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9802d865