1. 30 6月, 2016 3 次提交
  2. 20 6月, 2016 1 次提交
  3. 16 6月, 2016 3 次提交
    • D
      bpf, maps: flush own entries on perf map release · 3b1efb19
      Daniel Borkmann 提交于
      The behavior of perf event arrays are quite different from all
      others as they are tightly coupled to perf event fds, f.e. shown
      recently by commit e03e7ee3 ("perf/bpf: Convert perf_event_array
      to use struct file") to make refcounting on perf event more robust.
      A remaining issue that the current code still has is that since
      additions to the perf event array take a reference on the struct
      file via perf_event_get() and are only released via fput() (that
      cleans up the perf event eventually via perf_event_release_kernel())
      when the element is either manually removed from the map from user
      space or automatically when the last reference on the perf event
      map is dropped. However, this leads us to dangling struct file's
      when the map gets pinned after the application owning the perf
      event descriptor exits, and since the struct file reference will
      in such case only be manually dropped or via pinned file removal,
      it leads to the perf event living longer than necessary, consuming
      needlessly resources for that time.
      
      Relations between perf event fds and bpf perf event map fds can be
      rather complex. F.e. maps can act as demuxers among different perf
      event fds that can possibly be owned by different threads and based
      on the index selection from the program, events get dispatched to
      one of the per-cpu fd endpoints. One perf event fd (or, rather a
      per-cpu set of them) can also live in multiple perf event maps at
      the same time, listening for events. Also, another requirement is
      that perf event fds can get closed from application side after they
      have been attached to the perf event map, so that on exit perf event
      map will take care of dropping their references eventually. Likewise,
      when such maps are pinned, the intended behavior is that a user
      application does bpf_obj_get(), puts its fds in there and on exit
      when fd is released, they are dropped from the map again, so the map
      acts rather as connector endpoint. This also makes perf event maps
      inherently different from program arrays as described in more detail
      in commit c9da161c ("bpf: fix clearing on persistent program
      array maps").
      
      To tackle this, map entries are marked by the map struct file that
      added the element to the map. And when the last reference to that map
      struct file is released from user space, then the tracked entries
      are purged from the map. This is okay, because new map struct files
      instances resp. frontends to the anon inode are provided via
      bpf_map_new_fd() that is called when we invoke bpf_obj_get_user()
      for retrieving a pinned map, but also when an initial instance is
      created via map_create(). The rest is resolved by the vfs layer
      automatically for us by keeping reference count on the map's struct
      file. Any concurrent updates on the map slot are fine as well, it
      just means that perf_event_fd_array_release() needs to delete less
      of its own entires.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b1efb19
    • A
      bpf, trace: check event type in bpf_perf_event_read · ad572d17
      Alexei Starovoitov 提交于
      similar to bpf_perf_event_output() the bpf_perf_event_read() helper
      needs to check the type of the perf_event before reading the counter.
      
      Fixes: a43eec30 ("bpf: introduce bpf_perf_event_output() helper")
      Reported-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ad572d17
    • A
      bpf: fix matching of data/data_end in verifier · 19de99f7
      Alexei Starovoitov 提交于
      The ctx structure passed into bpf programs is different depending on bpf
      program type. The verifier incorrectly marked ctx->data and ctx->data_end
      access based on ctx offset only. That caused loads in tracing programs
      int bpf_prog(struct pt_regs *ctx) { .. ctx->ax .. }
      to be incorrectly marked as PTR_TO_PACKET which later caused verifier
      to reject the program that was actually valid in tracing context.
      Fix this by doing program type specific matching of ctx offsets.
      
      Fixes: 969bf05e ("bpf: direct packet access")
      Reported-by: NSasha Goldshtein <goldshtn@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19de99f7
  4. 08 6月, 2016 1 次提交
  5. 21 5月, 2016 1 次提交
    • S
      ftrace: Don't disable irqs when taking the tasklist_lock read_lock · 6112a300
      Soumya PN 提交于
      In ftrace.c inside the function alloc_retstack_tasklist() (which will be
      invoked when function_graph tracing is on) the tasklist_lock is being
      held as reader while iterating through a list of threads. Here the lock
      is being held as reader with irqs disabled. The tasklist_lock is never
      write_locked in interrupt context so it is safe to not disable interrupts
      for the duration of read_lock in this block which, can be significant,
      given the block of code iterates through all threads. Hence changing the
      code to call read_lock() and read_unlock() instead of read_lock_irqsave()
      and read_unlock_irqrestore().
      
      A similar change was made in commits: 8063e41d ("tracing: Change
      syscall_*regfunc() to check PF_KTHREAD and use for_each_process_thread()")'
      and 3472eaa1 ("sched: normalize_rt_tasks(): Don't use _irqsave for
      tasklist_lock, use task_rq_lock()")'
      
      Link: http://lkml.kernel.org/r/1463500874-77480-1-git-send-email-soumya.p.n@hpe.comSigned-off-by: NSoumya PN <soumya.p.n@hpe.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      6112a300
  6. 14 5月, 2016 1 次提交
    • S
      ring-buffer: Prevent overflow of size in ring_buffer_resize() · 59643d15
      Steven Rostedt (Red Hat) 提交于
      If the size passed to ring_buffer_resize() is greater than MAX_LONG - BUF_PAGE_SIZE
      then the DIV_ROUND_UP() will return zero.
      
      Here's the details:
      
        # echo 18014398509481980 > /sys/kernel/debug/tracing/buffer_size_kb
      
      tracing_entries_write() processes this and converts kb to bytes.
      
       18014398509481980 << 10 = 18446744073709547520
      
      and this is passed to ring_buffer_resize() as unsigned long size.
      
       size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
      
      Where DIV_ROUND_UP(a, b) is (a + b - 1)/b
      
      BUF_PAGE_SIZE is 4080 and here
      
       18446744073709547520 + 4080 - 1 = 18446744073709551599
      
      where 18446744073709551599 is still smaller than 2^64
      
       2^64 - 18446744073709551599 = 17
      
      But now 18446744073709551599 / 4080 = 4521260802379792
      
      and size = size * 4080 = 18446744073709551360
      
      This is checked to make sure its still greater than 2 * 4080,
      which it is.
      
      Then we convert to the number of buffer pages needed.
      
       nr_page = DIV_ROUND_UP(size, BUF_PAGE_SIZE)
      
      but this time size is 18446744073709551360 and
      
       2^64 - (18446744073709551360 + 4080 - 1) = -3823
      
      Thus it overflows and the resulting number is less than 4080, which makes
      
        3823 / 4080 = 0
      
      an nr_pages is set to this. As we already checked against the minimum that
      nr_pages may be, this causes the logic to fail as well, and we crash the
      kernel.
      
      There's no reason to have the two DIV_ROUND_UP() (that's just result of
      historical code changes), clean up the code and fix this bug.
      
      Cc: stable@vger.kernel.org # 3.5+
      Fixes: 83f40318 ("ring-buffer: Make removal of ring buffer pages atomic")
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      59643d15
  7. 13 5月, 2016 1 次提交
    • S
      ring-buffer: Use long for nr_pages to avoid overflow failures · 9b94a8fb
      Steven Rostedt (Red Hat) 提交于
      The size variable to change the ring buffer in ftrace is a long. The
      nr_pages used to update the ring buffer based on the size is int. On 64 bit
      machines this can cause an overflow problem.
      
      For example, the following will cause the ring buffer to crash:
      
       # cd /sys/kernel/debug/tracing
       # echo 10 > buffer_size_kb
       # echo 8556384240 > buffer_size_kb
      
      Then you get the warning of:
      
       WARNING: CPU: 1 PID: 318 at kernel/trace/ring_buffer.c:1527 rb_update_pages+0x22f/0x260
      
      Which is:
      
        RB_WARN_ON(cpu_buffer, nr_removed);
      
      Note each ring buffer page holds 4080 bytes.
      
      This is because:
      
       1) 10 causes the ring buffer to have 3 pages.
          (10kb requires 3 * 4080 pages to hold)
      
       2) (2^31 / 2^10  + 1) * 4080 = 8556384240
          The value written into buffer_size_kb is shifted by 10 and then passed
          to ring_buffer_resize(). 8556384240 * 2^10 = 8761737461760
      
       3) The size passed to ring_buffer_resize() is then divided by BUF_PAGE_SIZE
          which is 4080. 8761737461760 / 4080 = 2147484672
      
       4) nr_pages is subtracted from the current nr_pages (3) and we get:
          2147484669. This value is saved in a signed integer nr_pages_to_update
      
       5) 2147484669 is greater than 2^31 but smaller than 2^32, a signed int
          turns into the value of -2147482627
      
       6) As the value is a negative number, in update_pages_handler() it is
          negated and passed to rb_remove_pages() and 2147482627 pages will
          be removed, which is much larger than 3 and it causes the warning
          because not all the pages asked to be removed were removed.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=118001
      
      Cc: stable@vger.kernel.org # 2.6.28+
      Fixes: 7a8e76a3 ("tracing: unified trace buffer")
      Reported-by: NHao Qin <QEver.cn@gmail.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      9b94a8fb
  8. 10 5月, 2016 2 次提交
  9. 04 5月, 2016 2 次提交
    • S
      tracing: Use temp buffer when filtering events · 0fc1b09f
      Steven Rostedt (Red Hat) 提交于
      Filtering of events requires the data to be written to the ring buffer
      before it can be decided to filter or not. This is because the parameters of
      the filter are based on the result that is written to the ring buffer and
      not on the parameters that are passed into the trace functions.
      
      The ftrace ring buffer is optimized for writing into the ring buffer and
      committing. The discard procedure used when filtering decides the event
      should be discarded is much more heavy weight. Thus, using a temporary
      filter when filtering events can speed things up drastically.
      
      Without a temp buffer we have:
      
       # trace-cmd start -p nop
       # perf stat -r 10 hackbench 50
             0.790706626 seconds time elapsed ( +-  0.71% )
      
       # trace-cmd start -e all
       # perf stat -r 10 hackbench 50
             1.566904059 seconds time elapsed ( +-  0.27% )
      
       # trace-cmd start -e all -f 'common_preempt_count==20'
       # perf stat -r 10 hackbench 50
             1.690598511 seconds time elapsed ( +-  0.19% )
      
       # trace-cmd start -e all -f 'common_preempt_count!=20'
       # perf stat -r 10 hackbench 50
             1.707486364 seconds time elapsed ( +-  0.30% )
      
      The first run above is without any tracing, just to get a based figure.
      hackbench takes ~0.79 seconds to run on the system.
      
      The second run enables tracing all events where nothing is filtered. This
      increases the time by 100% and hackbench takes 1.57 seconds to run.
      
      The third run filters all events where the preempt count will equal "20"
      (this should never happen) thus all events are discarded. This takes 1.69
      seconds to run. This is 10% slower than just committing the events!
      
      The last run enables all events and filters where the filter will commit all
      events, and this takes 1.70 seconds to run. The filtering overhead is
      approximately 10%. Thus, the discard and commit of an event from the ring
      buffer may be about the same time.
      
      With this patch, the numbers change:
      
       # trace-cmd start -p nop
       # perf stat -r 10 hackbench 50
             0.778233033 seconds time elapsed ( +-  0.38% )
      
       # trace-cmd start -e all
       # perf stat -r 10 hackbench 50
             1.582102692 seconds time elapsed ( +-  0.28% )
      
       # trace-cmd start -e all -f 'common_preempt_count==20'
       # perf stat -r 10 hackbench 50
             1.309230710 seconds time elapsed ( +-  0.22% )
      
       # trace-cmd start -e all -f 'common_preempt_count!=20'
       # perf stat -r 10 hackbench 50
             1.786001924 seconds time elapsed ( +-  0.20% )
      
      The first run is again the base with no tracing.
      
      The second run is all tracing with no filtering. It is a little slower, but
      that may be well within the noise.
      
      The third run shows that discarding all events only took 1.3 seconds. This
      is a speed up of 23%! The discard is much faster than even the commit.
      
      The one downside is shown in the last run. Events that are not discarded by
      the filter will take longer to add, this is due to the extra copy of the
      event.
      
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      0fc1b09f
    • C
      tracing: Don't display trigger file for events that can't be enabled · 854145e0
      Chunyu Hu 提交于
      Currently register functions for events will be called
      through the 'reg' field of event class directly without
      any check when seting up triggers.
      
      Triggers for events that don't support register through
      debug fs (events under events/ftrace are for trace-cmd to
      read event format, and most of them don't have a register
      function except events/ftrace/functionx) can't be enabled
      at all, and an oops will be hit when setting up trigger
      for those events, so just not creating them is an easy way
      to avoid the oops.
      
      Link: http://lkml.kernel.org/r/1462275274-3911-1-git-send-email-chuhu@redhat.com
      
      Cc: stable@vger.kernel.org # 3.14+
      Fixes: 85f2b082 ("tracing: Add basic event trigger framework")
      Signed-off-by: NChunyu Hu <chuhu@redhat.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      854145e0
  10. 03 5月, 2016 1 次提交
  11. 30 4月, 2016 6 次提交
  12. 27 4月, 2016 4 次提交
  13. 26 4月, 2016 4 次提交
  14. 20 4月, 2016 10 次提交