1. 23 November 2017, 3 commits
• bpf: change bpf_perf_event_output arg5 type to ARG_CONST_SIZE_OR_ZERO · a60dd35d
  Authored by Gianluca Borello
      Commit 9fd29c08 ("bpf: improve verifier ARG_CONST_SIZE_OR_ZERO
      semantics") relaxed the treatment of ARG_CONST_SIZE_OR_ZERO due to the way
      the compiler generates optimized BPF code when checking boundaries of an
      argument from C code. A typical example of this optimized code can be
      generated using the bpf_perf_event_output helper when operating on variable
      memory:
      
      /* len is a generic scalar */
      if (len > 0 && len <= 0x7fff)
              bpf_perf_event_output(ctx, &perf_map, 0, buf, len);
      
      110: (79) r5 = *(u64 *)(r10 -40)
      111: (bf) r1 = r5
      112: (07) r1 += -1
      113: (25) if r1 > 0x7ffe goto pc+6
      114: (bf) r1 = r6
      115: (18) r2 = 0xffff94e5f166c200
      117: (b7) r3 = 0
      118: (bf) r4 = r7
      119: (85) call bpf_perf_event_output#25
      R5 min value is negative, either use unsigned or 'var &= const'
      
      With this code, the verifier loses track of the variable.
      
Changing the type of arg5 to ARG_CONST_SIZE_OR_ZERO is thus desirable since it
avoids this quite common case, which leads to usability issues, and the
compiler generates code that the verifier can more easily test:
      
      if (len <= 0x7fff)
              bpf_perf_event_output(ctx, &perf_map, 0, buf, len);
      
      or
      
      bpf_perf_event_output(ctx, &perf_map, 0, buf, len & 0x7fff);
      
      No changes to the bpf_perf_event_output helper are necessary since it can
      handle a case where size is 0, and an empty frame is pushed.
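
For illustration only (not part of the original patch), a minimal samples/bpf-style
sketch of the now-accepted pattern; the map name, attach point and buffer size are
assumptions:

#include <linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") perf_map = {
	.type        = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size    = sizeof(int),
	.value_size  = sizeof(__u32),
	.max_entries = 64,	/* one slot per possible CPU */
};

SEC("kprobe/sys_write")
int trace_write(struct pt_regs *ctx)
{
	char buf[64] = {};
	__u64 len = PT_REGS_PARM3(ctx);

	bpf_probe_read(buf, sizeof(buf), (void *)PT_REGS_PARM2(ctx));

	/* With ARG_CONST_SIZE_OR_ZERO a plain upper-bound check (or masking
	 * with 'len &= sizeof(buf) - 1') is enough; len == 0 is accepted.
	 */
	if (len <= sizeof(buf))
		bpf_perf_event_output(ctx, &perf_map, BPF_F_CURRENT_CPU,
				      buf, len);
	return 0;
}

char _license[] SEC("license") = "GPL";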
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      a60dd35d
• bpf: change bpf_probe_read_str arg2 type to ARG_CONST_SIZE_OR_ZERO · 5c4e1201
  Authored by Gianluca Borello
      Commit 9fd29c08 ("bpf: improve verifier ARG_CONST_SIZE_OR_ZERO
      semantics") relaxed the treatment of ARG_CONST_SIZE_OR_ZERO due to the way
      the compiler generates optimized BPF code when checking boundaries of an
      argument from C code. A typical example of this optimized code can be
      generated using the bpf_probe_read_str helper when operating on variable
      memory:
      
      /* len is a generic scalar */
      if (len > 0 && len <= 0x7fff)
              bpf_probe_read_str(p, len, s);
      
      251: (79) r1 = *(u64 *)(r10 -88)
      252: (07) r1 += -1
      253: (25) if r1 > 0x7ffe goto pc-42
      254: (bf) r1 = r7
      255: (79) r2 = *(u64 *)(r10 -88)
      256: (bf) r8 = r4
      257: (85) call bpf_probe_read_str#45
      R2 min value is negative, either use unsigned or 'var &= const'
      
      With this code, the verifier loses track of the variable.
      
Changing the type of arg2 to ARG_CONST_SIZE_OR_ZERO is thus desirable since it
avoids this quite common case, which leads to usability issues, and the
compiler generates code that the verifier can more easily test:
      
      if (len <= 0x7fff)
              bpf_probe_read_str(p, len, s);
      
      or
      
      bpf_probe_read_str(p, len & 0x7fff, s);
      
      No changes to the bpf_probe_read_str helper are necessary since
      strncpy_from_unsafe itself immediately returns if the size passed is 0.
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      5c4e1201
• bpf: remove explicit handling of 0 for arg2 in bpf_probe_read · eb33f2cc
  Authored by Gianluca Borello
      Commit 9c019e2b ("bpf: change helper bpf_probe_read arg2 type to
      ARG_CONST_SIZE_OR_ZERO") changed arg2 type to ARG_CONST_SIZE_OR_ZERO to
      simplify writing bpf programs by taking advantage of the new semantics
      introduced for ARG_CONST_SIZE_OR_ZERO which allows <!NULL, 0> arguments.
      
      In order to prevent the helper from actually passing a NULL pointer to
      probe_kernel_read, which can happen when <NULL, 0> is passed to the helper,
      the commit also introduced an explicit check against size == 0.
      
After the recent introduction of the ARG_PTR_TO_MEM_OR_NULL type,
bpf_probe_read cannot receive a pair of <NULL, 0> arguments anymore, thus
the check is no longer needed and can be removed, since probe_kernel_read
can correctly handle a <!NULL, 0> call. This also fixes the semantics of
the helper before it gets officially released and before bpf programs
start relying on this check.
      
      Fixes: 9c019e2b ("bpf: change helper bpf_probe_read arg2 type to ARG_CONST_SIZE_OR_ZERO")
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      eb33f2cc
2. 14 November 2017, 1 commit
3. 11 November 2017, 2 commits
4. 01 November 2017, 1 commit
5. 27 October 2017, 2 commits
• bpf: remove tail_call and get_stackid helper declarations from bpf.h · 035226b9
  Authored by Gianluca Borello
      commit afdb09c7 ("security: bpf: Add LSM hooks for bpf object related
      syscall") included linux/bpf.h in linux/security.h. As a result, bpf
      programs including bpf_helpers.h and some other header that ends up
      pulling in also security.h, such as several examples under samples/bpf,
      fail to compile because bpf_tail_call and bpf_get_stackid are now
      "redefined as different kind of symbol".
      
From bpf.h:
      
      u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5);
      u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
      
      Whereas in bpf_helpers.h they are:
      
      static void (*bpf_tail_call)(void *ctx, void *map, int index);
      static int (*bpf_get_stackid)(void *ctx, void *map, int flags);
      
Fix this by removing the unused declaration of bpf_tail_call and moving
the declaration of bpf_get_stackid into bpf_trace.c, which is the only
place where it's needed.
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
      035226b9
• perf/bpf: Extend the perf_event_read_local() interface, a.k.a. "bpf: perf... · 7d9285e8
  Authored by Yonghong Song
      perf/bpf: Extend the perf_event_read_local() interface, a.k.a. "bpf: perf event change needed for subsequent bpf helpers"
      
      eBPF programs would like access to the (perf) event enabled and
      running times along with the event value, such that they can deal with
      event multiplexing (among other things).
      
      This patch extends the interface; a future eBPF patch will utilize
      the new functionality.
      
      [ Note, there's a same-content commit with a poor changelog and a meaningless
        title in the networking tree as well - but we need this change for subsequent
        perf work, so apply it here as well, with a proper changelog. Hopefully Git
        will be able to sort out this somewhat messy workflow, if there are no other,
        conflicting changes to these files. ]
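
A rough sketch of the extended kernel-internal interface (the exact prototype
lives in include/linux/perf_event.h):

/* Counter value plus enabled/running times; callers not interested in
 * the times can pass NULL, e.g. the bpf_perf_event_read() helper can do:
 *
 *	err = perf_event_read_local(event, &value, NULL, NULL);
 */
int perf_event_read_local(struct perf_event *event, u64 *value,
			  u64 *enabled, u64 *running);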
Signed-off-by: Yonghong Song <yhs@fb.com>
[ Rewrote the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <ast@fb.com>
      Cc: <daniel@iogearbox.net>
      Cc: <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David S. Miller <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20171005161923.332790-2-yhs@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7d9285e8
6. 25 October 2017, 1 commit
• bpf: permit multiple bpf attachments for a single perf event · e87c6bc3
  Authored by Yonghong Song
This patch enables multiple bpf attachments for a single
kprobe/uprobe/tracepoint trace event.
      Each trace_event keeps a list of attached perf events.
      When an event happens, all attached bpf programs will
      be executed based on the order of attachment.
      
      A global bpf_event_mutex lock is introduced to protect
prog_array attaching and detaching. An alternative would be to
introduce a mutex lock in every trace_event_call structure, but
that would take a lot of extra memory, so a global bpf_event_mutex
lock is a good compromise.
      
      The bpf prog detachment involves allocation of memory.
If the allocation fails, a dummy do-nothing program
will replace the to-be-detached program in place.
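
Conceptually, event dispatch then behaves like the sketch below; the names are
purely illustrative and do not claim to match the kernel's internal prog_array API:

/* Illustrative only: run every attached program in attachment order and
 * AND their return values together, under RCU protection.
 */
static unsigned int run_attached(struct bpf_prog *progs[], void *ctx)
{
	unsigned int ret = 1;
	int i;

	rcu_read_lock();
	for (i = 0; progs[i]; i++)			/* NULL-terminated array */
		ret &= run_one_prog(progs[i], ctx);	/* hypothetical runner */
	rcu_read_unlock();

	return ret;
}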
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      e87c6bc3
7. 18 October 2017, 1 commit
8. 08 October 2017, 3 commits
• bpf: add helper bpf_perf_prog_read_value · 4bebdc7a
  Authored by Yonghong Song
This patch adds helper bpf_perf_prog_read_value for perf event based bpf
      programs, to read event counter and enabled/running time.
      The enabled/running time is accumulated since the perf event open.
      
      The typical use case for perf event based bpf program is to attach itself
to a single event. In such cases, if it is desirable to get the scaling factor
between two bpf invocations, users can save the time values in a map,
and use the value from the map and the current value to calculate
      the scaling factor.
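
A hedged sketch of that map-based approach in samples/bpf style; the map layout
and names are assumptions, and bpf_perf_prog_read_value is assumed to be declared
in bpf_helpers.h:

#include <uapi/linux/bpf.h>
#include <uapi/linux/bpf_perf_event.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") prev_values = {
	.type        = BPF_MAP_TYPE_ARRAY,
	.key_size    = sizeof(__u32),
	.value_size  = sizeof(struct bpf_perf_event_value),
	.max_entries = 1,
};

SEC("perf_event")
int on_sample(struct bpf_perf_event_data *ctx)
{
	struct bpf_perf_event_value cur, *last;
	__u32 key = 0;

	if (bpf_perf_prog_read_value(ctx, &cur, sizeof(cur)))
		return 0;

	last = bpf_map_lookup_elem(&prev_values, &key);
	if (last) {
		char fmt[] = "enabled delta %llu, running delta %llu\n";

		/* scaling factor = (enabled delta) / (running delta) */
		bpf_trace_printk(fmt, sizeof(fmt),
				 cur.enabled - last->enabled,
				 cur.running - last->running);
	}
	bpf_map_update_elem(&prev_values, &key, &cur, BPF_ANY);
	return 0;
}

char _license[] SEC("license") = "GPL";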
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
      4bebdc7a
• bpf: add helper bpf_perf_event_read_value for perf event array map · 908432ca
  Authored by Yonghong Song
      Hardware pmu counters are limited resources. When there are more
      pmu based perf events opened than available counters, kernel will
      multiplex these events so each event gets certain percentage
(but not 100%) of the pmu time. When multiplexing happens, the number
of samples or the counter value will not reflect what would be seen
without multiplexing, which makes comparisons between different runs
difficult.
      
      Typically, the number of samples or counter value should be
      normalized before comparing to other experiments. The typical
      normalization is done like:
        normalized_num_samples = num_samples * time_enabled / time_running
        normalized_counter_value = counter_value * time_enabled / time_running
      where time_enabled is the time enabled for event and time_running is
      the time running for event since last normalization.
      
This patch adds the helper bpf_perf_event_read_value for kprobe based bpf
programs using a perf event array map, to read the perf counter and
enabled/running time.
      The enabled/running time is accumulated since the perf event open.
To compute the scaling factor between two bpf invocations, users
can use cpu_id as the key (which is typical for the perf array usage model)
      to remember the previous value and do the calculation inside the
      bpf program.
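
For illustration (not from the patch), a compact kprobe-style sketch applying
this normalization; the map name, sizes and attach point are assumptions, and
user space is assumed to have populated the perf event array per CPU:

#include <linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") counters = {
	.type        = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size    = sizeof(int),
	.value_size  = sizeof(__u32),
	.max_entries = 64,	/* one perf event fd per CPU */
};

SEC("kprobe/sys_getpid")
int read_counter(struct pt_regs *ctx)
{
	struct bpf_perf_event_value v;
	char fmt[] = "normalized counter %llu\n";

	if (bpf_perf_event_read_value(&counters, BPF_F_CURRENT_CPU,
				      &v, sizeof(v)) || !v.running)
		return 0;

	/* normalized_counter_value = counter_value * time_enabled / time_running */
	bpf_trace_printk(fmt, sizeof(fmt), v.counter * v.enabled / v.running);
	return 0;
}

char _license[] SEC("license") = "GPL";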
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
      908432ca
• bpf: perf event change needed for subsequent bpf helpers · 97562633
  Authored by Yonghong Song
      This patch does not impact existing functionalities.
      It contains the changes in perf event area needed for
      subsequent bpf_perf_event_read_value and
      bpf_perf_prog_read_value helpers.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
      97562633
9. 16 August 2017, 1 commit
• bpf: fix bpf_trace_printk on 32 bit archs · 88a5c690
  Authored by Daniel Borkmann
      James reported that on MIPS32 bpf_trace_printk() is currently
      broken while MIPS64 works fine:
      
        bpf_trace_printk() uses conditional operators to attempt to
        pass different types to __trace_printk() depending on the
        format operators. This doesn't work as intended on 32-bit
        architectures where u32 and long are passed differently to
        u64, since the result of C conditional operators follows the
        "usual arithmetic conversions" rules, such that the values
        passed to __trace_printk() will always be u64 [causing issues
        later in the va_list handling for vscnprintf()].
      
        For example the samples/bpf/tracex5 test printed lines like
        below on MIPS32, where the fd and buf have come from the u64
        fd argument, and the size from the buf argument:
      
          [...] 1180.941542: 0x00000001: write(fd=1, buf=  (null), size=6258688)
      
        Instead of this:
      
          [...] 1625.616026: 0x00000001: write(fd=1, buf=009e4000, size=512)
      
One way to get it working is to expand the various combinations
of argument types into 8 different combinations for 32-bit and
64-bit kernels. The fix was tested by James on both MIPS32 and
MIPS64 and resolves the issue.
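
The conversion rule itself can be seen in plain C (illustrative, not from the
patch): the type of a conditional expression is the common type of both
branches, so a u32 value gets widened to u64 before it ever reaches the
varargs call.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t small = 512;
	uint64_t big = 1;
	int pick_small = 1;

	/* Usual arithmetic conversions: the ?: result has type uint64_t even
	 * when the uint32_t branch is taken, which on 32-bit targets changes
	 * how the value is passed through varargs.
	 */
	printf("sizeof(?: result) = %zu\n",
	       sizeof(pick_small ? small : big));	/* prints 8 */
	return 0;
}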
      
      Fixes: 9c959c86 ("tracing: Allow BPF programs to call bpf_trace_printk()")
Reported-by: James Hogan <james.hogan@imgtec.com>
Tested-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
      88a5c690
10. 03 July 2017, 2 commits
• bpf: extend bpf_trace_printk to support %i · 7bda4b40
  Authored by John Fastabend
Currently, bpf_trace_printk does not support the common formatting
symbol '%i', although vsprintf, which is what the bpf helper
eventually calls, does. If users are used to '%i' and currently
make use of it, bpf_trace_printk will just return an error
without dumping anything to the trace pipe, so just add
support for '%i' to the helper.
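
A minimal usage fragment (assuming the usual samples/bpf bpf_trace_printk stub
and surrounding program context) that works once '%i' is accepted:

	/* '%i' is now handled like '%d' by the helper's format parser. */
	char fmt[] = "opened fd=%i by pid=%i\n";
	int fd = 42, pid = 1000;	/* placeholder values for the sketch */

	bpf_trace_printk(fmt, sizeof(fmt), fd, pid);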
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
      7bda4b40
• bpf: simplify narrower ctx access · f96da094
  Authored by Daniel Borkmann
      This work tries to make the semantics and code around the
      narrower ctx access a bit easier to follow. Right now
      everything is done inside the .is_valid_access(). Offset
      matching is done differently for read/write types, meaning
      writes don't support narrower access and thus matching only
      on offsetof(struct foo, bar) is enough whereas for read
case that supports narrower access we must check the range from
offsetof(struct foo, bar) to offsetof(struct foo, bar) +
sizeof(<bar>) - 1 for each of the cases. For read cases of
      individual members that don't support narrower access (like
      packet pointers or skb->cb[] case which has its own narrow
      access logic), we check as usual only offsetof(struct foo,
      bar) like in write case. Then, for the case where narrower
      access is allowed, we also need to set the aux info for the
      access. Meaning, ctx_field_size and converted_op_size have
      to be set. First is the original field size e.g. sizeof(<bar>)
      as in above example from the user facing ctx, and latter
      one is the target size after actual rewrite happened, thus
for the kernel facing ctx. Also here we need the range to match,
and we need to keep convert_ctx_access() and the converted_op_size
set in is_valid_access() in sync, as both are not at the same
location.
      
      We can simplify the code a bit: check_ctx_access() becomes
      simpler in that we only store ctx_field_size as a meta data
      and later in convert_ctx_accesses() we fetch the target_size
      right from the location where we do convert. Should the verifier
be misconfigured, we reject BPF_WRITE cases or a target_size
that is not provided. For the subsystems, we always work on
      ranges in is_valid_access() and add small helpers for ranges
      and narrow access, convert_ctx_accesses() sets target_size
      for the relevant instruction.
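
The range helpers used by the subsystems' is_valid_access() callbacks are
roughly of the following shape (a sketch relying on the GCC/clang case-range
extension; see include/linux/filter.h for the real definitions):

/* Sketch: match a whole user-visible ctx field as an offset range. */
#define bpf_ctx_range(TYPE, MEMBER)					\
	offsetof(TYPE, MEMBER) ...					\
	offsetof(TYPE, MEMBER) + FIELD_SIZEOF(TYPE, MEMBER) - 1

/* Typical use in a prog type's is_valid_access() callback: */
static bool example_is_valid_access(int off)
{
	switch (off) {
	case bpf_ctx_range(struct __sk_buff, len):
		/* whole-field or narrower read of skb->len is fine;
		 * ctx_field_size is recorded as meta data for the verifier.
		 */
		return true;
	default:
		return false;
	}
}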
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      f96da094
11. 24 June 2017, 1 commit
• bpf: possibly avoid extra masking for narrower load in verifier · 23994631
  Authored by Yonghong Song
      Commit 31fd8581 ("bpf: permits narrower load from bpf program
      context fields") permits narrower load for certain ctx fields.
The commit, however, will already generate a masking operation even if
the prog-specific ctx conversion produces the result with the
narrower size.
      
      For example, for __sk_buff->protocol, the ctx conversion
      loads the data into register with 2-byte load.
      A narrower 2-byte load should not generate masking.
For __sk_buff->vlan_present, the conversion function
sets the result to either 0 or 1, essentially a byte.
A narrower 2-byte or 1-byte load should not generate masking either.
      
      To avoid unnecessary masking, prog-specific *_is_valid_access
      now passes converted_op_size back to verifier, which indicates
      the valid data width after perceived future conversion.
Based on this information, the verifier is able to avoid
unnecessary masking.
      
      Since we want more information back from prog-specific
      *_is_valid_access checking, all of them are packed into
      one data structure for more clarity.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      23994631
12. 15 June 2017, 1 commit
• bpf: permits narrower load from bpf program context fields · 31fd8581
  Authored by Yonghong Song
Currently, the verifier will reject a program if it contains a
narrower load from the bpf context structure. For example,
        __u8 h = __sk_buff->hash, or
        __u16 p = __sk_buff->protocol, or
        __u32 sample_period = bpf_perf_event_data->sample_period
which are narrower loads of 4-byte or 8-byte fields.
      
      This patch solves the issue by:
        . Introduce a new parameter ctx_field_size to carry the
          field size of narrower load from prog type
          specific *__is_valid_access validator back to verifier.
        . The non-zero ctx_field_size for a memory access indicates
          (1). underlying prog type specific convert_ctx_accesses
               supporting non-whole-field access
          (2). the current insn is a narrower or whole field access.
        . In verifier, for such loads where load memory size is
          less than ctx_field_size, verifier transforms it
          to a full field load followed by proper masking.
. Currently, __sk_buff and bpf_perf_event_data->sample_period
  support narrower loads.
        . Narrower stores are still not allowed as typical ctx stores
          are just normal stores.
      
      Because of this change, some tests in verifier will fail and
these tests are removed. As a bonus, rename some out-of-bounds
__sk_buff->cb accesses to proper field names and remove two
redundant "skb cb oob" tests.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      31fd8581
13. 11 June 2017, 1 commit
14. 05 June 2017, 1 commit
15. 12 April 2017, 1 commit
16. 29 March 2017, 1 commit
17. 18 February 2017, 1 commit
18. 21 January 2017, 1 commit
• bpf: add bpf_probe_read_str helper · a5e8c070
  Authored by Gianluca Borello
      Provide a simple helper with the same semantics of strncpy_from_unsafe():
      
      int bpf_probe_read_str(void *dst, int size, const void *unsafe_addr)
      
      This gives more flexibility to a bpf program. A typical use case is
      intercepting a file name during sys_open(). The current approach is:
      
      SEC("kprobe/sys_open")
      void bpf_sys_open(struct pt_regs *ctx)
      {
      	char buf[PATHLEN]; // PATHLEN is defined to 256
      	bpf_probe_read(buf, sizeof(buf), ctx->di);
      
      	/* consume buf */
      }
      
      This is suboptimal because the size of the string needs to be estimated
      at compile time, causing more memory to be copied than often necessary,
      and can become more problematic if further processing on buf is done,
      for example by pushing it to userspace via bpf_perf_event_output(),
      since the real length of the string is unknown and the entire buffer
      must be copied (and defining an unrolled strnlen() inside the bpf
      program is a very inefficient and unfeasible approach).
      
      With the new helper, the code can easily operate on the actual string
      length rather than the buffer size:
      
      SEC("kprobe/sys_open")
      void bpf_sys_open(struct pt_regs *ctx)
      {
      	char buf[PATHLEN]; // PATHLEN is defined to 256
      	int res = bpf_probe_read_str(buf, sizeof(buf), ctx->di);
      
      	/* consume buf, for example push it to userspace via
      	 * bpf_perf_event_output(), but this time we can use
      	 * res (the string length) as event size, after checking
      	 * its boundaries.
      	 */
      }
      
      Another useful use case is when parsing individual process arguments or
      individual environment variables navigating current->mm->arg_start and
      current->mm->env_start: using this helper and the return value, one can
      quickly iterate at the right offset of the memory area.
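
A hedged sketch of that argument-walking use case (loop bound, sizes, attach
point, and reading the mm offsets via bpf_probe_read and bpf_get_current_task
are all assumptions here, not part of the patch):

#include <linux/sched.h>
#include <linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

#define MAX_ARGS 8
#define ARG_LEN  64

SEC("kprobe/sys_getpid")	/* illustrative attach point */
int dump_args(struct pt_regs *ctx)
{
	struct task_struct *task = (void *)bpf_get_current_task();
	struct mm_struct *mm = NULL;
	unsigned long p = 0;
	char arg[ARG_LEN];
	int i;

	bpf_probe_read(&mm, sizeof(mm), &task->mm);
	if (!mm)
		return 0;
	bpf_probe_read(&p, sizeof(p), &mm->arg_start);

#pragma unroll
	for (i = 0; i < MAX_ARGS; i++) {
		int len = bpf_probe_read_str(arg, sizeof(arg), (void *)p);

		if (len <= 0)
			break;
		/* consume arg[]; len includes the trailing NUL */
		p += len;	/* jump to the next argument string */
	}
	return 0;
}

char _license[] SEC("license") = "GPL";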
      
      The code changes simply leverage the already existent
      strncpy_from_unsafe() kernel function, which is safe to be called from a
      bpf program as it is used in bpf_trace_printk().
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
      a5e8c070
19. 17 January 2017, 1 commit
• bpf, trace: make ctx access checks more robust · 2d071c64
  Authored by Daniel Borkmann
      Make sure that ctx cannot potentially be accessed oob by asserting
      explicitly that ctx access size into pt_regs for BPF_PROG_TYPE_KPROBE
programs must be within limits. In case some 32-bit archs have a pt_regs
size that is not a multiple of 8, a BPF_DW access could otherwise cause
such an out-of-bounds access.
      
      BPF_PROG_TYPE_KPROBE progs don't have a ctx conversion function since
      there's no extra mapping needed. kprobe_prog_is_valid_access() didn't
      enforce sizeof(long) as the only allowed access size, since LLVM can
      generate non BPF_W/BPF_DW access to regs from time to time.
      
      For BPF_PROG_TYPE_TRACEPOINT we don't have a ctx conversion either, so
      add a BUILD_BUG_ON() check to make sure that BPF_DW access will not be
      a similar issue in future (ctx works on event buffer as opposed to
      pt_regs there).
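
The resulting check for kprobe programs is roughly of this shape (a sketch; the
authoritative version lives in kernel/trace/bpf_trace.c):

/* Sketch: only allow properly aligned reads fully contained within
 * struct pt_regs, so that e.g. a BPF_DW read of a trailing 4-byte member
 * on a 32-bit arch cannot spill past the end of the ctx.
 */
static bool kprobe_ctx_access_ok(int off, int size, enum bpf_access_type type)
{
	if (type != BPF_READ)
		return false;
	if (off < 0 || off >= sizeof(struct pt_regs))
		return false;
	if (off % size != 0)
		return false;
	if (off + size > sizeof(struct pt_regs))
		return false;
	return true;
}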
      
      Fixes: 2541517c ("tracing, perf: Implement BPF programs attached to kprobes")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
      2d071c64
20. 12 January 2017, 1 commit
• bpf: pass original insn directly to convert_ctx_access · 6b8cc1d1
  Authored by Daniel Borkmann
      Currently, when calling convert_ctx_access() callback for the various
      program types, we pass in insn->dst_reg, insn->src_reg, insn->off from
      the original instruction. This information is needed to rewrite the
      instruction that is based on the user ctx structure into a kernel
      representation for the ctx. As we'd like to allow access size beyond
      just BPF_W, we'd need also insn->code for that in order to decode the
original access size. Given that, let's just pass insn directly to the
      convert_ctx_access() callback and work on that to not clutter the
      callback with even more arguments we need to pass when everything is
already contained in insn. So let's go through that once; no functional
change.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
      6b8cc1d1
21. 10 January 2017, 1 commit
22. 23 October 2016, 1 commit
23. 10 September 2016, 2 commits
• bpf: add BPF_CALL_x macros for declaring helpers · f3694e00
  Authored by Daniel Borkmann
      This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions
      to use them, in a similar fashion like we do with SYSCALL_DEFINE<n>() macros
      that are used today. Motivation for this is to hide all the register handling
      and all necessary casts from the user, so that it is done automatically in the
      background when adding a BPF_CALL_<n>() call.
      
      This makes current helpers easier to review, eases to write future helpers,
      avoids getting the casting mess wrong, and allows for extending all helpers at
once (e.g. build time checks, etc). It also helps to detect more easily in
code reviews that unused registers are not instrumented in the code by accident,
      breaking compatibility with existing programs.
      
      BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with some
      fundamental differences, for example, for generating the actual helper function
      that carries all u64 regs, we need to fill unused regs, so that we always end up
      with 5 u64 regs as an argument.
      
I reviewed the generated BPF_CALL_<n>() variants for 0-5 arguments in the .i
results and they all look as expected. No sparse issue spotted. We let this also sit for a
      few days with Fengguang's kbuild test robot, and there were no issues seen. On
      s390, it barked on the "uses dynamic stack allocation" notice, which is an old
      one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
      to the call wrapper, just telling that the perf raw record/frag sits on stack
      (gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
      and they were fine as well. All eBPF helpers are now converted to use these
      macros, getting rid of a good chunk of all the raw castings.
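
As an example of the resulting shape (paraphrased, kernel-context sketch), a
three-argument helper is written with typed parameters and BPF_CALL_3()
generates the five-u64-register wrapper and the casts behind the scenes:

BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
{
	int ret = probe_kernel_read(dst, unsafe_ptr, size);

	/* on failure, clear the destination so no uninitialized
	 * stack data leaks into the program
	 */
	if (unlikely(ret < 0))
		memset(dst, 0, size);
	return ret;
}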
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
      f3694e00
• bpf: add BPF_SIZEOF and BPF_FIELD_SIZEOF macros · f035a515
  Authored by Daniel Borkmann
Add BPF_SIZEOF() and BPF_FIELD_SIZEOF() macros to improve code that
otherwise often results in overly long bytes_to_bpf_size(sizeof())
and bytes_to_bpf_size(FIELD_SIZEOF()) lines; place these into a macro
      helper instead. Moreover, we currently have a BUILD_BUG_ON(BPF_FIELD_SIZEOF())
      check in convert_bpf_extensions(), but we should rather make that generic
      as well and add a BUILD_BUG_ON() test in all BPF_SIZEOF()/BPF_FIELD_SIZEOF()
      users to detect any rewriter size issues at compile time. Note, there are
      currently none, but we want to assert that it stays this way.
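
The helpers are small statement-expression macros roughly along these lines
(sketch; see include/linux/filter.h for the authoritative definitions):

/* Map a C type/field size to a BPF access size (BPF_B/H/W/DW) and fail
 * the build if the size cannot be represented.
 */
#define BPF_SIZEOF(type)						\
	({								\
		const int __size = bytes_to_bpf_size(sizeof(type));	\
		BUILD_BUG_ON(__size < 0);				\
		__size;							\
	})

#define BPF_FIELD_SIZEOF(type, field)					\
	({								\
		const int __size = bytes_to_bpf_size(FIELD_SIZEOF(type, field)); \
		BUILD_BUG_ON(__size < 0);				\
		__size;							\
	})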
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
      f035a515
24. 03 September 2016, 1 commit
• bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type · 0515e599
  Authored by Alexei Starovoitov
      Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
      HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
respectively, in uapi/linux/perf_event.h)
      
      The program visible context meta structure is
      struct bpf_perf_event_data {
          struct pt_regs regs;
           __u64 sample_period;
      };
      which is accessible directly from the program:
      int bpf_prog(struct bpf_perf_event_data *ctx)
      {
        ... ctx->sample_period ...
        ... ctx->regs.ip ...
      }
      
      The bpf verifier rewrites the accesses into kernel internal
      struct bpf_perf_event_data_kern which allows changing
      struct perf_sample_data without affecting bpf programs.
      New fields can be added to the end of struct bpf_perf_event_data
      in the future.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
      0515e599
25. 13 August 2016, 2 commits
26. 26 July 2016, 1 commit
• bpf: Add bpf_probe_write_user BPF helper to be called in tracers · 96ae5227
  Authored by Sargun Dhillon
      This allows user memory to be written to during the course of a kprobe.
      It shouldn't be used to implement any kind of security mechanism
      because of TOC-TOU attacks, but rather to debug, divert, and
      manipulate execution of semi-cooperative processes.
      
      Although it uses probe_kernel_write, we limit the address space
      the probe can write into by checking the space with access_ok.
      We do this as opposed to calling copy_to_user directly, in order
to avoid sleeping. In addition, we ensure the thread's current fs
/ segment is USER_DS and that the thread is neither exiting nor a kernel thread.
      
Given this feature is meant for experiments, and it has a risk of
crashing the system and running programs, we print a warning,
along with the pid and process name, when a proglet that attempts
to use this helper is installed.
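
A hedged usage sketch from the bpf program side (attach point and the choice of
user pointer are assumptions; bpf_probe_write_user is assumed to be declared in
bpf_helpers.h):

#include <linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("kprobe/sys_write")		/* illustrative attach point */
int poke_user_buf(struct pt_regs *ctx)
{
	char patch[] = "bpf!";
	void *user_buf = (void *)PT_REGS_PARM2(ctx);	/* write(2) buffer */

	/* Overwrite the start of the caller's buffer; only succeeds for
	 * addresses that pass the access_ok() check in the current task.
	 */
	bpf_probe_write_user(user_buf, patch, sizeof(patch));
	return 0;
}

char _license[] SEC("license") = "GPL";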
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
      96ae5227
27. 20 July 2016, 1 commit
28. 16 July 2016, 3 commits
• bpf: avoid stack copy and use skb ctx for event output · 555c8a86
  Authored by Daniel Borkmann
      This work addresses a couple of issues bpf_skb_event_output()
      helper currently has: i) We need two copies instead of just a
      single one for the skb data when it should be part of a sample.
      The data can be non-linear and thus needs to be extracted via
      bpf_skb_load_bytes() helper first, and then copied once again
      into the ring buffer slot. ii) Since bpf_skb_load_bytes()
      currently needs to be used first, the helper needs to see a
      constant size on the passed stack buffer to make sure BPF
      verifier can do sanity checks on it during verification time.
      Thus, just passing skb->len (or any other non-constant value)
      wouldn't work, but changing bpf_skb_load_bytes() is also not
      the proper solution, since the two copies are generally still
      needed. iii) bpf_skb_load_bytes() is just for rather small
      buffers like headers, since they need to sit on the limited
      BPF stack anyway. Instead of working around in bpf_skb_load_bytes(),
      this work improves the bpf_skb_event_output() helper to address
      all 3 at once.
      
      We can make use of the passed in skb context that we have in
      the helper anyway, and use some of the reserved flag bits as
      a length argument. The helper will use the new __output_custom()
      facility from perf side with bpf_skb_copy() as callback helper
      to walk and extract the data. It will pass the data for setup
      to bpf_event_output(), which generates and pushes the raw record
      with an additional frag part. The linear data used in the first
      frag of the record serves as programmatically defined meta data
      passed along with the appended sample.
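
From the program side, usage then looks roughly like the sketch below (map name,
meta layout and the appended length are assumptions): the lower bits of the flags
still select the CPU, while the upper 32 bits carry how many bytes of skb payload
to append after the meta data.

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") events = {
	.type        = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size    = sizeof(int),
	.value_size  = sizeof(__u32),
	.max_entries = 64,
};

struct meta {
	__u32 ifindex;
	__u32 pkt_len;
};

SEC("classifier")
int push_skb_sample(struct __sk_buff *skb)
{
	__u64 skb_bytes = skb->len < 256 ? skb->len : 256;
	struct meta m = {
		.ifindex = skb->ifindex,
		.pkt_len = skb->len,
	};

	/* upper 32 bits of flags: number of skb bytes perf should append */
	bpf_perf_event_output(skb, &events,
			      (skb_bytes << 32) | BPF_F_CURRENT_CPU,
			      &m, sizeof(m));
	return 0;	/* TC_ACT_OK */
}

char _license[] SEC("license") = "GPL";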
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
      555c8a86
• bpf, perf: split bpf_perf_event_output · 8e7a3920
  Authored by Daniel Borkmann
      Split the bpf_perf_event_output() helper as a preparation into
      two parts. The new bpf_perf_event_output() will prepare the raw
      record itself and test for unknown flags from BPF trace context,
      where the __bpf_perf_event_output() does the core work. The
      latter will be reused later on from bpf_event_output() directly.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
      8e7a3920
• perf, events: add non-linear data support for raw records · 7e3f977e
  Authored by Daniel Borkmann
      This patch adds support for non-linear data on raw records. It
      extends raw records to have one or multiple fragments that will
      be written linearly into the ring slot, where each fragment can
      optionally have a custom callback handler to walk and extract
      complex, possibly non-linear data.
      
      If a callback handler is provided for a fragment, then the new
      __output_custom() will be used instead of __output_copy() for
      the perf_output_sample() part. perf_prepare_sample() does all
      the size calculation only once, so perf_output_sample() doesn't
      need to redo the same work anymore, meaning real_size and padding
      will be cached in the raw record. The raw record becomes 32 bytes
      in size without holes; to not increase it further and to avoid
doing unnecessary recalculations in the fast path, we can reuse the
next pointer of the last fragment; the idea here is borrowed from
ZERO_OR_NULL_PTR(), which should keep the perf_output_sample()
      path for PERF_SAMPLE_RAW minimal.
      
      This facility is needed for BPF's event output helper as a first
      user that will, in a follow-up, add an additional perf_raw_frag
      to its perf_raw_record in order to be able to more efficiently
      dump skb context after a linear head meta data related to it.
      skbs can be non-linear and thus need a custom output function to
      dump buffers. Currently, the skb data needs to be copied twice;
      with the help of __output_custom() this work only needs to be
      done once. Future users could be things like XDP/BPF programs
      that work on different context though and would thus also have
      a different callback function.
      
      The few users of raw records are adapted to initialize their frag
      data from the raw record itself, no change in behavior for them.
      The code is based upon a PoC diff provided by Peter Zijlstra [1].
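
The raw record layout after this change is roughly (sketch; see
include/linux/perf_event.h for the authoritative definition):

/* One fragment of a raw record; 'copy', when set, is the custom callback
 * used via __output_custom() to walk possibly non-linear data.
 */
struct perf_raw_frag {
	union {
		struct perf_raw_frag	*next;
		unsigned long		pad;
	};
	perf_copy_f			copy;
	void				*data;
	u32				size;
} __packed;

struct perf_raw_record {
	struct perf_raw_frag		frag;
	u32				size;	/* total size of all fragments */
};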
      
  [1] http://thread.gmane.org/gmane.linux.network/421294
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
      7e3f977e
29. 09 July 2016, 1 commit