1. 24 2月, 2018 4 次提交
  2. 27 1月, 2018 1 次提交
  3. 20 1月, 2018 2 次提交
  4. 18 12月, 2017 2 次提交
    • A
      bpf: x64: add JIT support for multi-function programs · 1c2a088a
      Alexei Starovoitov 提交于
      Typical JIT does several passes over bpf instructions to
      compute total size and relative offsets of jumps and calls.
      With multitple bpf functions calling each other all relative calls
      will have invalid offsets intially therefore we need to additional
      last pass over the program to emit calls with correct offsets.
      For example in case of three bpf functions:
      main:
        call foo
        call bpf_map_lookup
        exit
      foo:
        call bar
        exit
      bar:
        exit
      
      We will call bpf_int_jit_compile() indepedently for main(), foo() and bar()
      x64 JIT typically does 4-5 passes to converge.
      After these initial passes the image for these 3 functions
      will be good except call targets, since start addresses of
      foo() and bar() are unknown when we were JITing main()
      (note that call bpf_map_lookup will be resolved properly
      during initial passes).
      Once start addresses of 3 functions are known we patch
      call_insn->imm to point to right functions and call
      bpf_int_jit_compile() again which needs only one pass.
      Additional safety checks are done to make sure this
      last pass doesn't produce image that is larger or smaller
      than previous pass.
      
      When constant blinding is on it's applied to all functions
      at the first pass, since doing it once again at the last
      pass can change size of the JITed code.
      
      Tested on x64 and arm64 hw with JIT on/off, blinding on/off.
      x64 jits bpf-to-bpf calls correctly while arm64 falls back to interpreter.
      All other JITs that support normal BPF_CALL will behave the same way
      since bpf-to-bpf call is equivalent to bpf-to-kernel call from
      JITs point of view.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      1c2a088a
    • A
      bpf: fix net.core.bpf_jit_enable race · 60b58afc
      Alexei Starovoitov 提交于
      global bpf_jit_enable variable is tested multiple times in JITs,
      blinding and verifier core. The malicious root can try to toggle
      it while loading the programs. This race condition was accounted
      for and there should be no issues, but it's safer to avoid
      this race condition.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      60b58afc
  5. 04 10月, 2017 1 次提交
  6. 01 9月, 2017 1 次提交
  7. 10 8月, 2017 1 次提交
  8. 30 6月, 2017 1 次提交
  9. 07 6月, 2017 1 次提交
  10. 01 6月, 2017 3 次提交
  11. 09 5月, 2017 1 次提交
  12. 29 4月, 2017 1 次提交
  13. 22 2月, 2017 1 次提交
  14. 18 2月, 2017 2 次提交
    • D
      bpf: make jited programs visible in traces · 74451e66
      Daniel Borkmann 提交于
      Long standing issue with JITed programs is that stack traces from
      function tracing check whether a given address is kernel code
      through {__,}kernel_text_address(), which checks for code in core
      kernel, modules and dynamically allocated ftrace trampolines. But
      what is still missing is BPF JITed programs (interpreted programs
      are not an issue as __bpf_prog_run() will be attributed to them),
      thus when a stack trace is triggered, the code walking the stack
      won't see any of the JITed ones. The same for address correlation
      done from user space via reading /proc/kallsyms. This is read by
      tools like perf, but the latter is also useful for permanent live
      tracing with eBPF itself in combination with stack maps when other
      eBPF types are part of the callchain. See offwaketime example on
      dumping stack from a map.
      
      This work tries to tackle that issue by making the addresses and
      symbols known to the kernel. The lookup from *kernel_text_address()
      is implemented through a latched RB tree that can be read under
      RCU in fast-path that is also shared for symbol/size/offset lookup
      for a specific given address in kallsyms. The slow-path iteration
      through all symbols in the seq file done via RCU list, which holds
      a tiny fraction of all exported ksyms, usually below 0.1 percent.
      Function symbols are exported as bpf_prog_<tag>, in order to aide
      debugging and attribution. This facility is currently enabled for
      root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
      is active in any mode. The rationale behind this is that still a lot
      of systems ship with world read permissions on kallsyms thus addresses
      should not get suddenly exposed for them. If that situation gets
      much better in future, we always have the option to change the
      default on this. Likewise, unprivileged programs are not allowed
      to add entries there either, but that is less of a concern as most
      such programs types relevant in this context are for root-only anyway.
      If enabled, call graphs and stack traces will then show a correct
      attribution; one example is illustrated below, where the trace is
      now visible in tooling such as perf script --kallsyms=/proc/kallsyms
      and friends.
      
      Before:
      
        7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
      
      After:
      
        7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
        [...]
        7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74451e66
    • D
      bpf: remove stubs for cBPF from arch code · 9383191d
      Daniel Borkmann 提交于
      Remove the dummy bpf_jit_compile() stubs for eBPF JITs and make
      that a single __weak function in the core that can be overridden
      similarly to the eBPF one. Also remove stale pr_err() mentions
      of bpf_jit_compile.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9383191d
  15. 09 1月, 2017 1 次提交
  16. 09 12月, 2016 1 次提交
  17. 17 5月, 2016 3 次提交
  18. 24 2月, 2016 2 次提交
    • J
      x86/asm/bpf: Create stack frames in bpf_jit.S · d21001cc
      Josh Poimboeuf 提交于
      bpf_jit.S has several callable non-leaf functions which don't honor
      CONFIG_FRAME_POINTER, which can result in bad stack traces.
      
      Create a stack frame before the call instructions when
      CONFIG_FRAME_POINTER is enabled.
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Chris J Arges <chris.j.arges@canonical.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Cc: Pedro Alves <palves@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Link: http://lkml.kernel.org/r/fa4c41976b438b51954cb8021f06bceb1d1d66cc.1453405861.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d21001cc
    • J
      x86/asm/bpf: Annotate callable functions · 2d8fe90a
      Josh Poimboeuf 提交于
      bpf_jit.S has several functions which can be called from C code.  Give
      them proper ELF annotations.
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Chris J Arges <chris.j.arges@canonical.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Cc: Pedro Alves <palves@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Link: http://lkml.kernel.org/r/bbe1de0c299fecd4fc9a1766bae8be2647bedb01.1453405861.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2d8fe90a
  19. 19 12月, 2015 2 次提交
    • D
      bpf, x86: detect/optimize loading 0 immediates · 606c88a8
      Daniel Borkmann 提交于
      When sometimes structs or variables need to be initialized/'memset' to 0 in
      an eBPF C program, the x86 BPF JIT converts this to use immediates. We can
      however save a couple of bytes (f.e. even up to 7 bytes on a single emmission
      of BPF_LD | BPF_IMM | BPF_DW) in the image by detecting such case and use xor
      on the dst register instead.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      606c88a8
    • D
      bpf: move clearing of A/X into classic to eBPF migration prologue · 8b614aeb
      Daniel Borkmann 提交于
      Back in the days where eBPF (or back then "internal BPF" ;->) was not
      exposed to user space, and only the classic BPF programs internally
      translated into eBPF programs, we missed the fact that for classic BPF
      A and X needed to be cleared. It was fixed back then via 83d5b7ef
      ("net: filter: initialize A and X registers"), and thus classic BPF
      specifics were added to the eBPF interpreter core to work around it.
      
      This added some confusion for JIT developers later on that take the
      eBPF interpreter code as an example for deriving their JIT. F.e. in
      f75298f5 ("s390/bpf: clear correct BPF accumulator register"), at
      least X could leak stack memory. Furthermore, since this is only needed
      for classic BPF translations and not for eBPF (verifier takes care
      that read access to regs cannot be done uninitialized), more complexity
      is added to JITs as they need to determine whether they deal with
      migrations or native eBPF where they can just omit clearing A/X in
      their prologue and thus reduce image size a bit, see f.e. cde66c2d
      ("s390/bpf: Only clear A and X for converted BPF programs"). In other
      cases (x86, arm64), A and X is being cleared in the prologue also for
      eBPF case, which is unnecessary.
      
      Lets move this into the BPF migration in bpf_convert_filter() where it
      actually belongs as long as the number of eBPF JITs are still few. It
      can thus be done generically; allowing us to remove the quirk from
      __bpf_prog_run() and to slightly reduce JIT image size in case of eBPF,
      while reducing code duplication on this matter in current(/future) eBPF
      JITs.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
      Tested-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Zi Shen Lim <zlim.lnx@gmail.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Acked-by: NYang Shi <yang.shi@linaro.org>
      Acked-by: NZi Shen Lim <zlim.lnx@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b614aeb
  20. 03 10月, 2015 1 次提交
  21. 10 8月, 2015 1 次提交
  22. 31 7月, 2015 1 次提交
  23. 30 7月, 2015 1 次提交
    • D
      ebpf, x86: fix general protection fault when tail call is invoked · 2482abb9
      Daniel Borkmann 提交于
      With eBPF JIT compiler enabled on x86_64, I was able to reliably trigger
      the following general protection fault out of an eBPF program with a simple
      tail call, f.e. tracex5 (or a stripped down version of it):
      
        [  927.097918] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
        [...]
        [  927.100870] task: ffff8801f228b780 ti: ffff880016a64000 task.ti: ffff880016a64000
        [  927.102096] RIP: 0010:[<ffffffffa002440d>]  [<ffffffffa002440d>] 0xffffffffa002440d
        [  927.103390] RSP: 0018:ffff880016a67a68  EFLAGS: 00010006
        [  927.104683] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: 0000000000000001
        [  927.105921] RDX: 0000000000000000 RSI: ffff88014e438000 RDI: ffff880016a67e00
        [  927.107137] RBP: ffff880016a67c90 R08: 0000000000000000 R09: 0000000000000001
        [  927.108351] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880016a67e00
        [  927.109567] R13: 0000000000000000 R14: ffff88026500e460 R15: ffff880220a81520
        [  927.110787] FS:  00007fe7d5c1f740(0000) GS:ffff880265000000(0000) knlGS:0000000000000000
        [  927.112021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  927.113255] CR2: 0000003e7bbb91a0 CR3: 000000006e04b000 CR4: 00000000001407e0
        [  927.114500] Stack:
        [  927.115737]  ffffc90008cdb000 ffff880016a67e00 ffff88026500e460 ffff880220a81520
        [  927.117005]  0000000100000000 000000000000001b ffff880016a67aa8 ffffffff8106c548
        [  927.118276]  00007ffcdaf22e58 0000000000000000 0000000000000000 ffff880016a67ff0
        [  927.119543] Call Trace:
        [  927.120797]  [<ffffffff8106c548>] ? lookup_address+0x28/0x30
        [  927.122058]  [<ffffffff8113d176>] ? __module_text_address+0x16/0x70
        [  927.123314]  [<ffffffff8117bf0e>] ? is_ftrace_trampoline+0x3e/0x70
        [  927.124562]  [<ffffffff810c1a0f>] ? __kernel_text_address+0x5f/0x80
        [  927.125806]  [<ffffffff8102086f>] ? print_context_stack+0x7f/0xf0
        [  927.127033]  [<ffffffff810f7852>] ? __lock_acquire+0x572/0x2050
        [  927.128254]  [<ffffffff810f7852>] ? __lock_acquire+0x572/0x2050
        [  927.129461]  [<ffffffff8119edfa>] ? trace_call_bpf+0x3a/0x140
        [  927.130654]  [<ffffffff8119ee4a>] trace_call_bpf+0x8a/0x140
        [  927.131837]  [<ffffffff8119edfa>] ? trace_call_bpf+0x3a/0x140
        [  927.133015]  [<ffffffff8119f008>] kprobe_perf_func+0x28/0x220
        [  927.134195]  [<ffffffff811a1668>] kprobe_dispatcher+0x38/0x60
        [  927.135367]  [<ffffffff81174b91>] ? seccomp_phase1+0x1/0x230
        [  927.136523]  [<ffffffff81061400>] kprobe_ftrace_handler+0xf0/0x150
        [  927.137666]  [<ffffffff81174b95>] ? seccomp_phase1+0x5/0x230
        [  927.138802]  [<ffffffff8117950c>] ftrace_ops_recurs_func+0x5c/0xb0
        [  927.139934]  [<ffffffffa022b0d5>] 0xffffffffa022b0d5
        [  927.141066]  [<ffffffff81174b91>] ? seccomp_phase1+0x1/0x230
        [  927.142199]  [<ffffffff81174b95>] seccomp_phase1+0x5/0x230
        [  927.143323]  [<ffffffff8102c0a4>] syscall_trace_enter_phase1+0xc4/0x150
        [  927.144450]  [<ffffffff81174b95>] ? seccomp_phase1+0x5/0x230
        [  927.145572]  [<ffffffff8102c0a4>] ? syscall_trace_enter_phase1+0xc4/0x150
        [  927.146666]  [<ffffffff817f9a9f>] tracesys+0xd/0x44
        [  927.147723] Code: 48 8b 46 10 48 39 d0 76 2c 8b 85 fc fd ff ff 83 f8 20 77 21 83
                             c0 01 89 85 fc fd ff ff 48 8d 44 d6 80 48 8b 00 48 83 f8 00 74
                             0a <48> 8b 40 20 48 83 c0 33 ff e0 48 89 d8 48 8b 9d d8 fd ff
                             ff 4c
        [  927.150046] RIP  [<ffffffffa002440d>] 0xffffffffa002440d
      
      The code section with the instructions that traps points into the eBPF JIT
      image of the root program (the one invoking the tail call instruction).
      
      Using bpf_jit_disasm -o on the eBPF root program image:
      
        [...]
        4e:   mov    -0x204(%rbp),%eax
              8b 85 fc fd ff ff
        54:   cmp    $0x20,%eax               <--- if (tail_call_cnt > MAX_TAIL_CALL_CNT)
              83 f8 20
        57:   ja     0x000000000000007a
              77 21
        59:   add    $0x1,%eax                <--- tail_call_cnt++
              83 c0 01
        5c:   mov    %eax,-0x204(%rbp)
              89 85 fc fd ff ff
        62:   lea    -0x80(%rsi,%rdx,8),%rax  <--- prog = array->prog[index]
              48 8d 44 d6 80
        67:   mov    (%rax),%rax
              48 8b 00
        6a:   cmp    $0x0,%rax                <--- check for NULL
              48 83 f8 00
        6e:   je     0x000000000000007a
              74 0a
        70:   mov    0x20(%rax),%rax          <--- GPF triggered here! fetch of bpf_func
              48 8b 40 20                              [ matches <48> 8b 40 20 ... from above ]
        74:   add    $0x33,%rax               <--- prologue skip of new prog
              48 83 c0 33
        78:   jmpq   *%rax                    <--- jump to new prog insns
              ff e0
        [...]
      
      The problem is that rax has 5a5a5a5a5a5a5a5a, which suggests a tail call
      jump to map slot 0 is pointing to a poisoned page. The issue is the following:
      
      lea instruction has a wrong offset, i.e. it should be ...
      
        lea    0x80(%rsi,%rdx,8),%rax
      
      ... but it actually seems to be ...
      
        lea   -0x80(%rsi,%rdx,8),%rax
      
      ... where 0x80 is offsetof(struct bpf_array, prog), thus the offset needs
      to be positive instead of negative. Disassembling the interpreter, we btw
      similarly do:
      
        [...]
        c88:  lea     0x80(%rax,%rdx,8),%rax  <--- prog = array->prog[index]
              48 8d 84 d0 80 00 00 00
        c90:  add     $0x1,%r13d
              41 83 c5 01
        c94:  mov     (%rax),%rax
              48 8b 00
        [...]
      
      Now the other interesting fact is that this panic triggers only when things
      like CONFIG_LOCKDEP are being used. In that case offsetof(struct bpf_array,
      prog) starts at offset 0x80 and in non-CONFIG_LOCKDEP case at offset 0x50.
      Reason is that the work_struct inside struct bpf_map grows by 48 bytes in my
      case due to the lockdep_map member (which also has CONFIG_LOCK_STAT enabled
      members).
      
      Changing the emitter to always use the 4 byte displacement in the lea
      instruction fixes the panic on my side. It increases the tail call instruction
      emission by 3 more byte, but it should cover us from various combinations
      (and perhaps other future increases on related structures).
      
      After patch, disassembly:
      
        [...]
        9e:   lea    0x80(%rsi,%rdx,8),%rax   <--- CONFIG_LOCKDEP/CONFIG_LOCK_STAT
              48 8d 84 d6 80 00 00 00
        a6:   mov    (%rax),%rax
              48 8b 00
        [...]
      
        [...]
        9e:   lea    0x50(%rsi,%rdx,8),%rax   <--- No CONFIG_LOCKDEP
              48 8d 84 d6 50 00 00 00
        a6:   mov    (%rax),%rax
              48 8b 00
        [...]
      
      Fixes: b52f00e6 ("x86: bpf_jit: implement bpf_tail_call() helper")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2482abb9
  24. 21 7月, 2015 1 次提交
    • A
      bpf: introduce bpf_skb_vlan_push/pop() helpers · 4e10df9a
      Alexei Starovoitov 提交于
      Allow eBPF programs attached to TC qdiscs call skb_vlan_push/pop via
      helper functions. These functions may change skb->data/hlen which are
      cached by some JITs to improve performance of ld_abs/ld_ind instructions.
      Therefore JITs need to recognize bpf_skb_vlan_push/pop() calls,
      re-compute header len and re-cache skb->data/hlen back into cpu registers.
      Note, skb->data/hlen are not directly accessible from the programs,
      so any changes to skb->data done either by these helpers or by other
      TC actions are safe.
      
      eBPF JIT supported by three architectures:
      - arm64 JIT is using bpf_load_pointer() without caching, so it's ok as-is.
      - x64 JIT re-caches skb->data/hlen unconditionally after vlan_push/pop calls
        (experiments showed that conditional re-caching is slower).
      - s390 JIT falls back to interpreter for now when bpf_skb_vlan_push() is present
        in the program (re-caching is tbd).
      
      These helpers allow more scalable handling of vlan from the programs.
      Instead of creating thousands of vlan netdevs on top of eth0 and attaching
      TC+ingress+bpf to all of them, the program can be attached to eth0 directly
      and manipulate vlans as necessary.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4e10df9a
  25. 02 6月, 2015 1 次提交
    • I
      x86/debug: Remove perpetually broken, unmaintainable dwarf annotations · 131484c8
      Ingo Molnar 提交于
      So the dwarf2 annotations in low level assembly code have
      become an increasing hindrance: unreadable, messy macros
      mixed into some of the most security sensitive code paths
      of the Linux kernel.
      
      These debug info annotations don't even buy the upstream
      kernel anything: dwarf driven stack unwinding has caused
      problems in the past so it's out of tree, and the upstream
      kernel only uses the much more robust framepointers based
      stack unwinding method.
      
      In addition to that there's a steady, slow bitrot going
      on with these annotations, requiring frequent fixups.
      There's no tooling and no functionality upstream that
      keeps it correct.
      
      So burn down the sick forest, allowing new, healthier growth:
      
         27 files changed, 350 insertions(+), 1101 deletions(-)
      
      Someone who has the willingness and time to do this
      properly can attempt to reintroduce dwarf debuginfo in x86
      assembly code plus dwarf unwinding from first principles,
      with the following conditions:
      
       - it should be maximally readable, and maximally low-key to
         'ordinary' code reading and maintenance.
      
       - find a build time method to insert dwarf annotations
         automatically in the most common cases, for pop/push
         instructions that manipulate the stack pointer. This could
         be done for example via a preprocessing step that just
         looks for common patterns - plus special annotations for
         the few cases where we want to depart from the default.
         We have hundreds of CFI annotations, so automating most of
         that makes sense.
      
       - it should come with build tooling checks that ensure that
         CFI annotations are sensible. We've seen such efforts from
         the framepointer side, and there's no reason it couldn't be
         done on the dwarf side.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      131484c8
  26. 25 5月, 2015 1 次提交
  27. 22 5月, 2015 1 次提交
    • A
      x86: bpf_jit: implement bpf_tail_call() helper · b52f00e6
      Alexei Starovoitov 提交于
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
      In this implementation x64 JIT bypasses stack unwind and jumps into the
      callee program after prologue, so the callee program reuses the same stack.
      
      The logic can be roughly expressed in C like:
      
      u32 tail_call_cnt;
      
      void *jumptable[2] = { &&label1, &&label2 };
      
      int bpf_prog1(void *ctx)
      {
      label1:
          ...
      }
      
      int bpf_prog2(void *ctx)
      {
      label2:
          ...
      }
      
      int bpf_prog1(void *ctx)
      {
          ...
          if (tail_call_cnt++ < MAX_TAIL_CALL_CNT)
              goto *jumptable[index]; ... and pass my 'ctx' to callee ...
      
          ... fall through if no entry in jumptable ...
      }
      
      Note that 'skip current program epilogue and next program prologue' is
      an optimization. Other JITs don't have to do it the same way.
      >From safety point of view it's valid as well, since programs always
      initialize the stack before use, so any residue in the stack left by
      the current program is not going be read. The same verifier checks are
      done for the calls from the kernel into all bpf programs.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b52f00e6
  28. 13 5月, 2015 1 次提交